
allow integer types to be any range #3806

Open
andrewrk opened this issue Nov 29, 2019 · 40 comments
Labels
breaking Implementing this issue could cause existing code to no longer compile or have different behavior. proposal This issue suggests modifications. If it also has the "accepted" label then it is planned.
Milestone

Comments

@andrewrk
Member

andrewrk commented Nov 29, 2019

Zig already has ranged integer types, however the range is required to be a signed or unsigned power of 2. This proposal is for generalizing them further, to allow any arbitrary range.

comptime {
    assert(i32 == @Int(-1 << 31, 1 << 31));
    assert(u32 == @Int(0, 1 << 32));
    assert(u0 == @Int(0, 1));
    assert(noreturn == @Int(0, 0));
}

Let's consider some reasons to do this:

One common practice for C developers is to use -1 or MAX_UINT32 (and related) constants as an in-band indicator of metadata. For example, the stage1 compiler uses a size_t field to indicate the ABI size of a type, but the value SIZE_MAX is used to indicate that the size is not yet computed.

In Zig we want people to use Optionals for this, but there's a catch: the in-band special value uses less memory for the type. In Zig on 64-bit targets, @sizeOf(usize) == 8 and @sizeOf(?usize) == 16. That's a huge cost to pay for something that could take up 0 bits of information if you are willing to give up a single value inside the range of a usize.
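
A minimal check of the status quo described above (valid in current Zig, assuming a 64-bit target):

const std = @import("std");

comptime {
    // The optional doubles the stored size, because the null flag
    // needs its own ABI-aligned storage next to the usize.
    std.debug.assert(@sizeOf(usize) == 8);
    std.debug.assert(@sizeOf(?usize) == 16);
}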

With ranged integers, this could be made type-safe:

const AbiSize = @Int(0, (1 << usize.bit_count) - 1);
const MyType = struct {
    abi_size: ?AbiSize,
};
var my_type: MyType = undefined;

test "catching a bug" {
    var other_thing: usize = 1234;

    my_type.abi_size = other_thing; // error: expected @Int(0, 18446744073709551615), found usize
}

Now, not only do we have the Optionals feature of zig protecting against accidentally using a very large integer when it is supposed to indicate null, but we also have the compile error helping out with range checks. One can choose to deal with the larger ranged value by handling the possibility, and returning an error, or with @intCast, which inserts a handy safety check.
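
A hypothetical sketch of that @intCast path, reusing the AbiSize/MyType definitions above (proposed syntax, not current Zig):

fn setAbiSize(t: *MyType, size: usize) void {
    // The explicit cast narrows usize into the ranged type; in safe
    // build modes it would panic if size == maxInt(usize).
    t.abi_size = @as(AbiSize, @intCast(size));
}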

How about if there are 2 special values rather than 1?

const N = union(enum) {
    special1,
    special2,
    normal: @Int(0, (1 << 32) - 2),
};

Here, the size of N would be 4 bytes.


Let's consider another example, with enums.

Enums allow defining a set of possible values for a type:

const E = enum {
    one,
    two,
    three,
};

There are 3 possible values of this type, so Zig chooses to use u2 for the tag type. It will require 1 byte to represent it, wasting 6 bits. If you wrap it in an optional, that will be 16 bits to represent something that, according to information theory, requires only 2 bits. And Zig's hands are tied; because currently each field requires ABI alignment, each byte is necessary.
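
A quick check of the sizes described above (current Zig, which does not apply a niche optimization to plain enums; sizes assume typical targets):

const assert = @import("std").debug.assert;

comptime {
    assert(@bitSizeOf(E) == 2); // u2 tag
    assert(@sizeOf(E) == 1);    // storage rounds up to a whole byte
    assert(@sizeOf(?E) == 2);   // the null flag costs another byte
}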

If #3802 is accepted and implemented, and the is_null bit of optionals becomes align(0), then ?E can remain 1 byte, and ?E in a struct with align(0) will take up 3 bits.

However, consider if the enum was allowed to choose a ranged integer type. It would choose @Int(0, 3). Wrapped in an optional, it actually could choose to use the integer value 3 as the is_null bit. Then ?E in a struct will take up 2 bits.

Again, assuming #3802 is implemented, Zig would even be able to "flatten" several enums into the same integer:

const Mode = enum { // 2-bit tag type
    Debug,
    ReleaseSafe,
    ReleaseFast,
    ReleaseSmall
};
const Endian = enum { // 1-bit tag type
    big,
    little,
};
pub const AtomicOrder = enum { // 3-bit tag type
    Unordered,
    Monotonic,
    Acquire,
    Release,
    AcqRel,
    SeqCst,
};
pub const AtomicRmwOp = enum { // 4-bit tag type
    Xchg,
    Add,
    Sub,
    And,
    Nand,
    Or,
    Xor,
    Max,
    Min,
};
const MyFancyType = struct {
    mode: Mode align(0),
    endian: Endian align(0),
    atomic_order: AtomicOrder align(0),
    op: AtomicRmwOp align(0),
};

If you add up all the bits of the tag type, it comes out to 10, meaning that the size of MyFancyType would have to be 2 bytes. However, with ranged integers as tag types, zig would be able to flatten out all the enum tag values into one byte. In fact there are only 21 total tag types here, leaving room for 235 more total tags before MyFancyType would have to gain another byte of size.


This proposal would solve #747. Peer type resolution of comptime ints would produce a ranged integer:

export fn foo(b: bool) void {
    // master branch: error: cannot store runtime value in type 'comptime_int'
    const x = if (b) -10 else 100;
    // proposal: @typeOf(x) == @Int(-10, 101)
}

With optional pointers, Zig has an optimization to use the zero address as the null value. The allowzero property can be used to indicate that the address 0 is valid. This is effectively treating the address as a ranged integer type! This optimization for optional pointers could now be described in userland types:

const PointerAddress = @Int(1, 1 << usize.bit_count);
const Pointer = ?PointerAddress;

comptime {
    assert(@sizeOf(PointerAddress) == @sizeOf(usize));
    assert(@sizeOf(Pointer) == @sizeOf(usize));
}

One possible extension to this proposal would be to allow pointer types to override the address integer type. Rather than allowzero which is single purpose, they could do something like this:

comptime {
    assert(*addrtype(usize) i32 == *allowzero i32);
    assert(*addrtype(@Int(1, 1 << usize.bit_count)) i32 == *i32);
}

This would also introduce type-safety to using more than just 0x0 as a special pointer address, which is perfectly acceptable on most hosted operating systems and typically set up in freestanding environments as well. Typically, the entire first page of memory is unmapped, and the virtual address space is often limited to 48 bits, making @Int(os.page_size, 1 << 48) a good default address type for pointers on many targets! Combining this with the fact that pointers also have alignment bits to play with, this would give Zig's type system the ability to pack a lot of data into pointers which are annotated with align(0).


What about two's complement wrapping math operations? Two's complement only works on powers-of-two integer types. Wrapping math operations would not be allowed on non-power-of-two integer types. Compile error.
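
As a hypothetical illustration of that rule (proposed syntax):

const a: @Int(0, 100) = 50;
const b = a +% 1; // proposal: compile error, since @Int(0, 100) is not a power-of-two integer type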

@andrewrk andrewrk added the proposal This issue suggests modifications. If it also has the "accepted" label then it is planned. label Nov 29, 2019
@andrewrk andrewrk added this to the 0.7.0 milestone Nov 29, 2019
@daurnimator
Contributor

I'm worried this might result in a code explosion as different types are passed to a generic.

Idea: input to a generic could be "marked" for which properties are used.
e.g. if I have a pure-storage generic, like ArrayList, then the range of the members doesn't really matter for the implementation of that generic: the only property it cares about at runtime is the size of the type, the rest is only useful at comptime.

@Rocknest
Contributor

@andrewrk I cannot understand how you got the required size of the MyFancyType struct as 'less than one byte ... leaving room for 235 more tags'.

Since a struct is a product type, MyFancyType has a total of 4 * 2 * 6 * 9 = 432 possible states, requiring at least ⌈log₂(432)⌉ = 9 bits to store them.
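
The same check as a comptime sketch (current Zig, illustrative only):

comptime {
    const states = 4 * 2 * 6 * 9; // 432 possible combinations
    @import("std").debug.assert(states > 1 << 8);  // does not fit in 8 bits
    @import("std").debug.assert(states <= 1 << 9); // fits in 9 bits
}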

@andrewrk
Member Author

@Rocknest my math is wrong, thanks for the correction 👍

@emekoi
Contributor

emekoi commented Dec 1, 2019

wouldn't it be @Type(TypeInfo.Int{...}) where TypeInfo.Int looks like:

pub const Int = struct {
    is_signed: bool,
    bits: comptime_int,
    from: comptime_int,
    to: comptime_int,
};

because of the @Type builtin?

@andrewrk
Member Author

andrewrk commented Dec 1, 2019

Yes except only from and to are needed - the rest is redundant information. I'd call it start and end to hint that the upper bound is exclusive. Good point, I forgot that @Int is deprecated.

@Int(0, 1) == @Type(.{ .Int = .{ .start = 0, .end = 1}})

@Type is a bit more verbose though, for such a fundamental type as an integer. I'm not 100% sold on @Type yet.

@emekoi
Contributor

emekoi commented Dec 1, 2019

should this be marked a breaking proposal as well?

@andrewrk andrewrk added the breaking Implementing this issue could cause existing code to no longer compile or have different behavior. label Dec 1, 2019
@daurnimator
Contributor

daurnimator commented Dec 1, 2019

Yes except only from and to are needed - the rest is redundant information.

You still need the bits. e.g. I might have an API that takes a usize but only allows values between 1 and 4096

@andrewrk
Member Author

andrewrk commented Dec 1, 2019

  • 1 is >= 0, so it's unsigned
  • log2(4096) = 12, so it's 12 bits

You're thinking of a different, already planned issue, which is zig keeping track of additional information beyond the type, for expressions and const values, which will allow, for example, this test to pass:

test "runtime value hints" {
    var a: u64 = 99999;
    // `a` is runtime known because it is `var`.
    const b: u64 = a % 256;
    const c: u8 = b; // allowed without @intCast because `b` has a runtime value hint
}

@emekoi
Contributor

emekoi commented Dec 1, 2019

I don't think so. Take for example the WebSocket protocol: the opcodes are always 4 bits, but 0xB to 0xF are reserved for future use, so effectively they are not allowed.

@daurnimator
Contributor

daurnimator commented Dec 1, 2019

I don't think so. Take for example the WebSocket protocol: the opcodes are always 4 bits, but 0xB to 0xF are reserved for future use, so effectively they are not allowed.

That sounds like a usecase for a non-exhaustive enum:

enum(u4) {
    Continuation = 0,
    Text = 1,
    Binary = 2,
    Close = 8,
    Ping = 9,
    Pong = 10,
    _,
}
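
A usage sketch of the non-exhaustive enum in current Zig (assuming the enum above is bound to a name like Opcode):

fn isControl(op: Opcode) bool {
    return switch (op) {
        .Close, .Ping, .Pong => true,
        .Continuation, .Text, .Binary => false,
        // The '_' prong covers the reserved opcodes 0xB..0xF without naming them.
        _ => false,
    };
}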

@BarabasGitHub
Contributor

This is very nice and fancy, but I would like to note that it might have a negative effect on performance. This is definitely more efficient memory-wise, so for storing it (on disk) this is perfect. However, for passing it around in code in this format, it would mean that every time there is a calculation, bit fiddling comes into play to extract the value and put it back.

Just something to think about.

@andrewrk
Member Author

andrewrk commented Feb 16, 2020

However, for passing it around in code in this format, it would mean that every time there is a calculation, bit fiddling comes into play to extract the value and put it back.

This is not correct, except with align(0) fields (see #3802). Zig already has arbitrary sized integers. If you use i3, for example, @sizeOf(i3) == 1 because @alignOf(i3) == 1.
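
A quick check of that point (valid in current Zig):

const std = @import("std");

comptime {
    std.debug.assert(@bitSizeOf(i3) == 3);
    std.debug.assert(@sizeOf(i3) == 1);  // storage rounds up to a whole byte
    std.debug.assert(@alignOf(i3) == 1); // so ordinary loads/stores need no bit fiddling
}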

Also in many cases if you were to use align(0) fields, the smaller memory use on the cache lines would cause more performance gains than lost to the extra bit shifting.

@BarabasGitHub
Contributor

BarabasGitHub commented Feb 17, 2020

Yes, sorry I was thinking about align(0) cases. For the other cases this proposal only adds range checks, right?

Also in many cases if you were to use align(0) fields, the smaller memory use on the cache lines would cause more performance gains than lost to the extra bit shifting.

I'm a bit sceptical about this as I've had experience of the opposite. It probably depends a lot on the circumstances. Which is why I don't think a compiler should be sneaky and perform bit fiddling behind your back. However if it just applies to align(0), it is something which you ask for so it's fine.

Somehow I feel though that this is more something to do with optionals and other constructs which have more information than arbitrary numbers and enums, right?

It's more like in this example:

const N = union(enum) {
    special1,
    special2,
    normal: @Int(0, (1 << 32) - 2),
};

Where each case has a special number or range of numbers, such as:

const N = union(enum) {
    special1 = max32,
    special2 = max32 - 1,
    normal = 0..max32 - 2,
};

And I don't think it goes for your other example, where you have four (what seem to me independent) enums:

const MyFancyType = struct {
    mode: Mode align(0),
    endian: Endian align(0),
    atomic_order: AtomicOrder align(0),
    op: AtomicRmwOp align(0),
};

If you add up all the bits of the tag type, it comes out to 10, meaning that the size of MyFancyType would have to be 2 bytes. However, with ranged integers as tag types, zig would be able to flatten out all the enum tag values into one byte. In fact there are only 21 total tag types here, leaving room for 235 more total tags before MyFancyType would have to gain another byte of size.

Aren't there 4 * 2 * 6 * 9 = 432 possible combinations of those four enums? Which means you still need 9 bits (or two bytes)?

In general I do think it would be great if the optionals can work more generically. Not having to write all the -1 checks would be cool. Especially since sometimes people work with indices instead of pointers and the special value trick is quite common.

@kyle-github

kyle-github commented Feb 17, 2020

Note that integer ranges have a long history in some other languages like Pascal and Ada. There they are heavily used and can be used as ranges for arrays. This leads to some nice code where you can use ranged for loops without any possibility of going out of bounds.

It has been decades since I used Pascal or Ada in anger but even back then I seem to remember that most run time checks could be dropped as the compiler could prove that the use was safe. A very naive compiler would do the checks all the time, but existing compilers show that the possible overhead is quite low. As long as the programmer is aware (such as choosing different arithmetic with overflow wrapping) that this could have a cost, I think this is fine.

The driving case here: more compact representation of enums and optionals is a very strong one IMHO. Having just optionals use the same space as non-optional types would be a huge argument to get programmers to use them. The cache argument is a strong one too. A cache miss can cost 25-50 clock cycles (if you hit L2, worse if you fall all the way to RAM) and you can run a lot of bit twiddling code in that time.

@SpexGuy
Contributor

SpexGuy commented Feb 19, 2020

I like the core of this proposal, but there are a few things I'm worried about.

First, I feel like the optimizations are too implicit. You have to hope that you left the correct number of values open. If you do it wrong, there is no indication that you have messed up, you just have bigger types than you expected. As code changes over time, it's easy for additional values to be added to enums that are sensitive to this, which may break these optimizations without warning.
If I'm writing Zig and my goal is to remove the overhead of the optional, I would want a guarantee from the compiler that my code has received the optimization, or a compile error. The original suggestion is difficult to think about when optional optimization is your end goal, and it fails silently when the optimization can't be applied.

I'm worried that, because it's so implicit, general zig advice may become "put explicit bounds on every integer type you use, so that the compiler can optimize it". This could create a lot of additional cognitive load when reading and writing zig code. Does ?@Int(0, 262144) receive the optimization? It doesn't, but it's not immediately apparent unless you happen to know off the top of your head that 262144 is 1<<18.

It also seems to me like working with these types would in many cases require @intCast peppered all over the place. If you want to iterate over the range, you need to be able to test when you've gone past the end. Which means you need to use a larger integer type for the iterator and cast it inside the loop. If you want to do math between ranged ints and store the result in a ranged int, you will very probably need a range cast. Those casts can harden the code, making it more difficult to change the ranges if your requirements change. They also have the potential to make generic code that uses ranged ints very difficult to write correctly. And if you get the range wrong, it can't be a compile error because it's a cast. It can only manifest as checked UB. How much mental bandwidth are you willing to spend on calculating the ranges of your intermediate values?

Finally, while applying range information to type resolution would solve #747, it also introduces a problem when assigning directly. var x = 4; would now compile, but set the type of x to @Int(4, 5), which is not a very useful type for a var. This would then move the compile failure to the line where x is modified, with a pretty cryptic error message (or worse, checked UB, depending on how arithmetic between ranged types resolves).

There is a precedent for an opaque optional optimization in Zig: ?*. This generates a sentinel value zero, and I think we can all agree that this is a Good Idea. Where I think the above proposal differs is that this new optimization is dependent on picking range constants that obey properties that are not immediately obvious. I think that we could go a long way with a simpler guarantee like "If there are unused bits in an integer type in a struct, an optional of that integer type is guaranteed to put a valid flag in the unused bits.". This is mechanically the same as making the flag in optionals align(0), but (at least for me) is a much easier wording to load into Brain RAM and work with. We could then add explicit mechanisms to trigger the other optimizations suggested above if you need them.

Here are some off-the-cuff suggestions. There's definitely room for improvement for these, but I think they do a good job of showing the direction I'm looking.

Range-checked Explicit Integers

These are the same as the original proposal, but the integer bit-width is specified by the programmer. For example, @RangedInt(u32, 5, 1<<32-1) or @RangedInt(comptime_int, 10, 101). If the range cannot be represented by the underlying type, it is a compile error. This provides a path for type resolution of arithmetic on ranged integers that won't introduce surprises if ranges change. The type resolution of addition of these values is the result type of addition of the explicit integer types of the two operands, restricted to the maximum range of addition of the two operands, saturated to the maximum representable range of the resolved bit type. For convenience, we could supply a library function SmallestRangedInt(comptime min_value: comptime_int, comptime exclusive_max_value: comptime_int) type. But I don't think an inferred type should be the language default.
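
A worked example of that resolution rule (hypothetical syntax, exclusive upper bounds as in the original proposal):

// a: @RangedInt(u8, 0, 201)  -- values 0..200
// b: @RangedInt(u8, 0, 101)  -- values 0..100
// a + b:
//   - the result type of u8 + u8 is u8
//   - the sum of the ranges is 0..300, which u8 cannot represent
//   - saturate to u8's representable range
// => @RangedInt(u8, 0, 256), making it visible that the addition itself can overflow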

Integer Unions:

An integer union (or alternatively, "partition" or "ranged enum") would be a union of non-overlapping ranged ints. It is guaranteed to be the size of its backing integer type.

const Foo = partition(u32) {
    normal = 0..(1 << 32) - 2,
    special1 = (1 << 32) - 2,
    special2 = (1 << 32) - 1,
};

This is very similar to the union example in the original proposal, but with this if the range is specified incorrectly or a new value is added, there would be a compile error instead of a drastic change in representation. If the integer type is not specified, it would be inferred from the ranges as the smallest type that can represent the range. But the ability to specify the bit width and receive an error if it can't be represented is valuable. Ranged unions could also support the '_' option from extensible unions.

Explicit Sentinel Optionals:

If an optional type could be specially constructed by specifying a sentinel, this would solve sentinel checks in a way that guarantees the optimization. Something like @OptionalSentinel(comptime IntType: type, comptime sentinel: meta.BackingIntegerType(IntType)). This documents the sentinel value, which makes debugging much less painful. It also guarantees that the optimization is made, so you don't need to worry that changes to the underlying int range will reintroduce the optional flag bit. IntType could be an integer, ranged integer, enum, or integer union.
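
A hypothetical usage sketch of that idea (not current Zig; @OptionalSentinel is the builtin suggested above):

const std = @import("std");

// An index type whose null is documented as maxInt(u32); the resulting
// optional is guaranteed to stay 4 bytes, or this fails to compile.
const OptionalIndex = @OptionalSentinel(u32, std.math.maxInt(u32));

var table_index: OptionalIndex = null;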


None of these counter-proposals are perfect, but I like that they take more direct control over the optimization. E.g. instead of thinking "I'd like to optimize this optional to use a sentinel value, I should put ranges this int type everywhere it's used", I can instead think "I'd like to optimize this optional to use a sentinel value, I should tell zig to optimize this optional to use a sentinel value.".

@BarabasGitHub
Contributor

BarabasGitHub commented Feb 20, 2020

I think that we could go a long way with a simpler guarantee like "If there are unused bits in an integer type in a struct, an optional of that integer type is guaranteed to put a valid flag in the unused bits.".

Sorry, small nitpick. It's not actually bits, it's unused values. For example in the pointer case 0 isn't a specific bit, it's just a special value. It's a slight difference. Same with the common -1/max-uint, it's not just the last bit, it's actually one value. But I see from your examples/proposals that you understood.

I agree with you that it would be cleaner if it's more explicit.

@SpexGuy
Contributor

SpexGuy commented Feb 20, 2020

Sorry, yeah, my wording was unclear. I'm trying to convey that the simpler guarantee of a bit for optionals of non-power-of-two integers is easier to guarantee and think about than an inferred sentinel, and that something like @OptionalSentinel might be better for when a sentinel value is needed.

@ghost

ghost commented May 19, 2020

If we make it so an optional ranged type interprets any out-of-range value as null, we could do things like this:

// Zero-cost wrapper for an ABI that uses negative values as error codes
const Result = fn (Int: Type) Type {
    return union {
        // Nullable limits to avoid generating unnecessary bounds checks
        success: @Ranged(Int, 0, null),
        failure: @Ranged(Int, null, -1),
    };
};

Define canonical null as the most negative value above the range if it exists, else the most positive value below the range. That way, even if a type is multiple-optionalised, the span of non-null values at any level is always contiguous, and does not change with type widening/narrowing in the most common cases.

Possible optimisation: allow nesting of @Ranged optionals, and only allow the ? type operator to be used on ranged types (pointer types are implicitly ranged), like so:

// Can be optimised to u32 with tag
// -- don't have to set global rules
const OptionalU32 = ?@Ranged(u32, null, maxInt(u32));

// Don't need another range -- canonical null of
// OptionalU32 is inner null, anything else is outer null
const OptionalOptionalU32 = ?OptionalU32;

// ;)
const Bool = ?@Ranged(u1, 1, null);
const True: Bool = 1;
const False: Bool = null;
const And = fn (T: Type, a: ?T, b: ?T) ?T {
    return if (a) { b } else { null };
};
const Or = fn (T: Type, a: ?T, b: ?T) ?T {
    return a orelse b;
};
const Not = fn (a: Bool) Bool {
    return if (a) { null } else { 1 };
};

Pro

  • More explicit in intention
  • Optionals can now be used in packed objects
  • Booleans don't have to be primitive ;-)

Con

  • Optionalising arbitrary calltime-known type now impossible, though I can't think of a use case for this that isn't possible another way

@shawnl
Contributor

shawnl commented Jun 8, 2020

You could use INT_MIN for null, as it already has some pretty funny properties. This still requires a new type (which, interestingly, would be agnostic to two's complement and ones' complement), but it is far less invasive than this proposal, which IMHO belongs in an optimizing compiler and not Zig.

@shawnl
Contributor

shawnl commented Jun 18, 2020

We already have a ranged integer type. It is called enum. This proposal creates far more problems than it solves.

@mlugg
Member

mlugg commented Jun 19, 2023

I've been thinking a lot about this proposal recently - I'm in strong support of it, and wanted to quickly note down some of the things it would make possible which I think are nice, and some interesting points about it.

  • As previously mentioned, this entirely eliminates the need for comptime_int. A literal 3 has type @Int(3, 4). Peer type resolution on int types @Int(a, b) and @Int(c, d) gives @Int(@min(a, c), @max(b, d)).
    • Note on representation: I believe the in-memory representation of an @Int(a, b) should be the smallest unsigned or signed two's complement integer which can hold all values in that range. Except, maybe @Int(a, a + 1) should be an exception and be a zero-bit type.
  • All standard arithmetic operations can create a precise result type, making overflow impossible. That way, all possible overflow panics are explicitly written in code using @intCast.
    • Bitwise, wrapping, and perhaps saturating operations all require the operands to be a perfect uX or iX
    • a % b where b is comptime-known has a result type of @Int(0, b)
  • This proposal meshes much more nicely with the new type-refining behavior of @min and @max. It'll make clamping to ranges actually work, and make the result type more precise, which is in particular helpful for switching on it.
    • e.g. @max(10, @min(my_i32, 150)) currently can't refine the type at all on its own, and doing @min(@max(10, my_i32), 150) only gets you down to a u8. Under the proposal, both of these would give you the exact correct type, @Int(10, 151).

One reasonable concern regarding automatically determined result types is the runtime cost. However, I believe this should be fairly easy to optimize in release modes; if the result of a chain of arithmetic is @intCast down to a small type, most of the time it's safe for operands to be truncated to that type. For instance, if the result of a standard multiplication is intcast down to a u64 from some larger unsigned integer, then either the result is zero (in which case any truncation preserves the result) or both operands also fit in a u64 (so can in turn be intcast down). There are similar simple rules for most operations and types, so it should be fairly easy to write an optimization pass for this if LLVM isn't capable of doing a good job by itself.
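
A hypothetical sketch of the pattern this describes (proposed semantics, not current Zig):

fn mulAdd(a: u32, b: u32, c: u32) u32 {
    // Under the proposal the intermediate type is wide enough that
    // overflow cannot happen mid-expression.
    const wide = a * b + c;
    // The single range check lives at the explicit narrowing; in release
    // modes the optimizer can often narrow the operands instead.
    return @intCast(wide);
}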

I had more thoughts in my head, but they've kinda gone now. I'll edit this comment if they come back to me.

EDIT: one thing to note is that in debug builds, I think this type-expanding logic could at times give better runtime performance, since for a string of integer operations, it elides any safety checks right up until the @intCast of the final result, whereas in status quo every operation would have an overflow check. The intermediate types might be a bit bigger, so maybe e.g. some arithmetic ops require a couple of instructions (again, only in debug; in release the optimizations discussed above apply), but that's probably better than an overflow check and branch!

@andrewrk
Member Author

I've been thinking along the same lines as you @mlugg and I definitely want to explore this direction before 1.0. This will likely be a huge investment in a large branch to explore the idea, but somebody's gotta do it.

@andrewrk
Member Author

andrewrk commented Jul 4, 2023

@mlugg, it looks like @lerno has made some explorations of similar ideas with blog posts linked here: #16310 (comment)

I'm not sure how much those lessons will apply to Zig but seems worth reading about his experiences as some prior art related to this proposal.

@mlugg
Member

mlugg commented Jul 4, 2023

@andrewrk, thanks for that link, those posts were interesting to read through (and thanks @lerno for writing them up!).

Personally, my main takeaway from them is that any reasonable solution which does not guarantee safety in the way we're discussing here tends to be problematic. The articles delve into a lot of detail on what are IMO rather complex solutions to the problem, and while they see some success, there's a tradeoff there with language complexity, which I think it's desirable for us to keep to a minimum. Of course, Zig's current rules are relatively simple, but one thing noted in the post is that expressions like const x: u32 = a + (b * b) are very easy footguns, since it sort of looks like this should be fine if b * b fits in a u32, but if b is, say, 255: u8 it breaks. This shows, I think, that using PTR for arithmetic across the board is dangerous. There are of course many solutions to this, but I think it's important to keep in mind that the more complex the semantics of integer arithmetic are, the harder it is to look out for bugs when necessary.

What I like about this proposal is that it sidesteps that issue, whilst also being conceptually simple. When using standard arithmetic with this idea, it is effectively impossible to accidentally do the wrong thing. You can, for all intents and purposes, pretend you're just using bigints, eliminating the issues described in the blog posts. The eventual @intCast (or @truncate or whatever) makes the shortening very explicit, so the surface area for bugs becomes very obvious and intentional. At the same time, the change would not make integer arithmetic in general too painful to use, since the general pattern will probably be "perform all arithmetic then cast result down".

Assuming I'm right that this doesn't make integer arithmetic overly annoying to use, I suspect one of the key deciders for this proposal will end up being how well it can optimize, and whether you need to expend effort to get the optimizer to do what you want in performance-critical code. I obviously can't say how this will go, but I'm optimistic.

@andrewrk
Member Author

andrewrk commented Jul 4, 2023

Thanks for looking into it. I agree with everything you said, and to add to that, one thing I'm concerned about is how easy or difficult it will be to choose integer types for function parameters. I think we just have to fuck around and find out.

@davidstone

I have implemented such a family of types in C++: https://github.com/davidstone/bounded-integer/blob/main/bounded-readme.md. I have not experienced any code size or run-time performance issues (and sometimes I get improvements), and I have a moderate-sized code base that uses bounded::integer pervasively. For instance, I have an implementation of std::vector (C++'s dynamic array type) that is implemented in terms of this integer type and based on all of my test cases generates better code than any other standard library (although I believe most of this improvement is due to unrelated reasons, this shows it doesn't prevent optimization).

From the documentation:

bounded::integer<1, 100> const x = f();
bounded::integer<-3, 7> const y = g();
auto z = x + y;
static_assert(std::same_as<decltype(z), bounded::integer<-2, 107>>, "Type of z incorrect.");

After using this, I feel crippled when I'm forced to use the types of integers that come with most programming languages.

The main drawback is increased compile times, but a big portion of that is due to the cost of C++ template instantiation. I'm not sure how that differs in Zig and whether you'd be implementing it in the compiler.

@mlugg
Member

mlugg commented Oct 1, 2023

@davidstone Thank you for this, it's a very useful data point.

I have no reason to believe an implementation of this would lead to any significant changes in compile times - the only thing I can think of is that when dealing with particularly large types, you may be performing a fair bit more bigint arithmetic, but that would only matter for types with bounds larger than probably i64/u64. I still doubt that would have any noticeable perf impact, since the vast majority of time in semantic analysis is not spent on these simple arithmetic operations.

andrewrk added a commit to andrewrk/flag2struct that referenced this issue Feb 7, 2024
andrewrk added a commit that referenced this issue Mar 21, 2024
When the slice-by-length start position is runtime-known, it is likely
protected by a runtime-known condition and therefore a compile error is
less appropriate than a runtime panic check.

This is demonstrated in the json code that was updated and then reverted
in this commit.

When #3806 is implemented, this decision can be reassessed.

Revert "std: work around compiler unable to evaluate condition at compile time"
Revert "frontend: comptime array slice-by-length OOB detection"

This reverts commit 7741aca.
This reverts commit 2583b38.
TUSF pushed a commit to TUSF/zig that referenced this issue May 9, 2024
andrewrk added a commit that referenced this issue Aug 24, 2024
This reverts commit cb5a6be.

I deeply apologize for the churn.

This change is problematic given that we do not have ranged integers
(yet? see #3806).

In the meantime, this type needs to be `usize`, matching the length and
index types for all std lib data structures.

Users who want to save memory should not use heap-allocated BoundedArray
values, since it is inherently memory-inefficient. Use a different
memory layout instead.

If #3806 is accepted and implemented, the length value can become an
integer with the appropriate range, without the footgun. If that
proposal is not accepted, len type will remain a usize.
@mnemnion

mnemnion commented Sep 21, 2024

It seems worth observing that this offers a possible solution to the felt need in #17039/#14704. Andrew made a comment to that effect in the first of those issues, I wanted to add some notes about it here.

If and when integers are defined as a range, rather than a width, it could be worthwhile to make this legal:

for (u8) |byte| {
   // ...
}

If I saw this I would have no question about what values byte will take, or the type, or the order in which they're iterated.

If that's too weird, it could take a builtin:

for (@range(u8)) |byte| {
    // ...
}

But I think that in a world where integers are ranges, it would be eloquent for them to coerce to an iteration of that range in a for loop position. Anything involving a different order or stride would need to use a while loop, this approach wards off the temptation to add 'just a bit more syntax', it treats an integer type as implicitly iterable, like an array or slice.

This brings me to a point which is tangential to the main thrust of this comment, yet necessary for what follows: ranged integers should be defined inclusively, as closed intervals. I feel that this accords with how we reckon with integers already: I think of a u8 as [0, 255], not [0, 256): everything which fits in eight bits, nothing which does not.

I do understand why using the exclusive range felt more natural, of course, but we really want it to be inclusive: most of this comment serves as the argument for that, although this point is not why I came to this issue today.

That would make u0 == @Int(0,0), u1 == @Int(0,1), and so on. It wouldn't be possible to define noreturn as an integer type, which I think is correct: an uninhabited integer range is not a necessary or especially coherent thing to have, and the empty set of integers (if needed) can be void. We don't currently have a way to express an uninhabited integer type, and I don't see a compelling need for it: it could be spelled @Int(1,0), there's precedent, but I would vote against this. Zig might benefit from an uninhabited bottom type, which noreturn is not (see #6602), but this shouldn't involve integers in any way.

This is how Ada does it, and with reason. Ada also has a 'modular' type, which also specifies wraparound arithmetic for the type. Since this uses modulus, it mentions the first unsigned integer which isn't in the range. Zig uses explicit modular operators, so this is probably not a useful concept for us.

Another data point is the C standard, Section 7.20.1. It doesn't use interval notation at all, but it defines macros for limits in terms of minimum and maximum (inclusive) values.

Inclusive is the better convention in its own right, and also best compatible with the for loop coercion just described. It would be weird if coercing u8 to a range didn't include 255. That was an unexamined problem with #17039, writing something like for (0..256) |byte: u8| is.. legible, but unsatisfyingly inconsistent, normally you can't say 256 and u8 in the same breath. This is in fact a variation on why integer types should be described inclusive of their extreme values.

I broadly agree with not having two range syntaxes for a for loop, certainly not if they vary by one ., and given that, "slice style" instead of "switch style" is the better choice. But both of these are useful nonetheless, and this would give us both.

There's another place this coercion could apply: switch statements.

fn byteSwitch(b: u8) void {
    const a_range: type = @Int(0, 0x7f);
    const b_range: type = @Int(0xc2, 0xdf);
    // ... 
    switch (b) {
        a_range => ... ,
        b_range => ... ,
    }
}

These custom types coerce to u8, and it should be perfectly clear what's going on here. Switch ranges are already inclusive, as they should be, and as integer ranges should be as well.

For the same reason that it would be confusing to have an exclusive range in a switch statement, it would be confusing to have an exclusive range in a ranged integer definition.

Part of why I wrote this out was to be able to link to it later, but also on the premise that it's good to have discussions of the positive (or negative) consequences of a feature in the issue tracking it. I don't see a lot of point in writing the possible coercion features up as a separate issue until or unless this one is tagged 'accepted', although I'm willing to do so.

@mnemnion

I also wanted to respond separately to @SpexGuy's comment, since it has coherent implementation concerns which haven't yet been completely addressed. That was four years ago, and Spex might have a completely different take now; I want to acknowledge that before continuing. Zig has also changed quite a bit in that period of time.

For me, the benefit of this proposal is lifting ranges up into the type system, and the niche optimizations are a nice bonus. I think we can come up with rules which are general, easy to understand, and cover all cases of optional integer types, whether by themselves or in a packed struct.

At the limit, someone concerned if they left a niche in their enum type can use a test: try expectEqual(@bitSizeOf(?E), @bitSizeOf(E));. I don't think losing an existing niche is that different from expanding the backing width of an enum which doesn't specify its integer type: they're both problems where it will either show up (invalidating the width of a packed struct) or if it's important it's worth a test. The literal first Zig file I ever wrote had a test asserting the size of the tagged union I was writing, I think it's basically ok to require users to assert invariants which need to stay that way.

Another solution is this:

const ThreeEnum = enum(@Int(1,3)) {
     fee,
     fie,
     foe,
};

This seems like something which would have to be legal if integer ranges are just integers. This guarantees (according to what follows) an optional sentinel at zero.

It seems to me that the solution to the limited-range problem of assigning comptime ints is something we'd have to do already: you need to provide a type for a var, but not a const. So it would be fine that const a = 4; gets the type @Int(4,4): but a var needs more than a value, it needs a restriction on the future values it can take, and a comptime int value can't provide that. I'd suggest this would apply with or without integer range types, unless we wanted to punt and just say that a comptime_int coerces to a usize or isize when no type is provided. I would prefer that not happen.

Niche Optimization of Ranged Integer Types

I propose a rule like the following: A ranged integer type has an automatic bitwidth, which is the minimum necessary to hold every value in the range. If there are any negative values in the range, this is a signed width, otherwise, unsigned.

If there are representable values in that width which are not part of the range, then the first value higher than the maximum, if it exists, is the optional sentinel, otherwise, the first value lower than the minimum.

If there are no representable values in the bitwidth (these are the integers we have already) then the original bitwidth is sign-extended by one bit, and the first value of the new width which is higher than the maximum of the old width is the optional sentinel.
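
A few worked examples of that rule (hypothetical syntax, reading the bounds as inclusive per the earlier comment):

// @Int(1, 10)   -> minimum width u4 (values 0..15); 11 sits above the max, so 11 is the sentinel
// @Int(-3, 100) -> minimum width i8 (-128..127); 101 is the sentinel
// @Int(0, 255)  -> u8 is exactly full; widen by one bit and use 256 as the sentinel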

I think this covers all the bases. Part of what Spex was addressing is combinatoric difficulties with the align(0) part of the proposal, which were predicated on another proposal which didn't make it. This rule makes it simple enough to add integer ranges to packed structs, and it reflects the reality of the CPUs the code will ultimately run on.

This rules out some of the more exotic 'optimizations', which would be space-optimizing only, and at the expense of everything else. For example, [-3, 240] could be represented in eight bits, this would require it to have a bitwidth of i9 instead. I think we actually want this property, just on the grounds of sanity: imagine debugging a [450, 700] which is stored inside a single byte! This is something we could do, but really, truly, shouldn't. That range would have a u10 bitwidth, and what we get in return is efficiency and coherence.

Even in the event of following in the footsteps of Pascal and Ada by having arrays with widths and indices defined by a ranged integer, the offsetting of an index into that array should be at the point of use, not in the storage value of a ranged integer used for indexing that array. This is the tip of a huge iceberg, which I will now veer away from.

Alternative version of the rule which is actually better: for an unsigned range, if there are available values higher than the maximum, the highest value which fits in the bitwidth is chosen, otherwise a bit is added and the sentinel is the maximum value of the new width. For a signed range, if there are available values lower than the minimum, the lowest possible value of the width is chosen, otherwise the highest possible value, otherwise it is sign-extended by one and the lowest value of the new bitwidth is the optional sentinel.

My little bait-and-switch here is because that formula is significantly harder to understand, and looks worse at a first glance. But it's better, because it always chooses all 1 bits when it can, and when it can't, it chooses 0 followed by all 1s. This is the opposite of ?*T, but the same sentinel as ?*T (all zeros) will frequently not be available, and choosing its exact opposite is nice. This is less for the compiler to think about, and collapsing cases is generally a good thing.

There might be a case for adding "if zero is not in the range, then zero is the sentinel" but I wouldn't make it. Someone else might, but I'm sensitive to the fact that the second version of the rule is more complex, and therefore harder to understand. I'll acknowledge that zero is a good value to be working with when it comes time to emit machine code, however.

I could also be wrong about the more complex rule being better for the compiler, but it would be easier to debug, and that's not nothing.

This neglects the actual size which the integer will take, and that's intentional. A u9 would have an optional bitwidth of u10, and a sentinel of 0b1111_1111_11, and when packed, would be exactly that many bits. On its own, or in an array, it would be of bitsize 16, or sizeof 2, and the sentinel would more than likely be 0b1111_1111_1111_1111, although note that I'm assuming a convention which might not turn out to be the correct one, this is for illustration purposes only.

As a note, Zig's documentation doesn't specify how non-natural-width integers are represented on the machine, and perhaps that's intentional, but, maybe it should.

Back to the comment I'm replying to: I don't think we want either of "bring your own bitwidth" or "bring your own sentinel". That's creating opportunities for incompatibility where there otherwise would not be. It seems to me that even in the packed case, a width wider than the minimum dictated by the rules here can be emulated by adding a pad field after the ranged integer.

It also brings the weirdness that a u32[0,777] should coerce to a u16[0,1000], since it inherently fits. But we don't let a u32 coerce to a u16. This could be made to work, sure, but I think it's causing more problems than it solves.

The major reason to even define what the sentinel will be is debugging; it isn't a value that user code has native access to, although presumably this would reveal it:

const opt_ranged: ?@Int(1, 14) = null;
const as_cast: u8 = @bitCast(opt_ranged);

And I think the fact that the above would be possible, and debugging, are good enough reasons to have the sentinel value be a specified part of the language.

This may be irrelevant, given that align(0) is closed, but I also see little value in allowing enums to combine in the align(0) style of the first post. One consequence of that would be making hex dumps and other debug info very opaque indeed. Set unions of enums have been proposed before, and they have the same basic issues as densely packing logically-separate enums into a single structure: instead of one concrete value, enums could have up to several, and that is most likely the wrong tradeoff to make.

I'd argue that even for packed structs, trying to minimize space in a range like [256, 511] should be resisted. If it's imperative that values of that nature fit in eight bits and not nine, it's reasonable to do that with arithmetic. Things would just get weird otherwise.

The partitioned integer is a cool idea though. I haven't a clue how feasible it is, but it's probably its own issue rather than this one.

@davidstone

You'll have to forgive that all of my example code here uses C++ syntax; I'm not familiar enough with Zig to be confident giving the Zig equivalent.

My C++ library defines the range bounds inclusively (integer<0, 3> can take any of the values in the set {0, 1, 2, 3}), and in my usage that has almost always been the more convenient choice; I can't remember any examples where it was confusing and caused a bug. The primary time that comes to mind where I wanted a half-open range was when I want to get the index type of a range (integer<0, max_size - 1>), but I just added an index_type that does that for me and then never worried about it again.

I agree that we should make the representation of the integer type match its value. In other words, integer<10, 15>(10) should store bits that represent 10, not 0. This means that an integer type like integer<256, 260> requires 2 bytes of storage even though it theoretically only needs 1, but in exchange arithmetic and comparisons are a single instruction. In cases where you are very size constrained, it's fairly easy to write a packed_integer type that performs the necessary arithmetic adjustments for you, but you'll get much better code generation for the common case by not offsetting.

My integer type also works with a compressed optional implementation -- it's split into the general optional type and an associated traits class to manage the optimization for integer types specifically. The general idea is similar to what was outlined here: use the unused values as sentinel values. My implementation uses the least representable value as the first tombstone value and counts up from there, skipping over the valid value range. So if integer<1, 10> is represented by a uint8_t, a disengaged optional of that type would be represented by 0x00. The tombstone_traits idea allows space-efficient optional<optional<T>>, as well as more general tombstone-style code (for instance, some hash maps need two tombstone values, so in that example the two representations would be 0x00 and 0x0A). I chose this only because that ensures that for unsigned underlying storage, the representation of a disengaged optional<T> is all 0x00 bytes. I don't know if that actually matters for anything, but every other choice seemed more complex or even less likely to have a beneficial representation.

I much prefer the design of my library where the user says "I want an integer in this range, you figure out how to store it (integer<5, 10>)" over a design where the user says "I want an integer in this range that is stored in this underlying type (u8<5, 10>)".

@davidstone

It's probably not clear by casual inspection and I don't know how optional support works in Zig, but in my C++ library, the type optional<integer<0, 254>> takes 1 byte of storage with the value of 255 being the sentinel and it follows the compressed code path, but the type optional<integer<0, 255>> takes two bytes of storage and it follows the regular uncompressed codepath that is used by default for any types that don't have the ability to store sentinel values in them. This is shown in optional.cpp by looking at the optional<nullable T> specialization vs the optional<typename T> base template.
