[llvm-dev] [RFC] Coding Standards: "prefer `int` for regular arithmetic, use `unsigned` only for bitmask and when you intend to rely on wrapping behavior."

Tue Jun 18 04:48:37 PDT 2019

My 2 cents on this.

I watched Chandler's talk, and it looks like there are two statements that are a bit
contradictory one to another:

https://www.youtube.com/watch?v=yG1OZ69H_-o&t=2820s "it's fairly easy to get enough bits
w/o using the sign bits"
versus
https://www.youtube.com/watch?v=yG1OZ69H_-o&t=2846s "using size_t takes more space in data
structures"

Why not using size_t for indexes/length, while storing uint32_t integers and cast them
back to size_t when necessary (i.e. for space reasons)?

About checking for overflows, let's take the various examples here:
https://godbolt.org/z/TIGiVe, which are various ways to load a byte at index "n+10" where
n is user-controlled. In this case, we need to check that "n+10" does not overflow.
The first thing is that, IMHO, overflow checks are tricky to get right for both the signed
and unsigned case (and has led to numerous vulnerabilities). But that's a personal
opinion. If we talk about code size (and performance), the load3 function confirms what
Chandler showed in his talk (using uint32_t can be a performance issue on some 64-bit
systems, do to the enforced wrapping that has to be done on 32bits integers). The other
load* functions all emit the intended "mov eax, DWORD PTR [rdi+40+rsi*4]". Notable
differences arise in load2, which needs to sign extend esi to rsi. In load1, there is also
an extra instruction to set the value of (2**63-12).

One conclusion is that, at least on x86-64, using size_t for these kind of overflow checks
is the most efficient. But I might miss something / do something wrong in this analysis,
and I'd like another POV on this!

One argument for the usage of signed integers (actually integer types that don't allow
wrapping) is indeed the ability to automatically insert these overflow checks at compile time.

On 6/8/19 7:17 PM, Mehdi AMINI via llvm-dev wrote:
> Hi,
> 
> The LLVM coding style does not specify anything about the use of signed/unsigned integer,
> and the codebase is inconsistent (there is a majority of code that is using unsigned index
> in loops today though).
> 
> I'd like to suggest that we specify to prefer `int` when possible and use `unsigned` only
> for bitmask and when you intend to rely on wrapping behavior,
> see: https://reviews.llvm.org/D63049
> 
> A lot has been written about this, good references are [unsigned: A Guideline for Better
> Code] https://www.youtube.com/watch?v=wvtFGa6XJDU) and [Garbage In, Garbage Out: Arguing
> about Undefined Behavior...](https://www.youtube.com/watch?v=yG1OZ69H_-o), as well as this
> panel discussion:
> - https://www.youtube.com/watch?v=Puio5dly9N8#t=12m12s
> - https://www.youtube.com/watch?v=Puio5dly9N8#t=42m40s
> 
> Other coding guidelines already embrace this:
> 
> - http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Res-signed
> - http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Res-unsigned
> - https://google.github.io/styleguide/cppguide.html#Integer_Types
> 
> It is rare that overflowing (and wrapping) an unsigned integer won't trigger a program bug
> when the overflow was not intentionally handled. Using signed arithmetic means that you
> can actually trap on over/underflow and catch these bugs (when using fuzzing for instance).
> 
> Chandler explained this nicely in his CPPCon 2016 talk "Garbage In, Garbage Out: Arguing
> about Undefined Behavior...". I encourage to watch the full talk but here is one relevant
> example: https://youtu.be/yG1OZ69H_-o?t=2006 , and here he mentions how Google
> experimented with this internally: https://youtu.be/yG1OZ69H_-o?t=2249
> 
> Unsigned integer also have a discontinuity right to the left of zero. Suppose A, B and C
> are small positive integers close to zero, say all less than a hundred or so. Then given:
> A + B > C
> and knowing elementary school algebra, one can rewrite that as:
> A > B - C
> But C might be greater than B, and the subtraction would produce some huge number. This
> happens even when working with seemingly harmless numbers like A=2, B=2, and C=3.
> 
> -- 
> Mehdi
> 
> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>