[LLVMdev] C as used/implemented in practice: analysis of responses

Wed Jul 1 09:58:24 PDT 2015

On 1 July 2015 at 17:15, Russell Wallace <russell.wallace at gmail.com> wrote:
> I'm proposing that LLVM unilaterally replace most undefined behaviour with
> implementation-defined behaviour.

That's precisely the problem. Which behaviour?

Let's have an example:

struct Foo {
  long a[95];
  char b[4];
  double c[2];
};

void fuzz(Foo &F) {
  for (int i=0; i<100; i++)
    F.a[i] = 123;
}

There are many ways I can do this "right":

1. Only go up to 95, since you're using an integer to set the value.
2. Go up to 96, since char is an integer type.
3. Go all the way to 100, but casting "123" to double from 97 onwards, in pairs
4. Go all the way to 100, and set integer 123 bitwise (for whatever fp
representation that is) from 97
5. Do any of the above, and emit a warning
6. Bail on error
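
To see why the options above differ, it helps to ask which member a
store to F.a[i] would land in if the compiler simply computed the
address and wrote there. A rough sketch, not anything LLVM does:
Region, region_of_index and first_out_of_bounds_index are names I made
up for illustration, and the answers for i >= 95 depend on sizeof(long)
and padding on your target.

```cpp
#include <cassert>
#include <cstddef>

struct Foo {
  long a[95];
  char b[4];
  double c[2];
};

// Hypothetical classification of where a byte offset inside Foo falls.
enum class Region { A, B, C, Padding, PastEnd };

// Number of elements in F.a, i.e. the first index that is out of bounds.
constexpr std::size_t first_out_of_bounds_index() {
  return sizeof(Foo::a) / sizeof(long);
}

// Classifies the first byte that a naive store to F.a[i] would touch.
// Only offsets are computed here; no out-of-bounds access is performed.
Region region_of_index(std::size_t i) {
  assert(offsetof(Foo, a) == 0 && "a is the first member");
  const std::size_t off = i * sizeof(long);
  auto in = [off](std::size_t begin, std::size_t size) {
    return off >= begin && off < begin + size;
  };
  if (off >= sizeof(Foo)) return Region::PastEnd;
  if (in(offsetof(Foo, a), sizeof(Foo::a))) return Region::A;
  if (in(offsetof(Foo, b), sizeof(Foo::b))) return Region::B;
  if (in(offsetof(Foo, c), sizeof(Foo::c))) return Region::C;
  return Region::Padding;
}
```

Calling region_of_index(i) for i from 95 to 99 on two targets with
different long sizes will likely give different answers, which is one
more reason "which behaviour?" has no obvious answer.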

Compilers prefer not to bail out with an error, since the standard does not require one.
A warning would be a good thing, though.

Now, since it's a warning, I *have* to output something. What? Even
considering one compiler, you'll have to convince *most* <compilerX>
engineers to agree on something, and that's not trivial.

Moreover, this loop is very easy to vectorise, and that would give me
a 4x speed improvement for 4-way vectorisation. That's too much for
compilers to pass up.

If I create a vectorised loop that goes all the way to 92, I'll have
to create a tail loop. If I don't want to create a tail loop, I have
to overwrite 'b' (and probably 'c') on a vector write. If I implement
the variations where I can do that, the vectoriser will be very happy.
People generally like it when the vectoriser is happy.
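
For reference, the "vector body up to 92 plus tail loop" shape looks
roughly like the sketch below. This is hand-written and stays in
bounds; the four scalar stores per iteration stand in for one 4-wide
vector store, and kLanes, kCount, kMainEnd and fill_in_bounds are names
I invented, not anything the vectoriser emits.

```cpp
#include <cassert>
#include <cstddef>

struct Foo {
  long a[95];
  char b[4];
  double c[2];
};

constexpr std::size_t kLanes = 4;  // assumed vector width (4-way)
constexpr std::size_t kCount = sizeof(Foo::a) / sizeof(long);  // 95
constexpr std::size_t kMainEnd = kCount - (kCount % kLanes);   // 92

static_assert(kMainEnd == 92, "main loop covers indices 0..91");

// Fills only F.a, never touching b or c.
void fill_in_bounds(Foo &F) {
  std::size_t i = 0;
  // Main body: each iteration models a single 4-wide vector store.
  for (; i < kMainEnd; i += kLanes) {
    F.a[i + 0] = 123;
    F.a[i + 1] = 123;
    F.a[i + 2] = 123;
    F.a[i + 3] = 123;
  }
  // Tail loop: the remaining 95 - 92 = 3 elements, one at a time.
  for (; i < kCount; ++i)
    F.a[i] = 123;
  assert(i == kCount);
}
```

Dropping the tail loop means the last vector store covers indices 92 to
95, and index 95 is no longer inside 'a'.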

Now, you have a "safe mode" where these things don't happen. Let's say
you and I agree that it should only go to 95, since this is "probably
what the user wants". But some programmers *use* that as a feature,
and the standard allows it, so we *have* to implement *both*.

Best case scenario, you have now implemented two completely different
behaviours for every undefined behaviour in each standard. Worse
still, you have divided the programmers into two classes: those that
play it safe, and those that don't, essentially creating two different
programming languages. Code that compiles and works with
compilerA+safe_mode will not necessarily compile/work with
compilerB+safe_mode or compilerA+full_mode either.

C and C++ are already complicated enough, with so many standard levels
to implement (C90, C99, C11, C++03, C++11, C++14, etc.) that
duplicating each and every one of them, *per compiler*, is not
something you want to do.

That will, ultimately, move compilers away from each other, which is
not what most users really want.

cheers,
--renato