[LLVMdev] C as used/implemented in practice: analysis of responses

Duncan P. N. Exon Smith dexonsmith at apple.com
Sat Jun 27 09:01:13 PDT 2015


> On 2015 Jun 26, at 17:02, Peter Sewell <Peter.Sewell at cl.cam.ac.uk> wrote:
> 
>> On 26 June 2015 at 22:53, Sean Silva <chisophugis at gmail.com> wrote:
>> All of these seem to fall into the pattern of "the compiler is required to
>> do what you expect, as long as it can't prove X about your program". That
>> is, in the absence of inferring some extra piece of information about your
>> program, the only reasonable compilation is the one you expect. For
>> example, the only way to codegen a comparison between two random pointers
>> is the one that has the meaning you expect (on common computer
>> architectures); but if the compiler can figure out something that tells it
>> that comparing those two pointers is undefined by the language standard,
>> then, well, technically it can do whatever it wants.
>> 
>> Many people interpret this as the compiler being somewhat malevolent, but
>> there's another interpretation in some cases.
>> 
>> I have not looked in depth at the history of all the undefined behaviors
>> mentioned in the survey, but some of them are there because at some point
>> in time the diversity of underlying systems made it difficult or
>> impossible to assign a meaning. So long as the diversity that led to the
>> desire to leave something undefined still exists, programs that use those
>> constructs with certain expectations *will* fail to behave as "expected"
>> on those targets (on a system where pointers are represented differently,
>> your program *may* actually format your hard disk if you do so-and-so!).
>> 
>> To put it another way, what is "expected" is actually dependent on the C
>> programmer's knowledge of the underlying system (computer architecture,
>> system architecture, etc.), and there will always be tension so long as the
>> programmer is not thinking about what the C language guarantees, but rather
>> (roughly speaking) how *they* would translate their code to assembly
>> language for the system or systems that they happen to know they're
>> targeting. An x86 programmer doesn't expect unaligned loads to invoke nasal
>> demons, but a SPARC programmer does.
>> 
>> So if you unravel the thread of logic back through the undefined behaviors
>> made undefined for this reason, many of these cases of exploiting undefined
>> behavior are really an extension, on the compiler's part, of the logic
>> "there are some systems for which your code would invoke nasal demons, so I
>> might as well assume that it will invoke nasal demons on this system (since
>> the language standard doesn't say anything about specific systems)". Or to
>> put it another way, the compiler is effectively assuming that your code is
>> written to target all the systems taken into account by the C standard, and
>> if it would invoke nasal demons on any one of them then the compiler is
>> allowed to invoke nasal demons on all of them.
> 
> Sure.  However, we think we have to take seriously the fact that a
> large body of critical code out there is *not* written to target the
> C standard as it now stands, and it is very unlikely to be rewritten
> to do so.

In case you're not aware of it, here's a fairly relevant blog series on
the topic of undefined behaviour in C:

http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html
http://blog.llvm.org/2011/05/what-every-c-programmer-should-know_14.html
http://blog.llvm.org/2011/05/what-every-c-programmer-should-know_21.html

> 
> At the end of the day, code is not written purely by "thinking about
> what the C language guarantees", but rather by test-and-debug cycles
> that test the code against the behaviour of particular C
> implementations.  The ISO C standard is a very loose specification,
> and we do not have good tools for testing code against all the
> behaviour it permits,

*cough* -fsanitize=undefined *cough*

http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation

> so that basic development technique does not -
> almost cannot - result in code that is robust against compilers that
> sometimes exploit a wide range of that behaviour.
> 
> It's also the case that some of the looseness of ISO C relates to
> platforms that are no longer relevant, or at least no longer
> prevalent.  We can at least identify C dialects that provide stronger
> guarantees for the rest.
> 
> thanks,
> Peter
> 
> 
>> This is obviously sort of a twisted logic, and I think that a lot of the
>> "malevolence" attributed to compilers is due to this. It certainly removes
>> many target-dependent checks from the mid-level optimizer though.
