[LLVMdev] C as used/implemented in practice: analysis of responses

Peter Sewell Peter.Sewell at cl.cam.ac.uk
Sun Jun 28 01:28:06 PDT 2015


On 27 June 2015 at 17:01, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote:
>
>> On 2015 Jun 26, at 17:02, Peter Sewell <Peter.Sewell at cl.cam.ac.uk> wrote:
>>
>> On 26 June 2015 at 22:53, Sean Silva <chisophugis at gmail.com> wrote:
>>> All of these seem to fall into the pattern of "The compiler is required to
>>> do what you expect, as long as it can't prove X about your program". That
>>> is, the only reasonable compilation in the absence of inferring some extra
>>> piece of information about your program, is the one you expect. For example,
>>> the only way to codegen a comparison between two random pointers has the
>>> meaning you expect (on common computer architectures); but if the compiler
>>> can figure something out that tells it that comparing those two pointers is
>>> undefined by the language standard, then, well, technically it can do
>>> whatever it wants.
>>>
>>> Many people interpret this as the compiler being somewhat malevolent, but
>>> there's another interpretation in some cases.
>>>
>>> I have not looked in depth at the history in all the undefined behaviors
>>> mentioned in the survey, but some of the undefined behaviors are there
>>> because at some point in time the underlying system diversity made it
>>> difficult or impossible to assign a meaning. So long as the diversity that
>>> led to the desire to leave something undefined still exists, programs that
>>> use those constructs with certain expectations *will* fail to behave as
>>> "expected" on those targets (on a system where pointers are represented
>>> differently, your program *may* actually format your hard disk if you do
>>> so-and-so!).
>>>
>>> To put it another way, what is "expected" is actually dependent on the C
>>> programmer's knowledge of the underlying system (computer architecture,
>>> system architecture, etc.), and there will always be tension so long as the
>>> programmer is not thinking about what the C language guarantees, but rather
>>> (roughly speaking) how *they* would translate their code to assembly
>>> language for the system or systems that they happen to know they're
>>> targeting. An x86 programmer doesn't expect unaligned loads to invoke nasal
>>> demons, but a SPARC programmer does.
>>>
>>> So if you unravel the thread of logic back through the undefined behaviors
>>> made undefined for this reason, many of these cases of exploiting undefined
>>> behavior are really an extension, on the compiler's part, of the logic
>>> "there are some systems for which your code would invoke nasal demons, so I
>>> might as well assume that it will invoke nasal demons on this system (since
>>> the language standard doesn't say anything about specific systems)". Or to
>>> put it another way, the compiler is effectively assuming that your code is
>>> written to target all the systems taken into account by the C standard, and
>>> if it would invoke nasal demons on any one of them then the compiler is
>>> allowed to invoke nasal demons on all of them.
>>
>> Sure.  However, we think we have to take seriously the fact that a
>> large body of critical code out there is *not* written to target what
>> the C standard is now, and it is very unlikely to be rewritten to do
>> so.
>
> In case you're not aware of it, here's a fairly relevant blog series on
> the topic of undefined behaviour in C:
>
> http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html
> http://blog.llvm.org/2011/05/what-every-c-programmer-should-know_14.html
> http://blog.llvm.org/2011/05/what-every-c-programmer-should-know_21.html

We're aware of those, thanks.

>>
>> At the end of the day, code is not written purely by "thinking about
>> what the C language guarantees", but rather by test-and-debug cycles
>> that test the code against the behaviour of particular C
>> implementations.  The ISO C standard is a very loose specification,
>> and we do not have good tools for testing code against all the
>> behaviour it permits,
>
> *cough* -fsanitize=undefined *cough*

That (and other such tools) is surely a *lot* better than what we had
before, no doubt about that.  And its developers and those who use it
heavily should be in a good position to comment on our survey
questions, as they are up against the same basic problem, of
reconciling what existing C code actually does vs what compilers
assume about it, to detect errors without too many false positives.
We had quite a few survey responses saying something like "sanitisers
have to allow XYZ, despite the ISO standard, because code really does
it"; in a sense, what we're doing is trying to clearly and precisely
characterise all those cases.   If you or others can help with that,
please do!

But such tools, useful and impressive though they are, aren't
really testing code against all the behaviour ISO permits - as I
understand it, they are essentially checking properties of single
(instrumented) executions, while ISO is a very loose spec, e.g. when
it comes to evaluation order choices and implementation-defined
quantities, permitting many (potentially quite different) executions
for the same source and inputs.  Running with -fsanitize=undefined
will detect problems just on the executions that the current compiler
implementation happens to generate.  Of course, checking against all
allowed executions of a very loose spec quickly becomes
combinatorially infeasible, so this isn't unreasonable, but at least
we'd like to have that gold standard precisely defined, and to be able
to pseudorandomly check against it.

thanks,
Peter




> http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation
>
>> so that basic development technique does not -
>> almost, cannot - result in code that is robust against compilers that
>> sometimes exploit a wide range of that behaviour.
>>
>> It's also the case that some of the looseness of ISO C relates to
>> platforms that are no longer relevant, or at least no longer
>> prevalent.  We can at least identify C dialects that provide stronger
>> guarantees for the rest.
>>
>> thanks,
>> Peter
>>
>>
>>> This is obviously sort of a twisted logic, and I think that a lot of the
>>> "malevolence" attributed to compilers is due to this. It certainly removes
>>> many target-dependent checks from the mid-level optimizer though.
>


