<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Sun, Jun 28, 2015 at 1:28 AM, Peter Sewell <span dir="ltr"><<a href="mailto:Peter.Sewell@cl.cam.ac.uk" target="_blank">Peter.Sewell@cl.cam.ac.uk</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On 27 June 2015 at 17:01, Duncan P. N. Exon Smith <<a href="mailto:dexonsmith@apple.com">dexonsmith@apple.com</a>> wrote:<br>

><br>

>> On 2015 Jun 26, at 17:02, Peter Sewell <<a href="mailto:Peter.Sewell@cl.cam.ac.uk">Peter.Sewell@cl.cam.ac.uk</a>> wrote:<br>

>><br>

>> On 26 June 2015 at 22:53, Sean Silva <<a href="mailto:chisophugis@gmail.com">chisophugis@gmail.com</a> <mailto:<a href="mailto:chisophugis@gmail.com">chisophugis@gmail.com</a>>> wrote:<br>

>>> All of these seem to fall into the pattern of "The compiler is required to<br>

>>> do what you expect, as long as it can't prove X about your program". That<br>

>>> is, the only reasonable compilation in the absence of inferring some extra<br>

>>> piece of information about your program, is the one you expect. For example,<br>

>>> the only way to codegen a comparison between two random pointers has the<br>

>>> meaning you expect (on common computer architectures); but if the compiler<br>

>>> can figure something out that tells it that comparing those two pointers is<br>

>>> undefined by the language standard, then, well, technically it can do<br>

>>> whatever it wants.<br>

>>><br>

>>> Many people interpret this as the compiler being somewhat malevolent, but<br>

>>> there's another interpretation in some cases.<br>

>>><br>

>>> I have not looked in depth at the history in all the undefined behaviors<br>

>>> mentioned in the survey, but some of the undefined behaviors are there<br>

>>> because at some point in time the underlying system diversity made it<br>

>>> difficult or impossible to assign a meaning. So long as the diversity that<br>

>>> led to the desire to leave something undefined still exists, programs that<br>

>>> use those constructs with certain expectations *will* fail to behave as<br>

>>> "expected" on those targets (on a system where pointers are represented<br>

>>> differently, your program *may* actually format your hard disk if you do<br>

>>> so-and-so!).<br>

>>><br>

>>> To put it another way, what is "expected" is actually dependent on the C<br>

>>> programmer's knowledge of the underlying system (computer architecture,<br>

>>> system architecture, etc.), and there will always be tension so long as the<br>

>>> programmer is not thinking about what the C language guarantees, but rather<br>

>>> (roughly speaking) how *they* would translate their code to assembly<br>

>>> language for the system or systems that they happen to know they're<br>

>>> targeting. An x86 programmer doesn't expect unaligned loads to invoke nasal<br>

>>> demons, but a SPARC programmer does.<br>

>>><br>

>>> So if you unravel the thread of logic back through the undefined behaviors<br>

>>> made undefined for this reason, many of these cases of exploiting undefined<br>

>>> behavior are really an extension, on the compiler's part, of the logic<br>

>>> "there are some systems for which your code would invoke nasal demons, so I<br>

>>> might as well assume that it will invoke nasal demons on this system (since<br>

>>> the language standard doesn't say anything about specific systems)". Or to<br>

>>> put it another way, the compiler is effectively assuming that your code is<br>

>>> written to target all the systems taken into account by the C standard, and<br>

>>> if it would invoke nasal demons on any one of them then the compiler is<br>

>>> allowed to invoke nasal demons on all of them.<br>

>><br>

>> Sure.  However, we think we have to take seriously the fact that a<br>

>> large body of critical code out there is *not* written to target what<br>

>> the C standard is now, and it is very unlikely to be rewritten to do<br>

>> so.<br>

><br>

> In case you're not aware of it, here's a fairly relevant blog series on<br>

> the topic of undefined behaviour in C:<br>

><br>

> <a href="https://urldefense.proofpoint.com/v2/url?u=http-3A__blog.llvm.org_2011_05_what-2Devery-2Dc-2Dprogrammer-2Dshould-2Dknow.html&d=AwMFaQ&c=8hUWFZcy2Z-Za5rBPlktOQ&r=Mfk2qtn1LTDThVkh6-oGglNfMADXfJdty4_bhmuhMHA&m=Ny473GDW_-fudG0I3mSZwEQov1NUqtPWPyOVTj7eJkE&s=kKf3UKP8HtIknhFqPEKVW68jFIUxjb8yBYipQHvahAQ&e=" rel="noreferrer" target="_blank">http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html</a><br>

> <a href="https://urldefense.proofpoint.com/v2/url?u=http-3A__blog.llvm.org_2011_05_what-2Devery-2Dc-2Dprogrammer-2Dshould-2Dknow-5F14.html&d=AwMFaQ&c=8hUWFZcy2Z-Za5rBPlktOQ&r=Mfk2qtn1LTDThVkh6-oGglNfMADXfJdty4_bhmuhMHA&m=Ny473GDW_-fudG0I3mSZwEQov1NUqtPWPyOVTj7eJkE&s=y885xZxsfFH_VJYqGhgTp37QHF09OBjAr_QBWctwRSQ&e=" rel="noreferrer" target="_blank">http://blog.llvm.org/2011/05/what-every-c-programmer-should-know_14.html</a><br>

> <a href="https://urldefense.proofpoint.com/v2/url?u=http-3A__blog.llvm.org_2011_05_what-2Devery-2Dc-2Dprogrammer-2Dshould-2Dknow-5F21.html&d=AwMFaQ&c=8hUWFZcy2Z-Za5rBPlktOQ&r=Mfk2qtn1LTDThVkh6-oGglNfMADXfJdty4_bhmuhMHA&m=Ny473GDW_-fudG0I3mSZwEQov1NUqtPWPyOVTj7eJkE&s=YQ76M4zSCpt7dQrhUf7ifQewBeTuFexZ-hARi9dJtDE&e=" rel="noreferrer" target="_blank">http://blog.llvm.org/2011/05/what-every-c-programmer-should-know_21.html</a><br>

<br>

</div></div>We're aware of those, thanks.<br>

<span class=""><br>

>><br>

>> At the end of the day, code is not written purely by "thinking about<br>

>> what the C language guarantees", but rather by test-and-debug cycles<br>

>> that test the code against the behaviour of particular C<br>

>> implementations.  The ISO C standard is a very loose specification,<br>

>> and we do not have good tools for testing code against all the<br>

>> behaviour it permits,<br>

><br>

> *cough* -fsanitize=undefined *cough*<br>

<br>

</span>That (and other such tools) is surely a *lot* better than what we had<br>

before, no doubt about that.  And its developers and those who use it<br>

heavily should be in a good position to comment on our survey<br>

questions, as they are up against the same basic problem, of<br>

reconciling what existing C code actually does vs what compilers<br>

assume about it, to detect errors without too many false positives.<br>

We had quite a few survey responses saying something like "sanitisers<br>

have to allow XYZ, despite the ISO standard, because code really does<br>

it"; in a sense, what we're doing is trying to clearly and precisely<br>

characterise all those cases.   If you or others can help with that,<br>

please do!<br>

<br>

But such tools are, useful and impressive though they are, aren't<br>

really testing code against all the behaviour ISO permits - as I<br>

understand it, they are essentially checking properties of single<br>

(instrumented) executions, while ISO is a very loose spec, e.g. when<br>

it comes to evaluation order choices and implementation-defined<br>

quantities, permitting many (potentially quite different) executions<br>

for the same source and inputs.  Running with -fsanitize=undefined<br>

will detect problems just on the executions that the current compiler<br>

implementation happens to generate.  Of course, checking against all<br>

allowed executions of a very loose spec quickly becomes<br>

combinatorially infeasible,</blockquote><div><br></div><div>Actually, I would be very, very happy to have an O(2^N) (or worse) algorithm for checking all allowed executions ;)</div><div><br></div><div>(the problem is actually undecidable; not just "infeasible")</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> so this isn't unreasonable, but at lease<br>

we'd like to have that gold standard precisely defined, and to be able<br>

to pseudorandomly check against it.<br></blockquote><div><br></div><div>It sounds like what you are wanting is to basically make a list of undefined behaviors in the standard and find out which ones, in practice, should be demoted to unspecified or implementation-defined?</div><div><br></div><div>-- Sean Silva</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

thanks,<br>

Peter<br>

<div class="HOEnZb"><div class="h5"><br>

<br>

<br>

<br>

> <a href="https://urldefense.proofpoint.com/v2/url?u=http-3A__clang.llvm.org_docs_UsersManual.html-23controlling-2Dcode-2Dgeneration&d=AwMFaQ&c=8hUWFZcy2Z-Za5rBPlktOQ&r=Mfk2qtn1LTDThVkh6-oGglNfMADXfJdty4_bhmuhMHA&m=Ny473GDW_-fudG0I3mSZwEQov1NUqtPWPyOVTj7eJkE&s=HG_P5beTyTV9SWrc02v7AYNCehPByS0a7yzsbA9vdDM&e=" rel="noreferrer" target="_blank">http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation</a><br>

><br>

>> so that basic development technique does not -<br>

>> almost, cannot - result in code that is robust against compilers that<br>

>> sometimes exploit a wide range of that behaviour.<br>

>><br>

>> It's also the case that some of the looseness of ISO C relates to<br>

>> platforms that are no longer relevant, or at least no longer<br>

>> prevalent.  We can at least identify C dialects that provide stronger<br>

>> guarantees for the rest.<br>

>><br>

>> thanks,<br>

>> Peter<br>

>><br>

>><br>

>>> This is obviously sort of a twisted logic, and I think that a lot of the<br>

>>> "malevolence" attributed to compilers is due to this. It certainly removes<br>

>>> many target-dependent checks from the mid-level optimizer though.<br>

><br>

</div></div></blockquote></div><br></div></div>