[LLVMdev] C as used/implemented in practice: analysis of responses

Peter Sewell Peter.Sewell at cl.cam.ac.uk
Fri Jun 26 17:02:23 PDT 2015


On 26 June 2015 at 22:53, Sean Silva <chisophugis at gmail.com> wrote:
> All of these seem to fall into the pattern of "The compiler is required to
> do what you expect, as long as it can't prove X about your program". That
> is, the only reasonable compilation in the absence of inferring some extra
> piece of information about your program, is the one you expect. For example,
> the only way to codegen a comparison between two random pointers is the one
> that has the meaning you expect (on common computer architectures); but if the compiler
> can figure something out that tells it that comparing those two pointers is
> undefined by the language standard, then, well, technically it can do
> whatever it wants.
>
> Many people interpret this as the compiler being somewhat malevolent, but
> there's another interpretation in some cases.
>
> I have not looked in depth at the history in all the undefined behaviors
> mentioned in the survey, but some of the undefined behaviors are there
> because at some point in time the underlying system diversity made it
> difficult or impossible to assign a meaning. So long as the diversity that
> led to the desire to leave something undefined still exists, programs that
> use those constructs with certain expectations *will* fail to behave as
> "expected" on those targets (on a system where pointers are represented
> differently, your program *may* actually format your hard disk if you do
> so-and-so!).
>
> To put it another way, what is "expected" is actually dependent on the C
> programmer's knowledge of the underlying system (computer architecture,
> system architecture, etc.), and there will always be tension so long as the
> programmer is not thinking about what the C language guarantees, but rather
> (roughly speaking) how *they* would translate their code to assembly
> language for the system or systems that they happen to know they're
> targeting. An x86 programmer doesn't expect unaligned loads to invoke nasal
> demons, but a SPARC programmer does.
>
> So if you unravel the thread of logic back through the undefined behaviors
> made undefined for this reason, many of these cases of exploiting undefined
> behavior are really an extension, on the compiler's part, of the logic
> "there are some systems for which your code would invoke nasal demons, so I
> might as well assume that it will invoke nasal demons on this system (since
> the language standard doesn't say anything about specific systems)". Or to
> put it another way, the compiler is effectively assuming that your code is
> written to target all the systems taken into account by the C standard, and
> if it would invoke nasal demons on any one of them then the compiler is
> allowed to invoke nasal demons on all of them.

Sure.  However, we think we have to take seriously the fact that a
large body of critical code out there is *not* written to target what
the C standard is now, and it is very unlikely to be rewritten to do
so.

At the end of the day, code is not written purely by "thinking about
what the C language guarantees", but rather by test-and-debug cycles
that test the code against the behaviour of particular C
implementations.  The ISO C standard is a very loose specification,
and we do not have good tools for testing code against all the
behaviour it permits, so that basic development technique does not
(indeed, almost cannot) result in code that is robust against
compilers that sometimes exploit a wide range of that behaviour.

It's also the case that some of the looseness of ISO C relates to
platforms that are no longer relevant, or at least no longer
prevalent.  We can at least identify C dialects that provide stronger
guarantees for the rest.

thanks,
Peter


> This is obviously sort of a twisted logic, and I think that a lot of the
> "malevolence" attributed to compilers is due to this. It certainly removes
> many target-dependent checks from the mid-level optimizer though.
>
> -- Sean Silva
>
> On Fri, Jun 26, 2015 at 1:42 AM, Peter Sewell <Peter.Sewell at cl.cam.ac.uk>
> wrote:
>>
>> As part of a project to clarify what behaviour of C implementations is
>> actually relied upon in modern practice, and what behaviour is
>> guaranteed by current mainstream implementations, we recently
>> distributed a survey of 15 questions about C, https://goo.gl/AZXH3S.
>>
>> We were asking what C is in current mainstream practice: the behaviour
>> that programmers assume they can rely on, the behaviour provided by
>> mainstream compilers, and the idioms used in existing code, especially
>> systems code.  We were *not* asking what the ISO C standard permits,
>> which is often more restrictive, or about obsolete or obscure hardware
>> or compilers.  We focussed on the behaviour of memory and pointers.
>>
>> We've had around 300 responses, including many compiler and OS
>> developers, and the results are summarised below, or on the web at
>> http://www.cl.cam.ac.uk/~pes20/cerberus (which also has more details).
>> For many questions the outcome seems clear, but for some, especially
>> 1, 2, 9, 10, and 11, major open questions about current compiler
>> behaviour remain; we'd greatly appreciate informed comments on those
>> from the relevant compiler developers (or other experts).
>>
>> If you can answer these, please reply either below or by mailing the
>> Cerberus mailing list:
>>
>>   cl-cerberus at lists.cam.ac.uk
>>
>>   https://lists.cam.ac.uk/mailman/listinfo/cl-cerberus
>>
>> many thanks,
>> Kayvan Memarian and Peter Sewell  (University of Cambridge)
>>
>>
>> What is C in practice? (Cerberus survey):  Conclusions
>> ======================================================================
>>
>> ##           Kayvan Memarian and Peter Sewell
>> ###                University of Cambridge, 2015-06-21
>>
>> ----------------------------------------------------------------------
>>
>> In April-June 2015 we distributed a web survey to investigate what C
>> is, in current mainstream practice: the behaviour that programmers
>> assume they can rely on, the behaviour provided by mainstream
>> compilers, and the idioms used in existing code, especially systems
>> code.  We were *not* asking what the ISO C standard permits, which is
>> often more restrictive, or about obsolete or obscure hardware or
>> compilers.  We focussed on the behaviour of memory and pointers.
>> This is a step towards an unambiguous and mathematically precise
>> definition
>> of the consensus *de facto* standard (to the extent that such
>> exists): the C that is actually used by systems programmers and
>> implemented by mainstream compilers.
>>
>> This note summarises our conclusions. For many questions the outcome
>> seems clear, but for some, especially 1, 2, 9, 10, and 11, major open
>> questions about current compiler behaviour remain; we'd greatly
>> appreciate further comments from the relevant compiler developers or
>> other experts.
>>
>> More details of the results are available from
>>
>> [http://www.cl.cam.ac.uk/~pes20/cerberus](http://www.cl.cam.ac.uk/~pes20/cerberus).
>>
>> ### Responses
>>
>> Aiming for a modest-scale but technically expert audience, we
>> distributed the survey at the University of Cambridge systems research
>> group, at EuroLLVM 2015, via John Regehr's blog, and via various
>> mailing lists: gcc, llvmdev, cfe-dev, libc-alpha, xorg, a FreeBSD
>> list, xen-devel, a Google C users list, and a Google C compilers list.
>> In all there were around 300 responses, including around 100 printed
>> pages of textual comments.  Most report expertise in C systems
>> programming and significant numbers report expertise in compiler
>> internals and in the C standard.
>>
>>
>> MAIN QUESTION RESPONSES
>> -----------------------
>>
>> ### [1/15] How predictable are reads from padding bytes?
>>
>> **If you zero all bytes of a struct and then write some of its
>> members, do reads of the padding return zero?  (e.g. for a bytewise
>> CAS or hash of the struct, or to know that no security-relevant data
>> has leaked into them.)**
>>
>> It remains unclear what behaviour compilers currently provide (or
>> should provide) for this.  We see four main alternatives:
>>
>> a) Structure copies might copy padding, but structure member writes
>> never touch padding.
>>
>> b) Structure member writes might write zeros over subsequent padding.
>>
>> c) Structure member writes might write arbitrary values over subsequent
>> padding, with reads seeing stable results.
>>
>> d) Padding bytes are regarded as always holding unspecified values,
>> irrespective of any byte writes to them, and so reads of them might
>> return arbitrary and unstable values (in which case the compiler could
>> arbitrarily write to padding at any point, as that would be masked by
>> this).
>>
>> On the one hand:
>>
>> - In some circumstances it seems important to provide systems
>>    programmers with a mechanism to ensure that no information is
>>    leaked via padding. Rewriting structure definitions to make all
>>    padding into explicit fields may not be practicable, especially if
>>    one wants to do so in a platform-independent way, and so option (d)
>>    is not compatible with this.  Option (c) makes it possible but
>>    awkward to prevent leakage, as the padding must be re-zeroed
>>    after member writes.
>>
>> - In some circumstances programmers may rely on predictable padding
>>    values, at least in the absence of structure member writes,
>>    e.g. for memcmp, hashing, or compare-and-swap of struct values.
>>    Again (d) is not compatible with this, and (a) or (b) are
>>    preferable.  But it's not clear whether any of those usages are
>>    common or essential.
>>
>> - For Clang, one respondent suggests that by the time the optimisation
>>    passes operate, padding has been replaced by explicit fields, so
>>    neither over-wide writes nor permanently-undefined-value behaviour
>>    will occur.
>>
>> - For MSVC, one respondent suggests the compiler provides (a).
>>
>> On the other hand, others suggest plausible optimisations, e.g.,
>> "to apply SRA (scalar replacement of aggregates), replacing the
>> memset with a sequence of member assignments (discarding assignments
>> to padding) in order to do so".  This could require (d) to make the
>> existing compiler behaviour admissible, but it's unclear
>> to us whether it actually does at present.
>>
>> Note also that for structs stored to malloc'd regions, option (d) is
>> at odds with the idea that malloc'd regions can be reused, as a write
>> of a new value (of a new type) to a malloc'd region might start with
>> writes to what were padding bytes, so perhaps we could only really
>> have this semantics for the other storage-duration kinds.
>>
>> Conclusion:
>>
>> For each compiler (GCC, Clang, MSVC, ICC, ...), we ask which of these
>> it provides on mainstream platforms?  If the answer is not (a), to
>> what extent is it feasible to provide compiler flags that force the
>> behaviour to be stronger?
>>
>> Our "mainstream C" semantics should provide switchable options for
>> each of (a), (b), (c), and (d).
>>
>>
>>
>> ### [2/15] Uninitialised values
>>
>> **Is reading an uninitialised variable or struct member (with a
>> current mainstream compiler):**
>>
>> **(This might either be due to a bug or be intentional, e.g. when
>> copying a partially initialised struct, or to output, hash, or set
>> some bits of a value that may have been partially initialised.)**
>>
>> a) undefined behaviour (meaning that the compiler is free to
>> arbitrarily miscompile the program, with or without a warning),
>>
>> b) going to make the result of any expression involving that value
>> unpredictable,
>>
>> c) going to give an arbitrary and unstable value (maybe with a
>> different value if you read again), or
>>
>> d) going to give an arbitrary but stable value (with the same value if
>> you read again).
>>
>> Here also it remains unclear what compilers currently provide and
>> what they should provide.  The survey responses are dominated by the
>> (a) "undefined behaviour" and (d) "arbitrary but stable" options.
>>
>> It's not clear whether people are actually depending on the latter,
>> beyond the case of copying a partially initialised struct, which it
>> seems must be supported, and comparing against a partially initialised
>> struct, which it seems is done sometimes.  Many respondents mention
>> historical uses to attempt to get entropy, but that seems now widely
>> regarded as crazy.  There is a legitimate general argument that the
>> more determinacy that can be provided the better, for debugging.
>>
>> But it seems clear from the text responses that GCC, Clang, and MSVC
>> do not at present exploit the licence the ISO standard gives (in
>> defining this to be undefined behaviour) to arbitrarily miscompile
>> code, option (a).  Clang seems to be the most aggressive, propagating
>> undef in many cases, though one respondent said "LLVM is moving
>> towards treating this as UB in the cases where the standards allow it
>> to do so".  But there are special cases where LLVM is a bit stronger
>> (cf the undef docs); it's unclear why they think those are useful.
>> For GCC, one respondent said
>>
>>   "Going to give arbitrary, unstable values (that is, the variable
>>   assigned from the uninitialised variable itself acts as
>>   uninitialised and having no consistent value).  (Quite possibly
>>   subsequent transformations will have the effect of undefined
>>   behavior.)  Inconsistency of observed values is an inevitable
>>   consequence of transformations PHI (undefined, X) -> X (useful in
>>   practice for programs that don't actually use uninitialised
>>   variables, but where the compiler can't see that)."
>>
>> For MSVC, one respondent said:
>>
>>   "I am aware of a significant divergence between the LLVM community
>>   and MSVC here; in general LLVM uses "undefined behaviour" to mean
>>   "we can miscompile the program and get better benchmarks", whereas
>>   MSVC regards "undefined behaviour" as "we might have a security
>>   vulnerability so this is a compile error / build break".  First,
>>   there is reading an uninitialized variable (i.e. something which
>>   does not necessarily have a memory location); that should always be
>>   a compile error.  Period.  Second, there is reading a partially
>>   initialised struct (i.e. reading some memory whose contents are only
>>   partly defined).  That should give a compile error/warning or static
>>   analysis warning if detectable.  If not detectable it should give
>>   the actual contents of the memory (be stable).  I am strongly with
>>   the MSVC folks on this one - if the compiler can tell at compile
>>   time that anything is undefined then it should error out.  Security
>>   problems are a real problem for the whole industry and should not be
>>   included deliberately by compilers."
>>
>>
>> For each compiler we ask which of these four semantics it provides.
>>
>> It looks as if several compiler writers are saying (b), while a
>> significant number of programmers are relying on (d).
>>
>> Our "mainstream C" semantics should support both (b) and (d).
>>
>>
>>
>> ### [3/15] Can one use pointer arithmetic between separately allocated
>> C objects?
>>
>> **If you calculate an offset between two separately allocated C memory
>> objects (e.g. malloc'd regions or global or local variables) by
>> pointer subtraction, can you make a usable pointer to the second by
>> adding the offset to the address of the first?**
>>
>> Most respondents expect this to work, and a significant number know of
>> real code that relies on it, with many concrete examples.  So far we
>> see no solid reason to disallow it for "mainstream" C implementations.
>> The main reason to disallow it seems to have been segmented
>> architectures, especially 8086.  There are still some embedded
>> architectures with distinct address spaces but it seems that
>> "mainstream" C is not concerned with this, and those cases could be
>> identified as a language dialect or implementation-defined choice.
>>
>> For GCC, one respondent writes the following, but doesn't give a reason:
>>
>> - This is not safe in practice even if the alignment is sufficient
>>   (and if the alignment of the type is less than its size, obviously
>>   such a subtraction can't possibly work even with a naive compiler).
>>
>> It looks reasonable to identify two language dialects, one (our
>> "mainstream C") in which pointer arithmetic between separately
>> allocated C objects is permitted and one (principally for segmented
>> architectures) in which it is not.
>>
>>
>>
>> ### [4/15] Is pointer equality sensitive to their original allocation
>> sites?
>>
>> **For two pointers derived from the addresses of two separate
>> allocations, will equality testing (with ==) of them just compare
>> their runtime values, or might it take their original allocations into
>> account and assume that they do not alias, even if they happen to have
>> the same runtime value? (for current mainstream compilers)**
>>
>> The responses are roughly bimodal: many believe "it will just compare
>> the runtime values", while a similar number believe that the
>> comparison might take the allocation provenance into account.
>> Of the former, 41 "know of real code that relies on it".
>>
>> In practice we see that GCC does sometimes take allocation provenance
>> into account, with the result of a comparison (in an n+1 case,
>> comparing &p+1 and &q) sometimes varying depending on whether the
>> compiler can see the provenance, e.g. on whether it's done in the same
>> compilation unit as the allocation.  We don't see any reason to forbid
>> that, especially as this n+1 case seems unlikely to arise in practice,
>> though it does complicate the semantics, effectively requiring a
>> nondeterministic choice at each comparison of whether to take
>> provenance into account.  But for comparisons between pointers formed
>> by more radical pointer arithmetic from pointers originally from
>> different allocations, as in Q3, it's not so clear.
>>
>> The best "mainstream C" semantics here seems to be to make a
>> nondeterministic choice at each comparison of whether to take
>> provenance into account or just compare the runtime pointer value.  In
>> the vast majority of cases the two will coincide.
>>
>>
>>
>> ### [5/15] Can pointer values be copied indirectly?
>>
>> **Can you make a usable copy of a pointer by copying its
>> representation bytes with code that indirectly computes the identity
>> function on them, e.g. writing the pointer value to a file and then
>> reading it back, and using compression or encryption on the way?**
>>
>> This seems unequivocally "yes", both in usage and for current
>> compilers, at least in simple cases with direct data-flow from
>> original to computed pointer; both GCC and Clang support this.  But
>> for computation via control-flow, it's not so clear.
>>
>> It looks as if a reasonable "mainstream C" semantics should allow
>> indirect pointer copying at least whenever there's a data-flow
>> provenance path, perhaps not when there's only a control-flow
>> provenance path.  It should allow pointers to be marshalled to and
>> read back in, and the simplest way of doing that is to allow any
>> pointer value to be read in, with the compiler making no
>> aliasing/provenance assumptions on such, and with the semantics
>> checking that the numeric pointer value points to a suitable live object
>> only when and if it is dereferenced.
>>
>>
>>
>> ### [6/15] Pointer comparison at different types
>>
>> **Can one do == comparison between pointers to objects of different
>> types (e.g. pointers to int, float, and different struct types)?**
>>
>> The question should have been clearer about whether the pointers are
>> first cast to void* or char*.  With those casts, the responses seem
>> clear that it should be allowed, modulo now-unusual architectures with
>> segmented memory or where the pointer representations are different.
>>
>> Then there's a question, which we would hope applies only in the case
>> without those casts, about whether -fstrict-aliasing will treat
>> comparisons with type mismatches as nonequal, e.g.
>>
>> - "Depends on strict-aliasing flags? I think LLVM TBAA might optimise
>>   this sort of check away?"
>>
>> - "There are a lot of examples of this, in particular in libc, or
>>   possibly implementations of vtables."
>>
>>
>>
>>
>> ### [7/15] Pointer comparison across different allocations
>>
>> **Can one do < comparison between pointers to separately allocated
>> objects?**
>>
>> This seems to be widely used for lock ordering and collections.
>> As for Q3, there's a potential issue for segmented memory systems
>> (where the implementation might only compare the offset) which seems
>> not to be relevant for current "mainstream" C.
>>
>> Apart from that, there doesn't seem to be any reason from compiler
>> implementation to forbid it.
>>
>>
>>
>> ### [8/15] Pointer values after lifetime end
>>
>> **Can you inspect (e.g. by comparing with ==) the value of a pointer
>> to an object after the object itself has been free'd or its scope has
>> ended?**
>>
>> The responses mostly say that this will work (modulo debugging
>> environments), with various use cases, and there is no clear reason
>> why compilers do not support it.
>>
>> Can we either establish that current mainstream compilers will support
>> this or identify more specifically where and how they will fail to do
>> so?
>>
>>
>> ### [9/15] Pointer arithmetic
>>
>> **Can you (transiently) construct an out-of-bounds pointer value (e.g.
>> before the beginning of an array, or more than one-past its end) by
>> pointer arithmetic, so long as later arithmetic makes it in-bounds
>> before it is used to access memory?**
>>
>> It seems clear that this is often assumed to work. But on the other
>> hand, compilers may sometimes assume otherwise:
>>
>> - "This is not safe; compilers may optimise based on pointers being
>>   within bounds.  In some cases, it's possible such code might not
>>   even link, depending on the offsets allowed in any relocations that
>>   get used in the object files."
>>
>> - "The situation has not gotten friendlier to old-school pointer
>>   manipulations since https://lwn.net/Articles/278137/ was written in
>>   2008.  [This is a case where GCC optimised away a comparison
>>   involving an out-of-bounds pointer] The pattern could still be found
>>   in code exposed to malicious interlocutors in 2013:
>>   https://access.redhat.com/security/cve/CVE-2013-5607"
>>
>> - "Pretty sure this one I've seen buggy code optimised away by real
>>   compilers."
>>
>> Here the prevalence of transiently out-of-bounds pointer values in
>> real code suggests it's worth seriously asking the cost of disabling
>> whatever compiler optimisation is done based on this, to provide a
>> simple predictable semantics.
>>
>>
>>
>> ### [10/15] Pointer casts
>>
>> **Given two structure types that have the same initial members, can
>> you use a pointer of one type to access the initial members of a value
>> of the other?**
>>
>> It's clear that this is used very commonly.  On the other hand, with
>> strict aliasing:
>>
>> - "This is something that GCC tends to actually kill in practice (if
>>   strict aliasing is on); I've had to fix bugs that were caused by
>>   it."
>>
>> and w.r.t. GCC:
>>
>> - "This is not safe in practice (unless a union is visibly
>>   used as described in 6.5.2.3#6)."
>>
>> though the latter doesn't say why, or whether that's specific to
>> strict-aliasing.
>>
>> At least with -fno-strict-aliasing, it seems this should be guaranteed
>> to work.
>>
>>
>>
>> ### [11/15] Using unsigned char arrays
>>
>> **Can an unsigned character array be used (in the same way as a
>> malloc’d region) to hold values of other types?**
>>
>> Here again it's clear that it's very often relied on for statically
>> allocated (non-malloc'd) character arrays, and it should work, with
>> due care about alignment. But the ISO standard disallows it, and we
>> also see:
>>
>> - [wrt GCC] "No, this is not safe (if it's visible to the compiler
>>   that the memory in question has unsigned char as its declared
>>   type)."
>>
>> though the latter doesn't say why.  It is a violation of the
>> strict-aliasing text of the ISO standard.
>>
>> With -fno-strict-aliasing it seems clear that it should be allowed.
>>
>>
>>
>> ### [12/15] Null pointers from non-constant expressions
>>
>> **Can you make a null pointer by casting from an expression that isn't
>> a constant but that evaluates to 0?**
>>
>> This is very often assumed to work.  The only exception seems to be
>> some (unidentified) embedded systems.
>>
>> Our "mainstream C" semantics should permit it.
>>
>>
>>
>> ### [13/15] Null pointer representations
>>
>> **Can null pointers be assumed to be represented with 0?**
>>
>> Basically an unequivocal "yes" for mainstream systems.  Again there is
>> a potential exception for segmented memory, but not one relevant for
>> "mainstream" current practice.
>>
>>
>>
>> ### [14/15] Overlarge representation reads
>>
>> **Can one read the byte representation of a struct as aligned words
>> without regard for the fact that its extent might not include all of
>> the last word?**
>>
>> This is sometimes used in practice and believed to work, modulo
>> alignment, page-boundary alignment, and valgrind/MSan/etc.
>> Our "mainstream C" semantics could either forbid this entirely
>> (slightly limiting the scope of the semantics) or could allow it, for
>> sufficiently aligned cases, if some switch is set.
>>
>> For the moment we choose the former.
>>
>> ### [15/15] Union type punning
>>
>> **When is type punning - writing one union member and then reading it
>> as a different member, thereby reinterpreting its representation bytes
>> - guaranteed to work (without confusing the compiler analysis and
>> optimisation passes)?**
>>
>> There's widespread doubt, disagreement and confusion here, e.g.:
>>
>> - "always"
>> - "Never"
>> - "According to the standard never; in practice always."
>>
>> Here the minimal thing it seems we should support is, broadly
>> following the GCC documentation, type punning via a union whose
>> definition is in scope and which is accessed via l-values that
>> manifestly involve the union type.
>>
>> Or, in the -fno-strict-aliasing case, one could allow it everywhere.
>>
>>
>> POSTAMBLE RESPONSES
>> ===================
>>
>> ### Other differences
>>
>> ** If you know of other areas where the C used in practice differs
>> from that of the ISO standard, or where compiler optimisation limits
>> the behaviour you can rely on, please list them. **
>>
>> There were many comments here which are hard to summarise. Many
>> mention integer overflow and the behaviour of shift operators.
>>
>>
>>
>>
>> ### Acknowledgements
>>
>> We would like to thank all those who responded to the survey, those
>> who distributed it, and especially those members of the Cambridge
>> Systems Research Group who helped us tune earlier versions.  This work
>> is funded by the EPSRC REMS (Rigorous Engineering for Mainstream
>> Systems) Programme Grant, EP/K008528/1.
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>



