[LLVMdev] C as used/implemented in practice: analysis of responses

Fri Jun 26 14:53:30 PDT 2015

All of these seem to fall into the pattern of "The compiler is required to
do what you expect, as long as it can't prove X about your program". That
is, the only reasonable compilation in the absence of inferring some extra
piece of information about your program, is the one you expect. For
example, the only way to codegen a comparison between two random pointers
has the meaning you expect (on common computer architectures); but if the
compiler can figure something out that tells it that comparing those two
pointers is undefined by the language standard, then, well, technically it
can do whatever it wants.

Many people interpret this as the compiler being somewhat malevolent, but
there's another interpretation in some cases.

I have not looked in depth at the history in all the undefined behaviors
mentioned in the survey, but some of the undefined behaviors are there
because at some point in time the underlying system diversity made it
difficult or impossible to assign a meaning. So long as the diversity that
led to the desire to leave something undefined still exists, programs that
use those constructs with certain expectations *will* fail to behave as
"expected" on those targets (on a system where pointers are represented
differently, your program *may* actually format your hard disk if you do
so-and-so!).

To put it another way, what is "expected" is actually dependent on the C
programmer's knowledge of the underlying system (computer architecture,
system architecture, etc.), and there will always be tension so long as the
programmer is not thinking about what the C language guarantees, but rather
(roughly speaking) how *they* would translate their code to assembly
language for the system or systems that they happen to know they're
targeting. An x86 programmer doesn't expect unaligned loads to invoke nasal
demons, but a SPARC programmer does.

So if you unravel the thread of logic back through the undefined behaviors
made undefined for this reason, many of these cases of exploiting undefined
behavior are really an extension, on the compiler's part, of the logic
"there are some systems for which your code would invoke nasal demons, so I
might as well assume that it will invoke nasal demons on this system (since
the language standard doesn't say anything about specific systems)". Or to
put it another way, the compiler is effectively assuming that your code is
written to target all the systems taken into account by the C standard, and
if it would invoke nasal demons on any one of them then the compiler is
allowed to invoke nasal demons on all of them.

This is obviously sort of a twisted logic, and I think that a lot of the
"malevolence" attributed to compilers is due to this. It certainly removes
many target-dependent checks from the mid-level optimizer though.

-- Sean Silva

On Fri, Jun 26, 2015 at 1:42 AM, Peter Sewell <Peter.Sewell at cl.cam.ac.uk>
wrote:

> As part of a project to clarify what behaviour of C implementations is
> actually relied upon in modern practice, and what behaviour is
> guaranteed by current mainstream implementations, we recently
> distributed a survey of 15 questions about C, https://goo.gl/AZXH3S.
>
> We were asking what C is in current mainstream practice: the behaviour
> that programmers assume they can rely on, the behaviour provided by
> mainstream compilers, and the idioms used in existing code, especially
> systems code.  We were *not* asking what the ISO C standard permits,
> which is often more restrictive, or about obsolete or obscure hardware
> or compilers.  We focussed on the behaviour of memory and pointers.
>
> We've had around 300 responses, including many compiler and OS
> developers, and the results are summarised below, or on the web at
> http://www.cl.cam.ac.uk/~pes20/cerberus (which also has more details).
> For many questions the outcome seems clear, but for some, especially
> 1, 2, 9, 10, and 11, major open questions about current compiler
> behaviour remain; we'd greatly appreciate informed comments on those
> from the relevant compiler developers (or other experts).
>
> If you can answer these, please reply either below or by mailing the
> Cerberus mailing list:
>
>   cl-cerberus at lists.cam.ac.uk
>
>   https://lists.cam.ac.uk/mailman/listinfo/cl-cerberus
>
> many thanks,
> Kayvan Memarian and Peter Sewell  (University of Cambridge)
>
>
> What is C in practice? (Cerberus survey):  Conclusions
> ======================================================================
>
> ##           Kayvan Memarian and Peter Sewell
> ###                University of Cambridge, 2015-06-21
>
> ----------------------------------------------------------------------
>
> In April-June 2015 we distributed a web survey to investigate what C
> is, in current mainstream practice: the behaviour that programmers
> assume they can rely on, the behaviour provided by mainstream
> compilers, and the idioms used in existing code, especially systems
> code.  We were *not* asking what the ISO C standard permits, which is
> often more restrictive, or about obsolete or obscure hardware or
> compilers.  We focussed on the behaviour of memory and pointers.
> This is a step towards an unambiguous and mathematically precise definition
> of the consensus *de facto* standard (to the extent that such
> exists): the C that is actually used by systems programmers and
> implemented by mainstream compilers.
>
> This note summarises our conclusions. For many questions the outcome
> seems clear, but for some, especially 1, 2, 9, 10, and 11, major open
> questions about current compiler behaviour remain; we'd greatly
> appreciate further comments from the relevant compiler developers or
> other experts.
>
> More details of the results are available from
> [
> http://www.cl.cam.ac.uk/~pes20/cerberus](http://www.cl.cam.ac.uk/~pes20/cerberus)
> .
>
> ### Responses
>
> Aiming for a modest-scale but technically expert audience, we
> distributed the survey at the University of Cambridge systems research
> group, at EuroLLVM 2015, via John Regehr's blog, and via various
> mailing lists: gcc, llvmdev, cfe-dev, libc-alpha, xorg, a FreeBSD
> list, xen-devel, a Google C users list, and a Google C compilers list.
> In all there were around 300 responses, including around 100 printed
> pages of textual comments.  Most report expertise in C systems
> programming and significant numbers report expertise in compiler
> internals and in the C standard.
>
>
> MAIN QUESTION RESPONSES
> -----------------------
>
> ###[1/15] How predictable are reads from padding bytes?
>
> **If you zero all bytes of a struct and then write some of its
> members, do reads of the padding return zero?  (e.g. for a bytewise
> CAS or hash of the struct, or to know that no security-relevant data
> has leaked into them.)**
>
> It remains unclear what behaviour compilers currently provide (or
> should provide) for this.  We see four main alternatives:
>
> a) Structure copies might copy padding, but structure member writes
> never touch padding.
>
> b) Structure member writes might write zeros over subsequent padding.
>
> c) Structure member writes might write arbitrary values over subsequent
> padding, with reads seeing stable results.
>
> d) Padding bytes are regarded as always holding unspecified values,
> irrespective of any byte writes to them, and so reads of them might
> return arbitrary and unstable values (in which case the compiler could
> arbitrarily write to padding at any point, as that would be masked by
> this).
>
> On the one hand:
>
> - In some circumstances it seems important to provide systems
>    programmers with a mechanism to ensure that no information is
>    leaked via padding. Rewriting structure definitions to make all
>    padding into explicit fields may not be practicable, especially if
>    one wants to do so in a platform-independent way, and so option (d)
>    is not compatible with this.  Option (c) makes it possible but
>    awkward to prevent leakage, as there padding must be re-zero'd
>    after member writes.
>
> - In some circumstances programmers may rely on predictable padding
>    values, at least in the absence of structure member writes,
>    e.g. for memcmp, hashing, or compare-and-swap of struct values.
>    Again (d) is not compatible with this, and (a) or (b) are
>    preferable.  But it's not clear whether any of those usages are
>    common or essential.
>
> - For Clang, one respondent suggests that by the time the optimisation
>    passes operate, padding has been replaced by explicit fields, so
>    neither over-wide writes or permanently-undefined-value behaviour
>    will occur.
>
> - For MSVC, one respondent suggests the compiler provides (a).
>
> On the other hand, others suggest plausible optimisations, e.g.,
> "to apply SRA (scalar replacement of aggregates), replacing the
> memset with a sequence of member assignments (discarding assignments
> to padding) in order to do so".  This could require (d) to make the
> existing compiler behaviour admissible, but it's unclear
> to us whether it actually does at present.
>
> Note also that for structs stored to malloc'd regions, option (d) is
> at odds with the idea that malloc'd regions can be reused, as a write
> of a new value (of a new type) to a malloc'd region might start with
> writes to what were padding bytes, so perhaps we could only really
> have this semantics for the other storage-duration kinds.
>
> Conclusion:
>
> For each compiler (GCC, Clang, MSVC, ICC, ...), we ask which of these
> it provides on mainstream platforms?  If the answer is not (a), to
> what extent is it feasible to provide compiler flags that force the
> behaviour to be stronger?
>
> Our "mainstream C" semantics should provide switchable options for
> each of (a), (b), (c), and (d).
>
>
>
> ### [2/15] Uninitialised values
>
> **Is reading an uninitialised variable or struct member (with a
> current mainstream compiler):**
>
> **(This might either be due to a bug or be intentional, e.g. when
> copying a partially initialised struct, or to output, hash, or set
> some bits of a value that may have been partially initialised.)**
>
> a) undefined behaviour (meaning that the compiler is free to
> arbitrarily miscompile the program, with or without a warning),
>
> b) going to make the result of any expression involving that value
> unpredictable,
>
> c) going to give an arbitrary and unstable value (maybe with a
> different value if you read again), or
>
> d) going to give an arbitrary but stable value (with the same value if
> you read again).
>
> Here also it remains unclear what compilers current provide and what
> it should provide.  The survey responses are dominated by the
> (a) "undefined behaviour" and (d) "arbitrary but stable" options.
>
> It's not clear whether people are actually depending on the latter,
> beyond the case of copying a partially initialised struct, which it
> seems must be supported, and comparing against a partially initialised
> struct, which it seems is done sometimes.  Many respondents mention
> historical uses to attempt to get entropy, but that seems now widely
> regarded as crazy.  There is a legitimate general argument that the
> more determinacy that can be provided the better, for debugging.
>
> But it seems clear from the text responses that GCC, Clang, and MSVC
> do not at present exploit the licence the ISO standard gives (in
> defining this to be undefined behaviour) to arbitrarily miscompile
> code, option (a).  Clang seems to be the most aggressive, propagating
> undef in many cases, though one respondent said "LLVM is moving
> towards treating this as UB in the cases where the standards allow it
> to do so".  But there are special cases where LLVM is a bit stronger
> (cf the undef docs); it's unclear why they think those are useful.
> For GCC, one respondent said
>
>   "Going to give arbitrary, unstable values (that is, the variable
>   assigned from the uninitialised variable itself acts as
>   uninitialised and having no consistent value).  (Quite possibly
>   subsequent transformations will have the effect of undefined
>   behavior.)  Inconsistency of observed values is an inevitable
>   consequence of transformations PHI (undefined, X) -> X (useful in
>   practice for programs that don't actually use uninitialised
>   variables, but where the compiler can't see that)."
>
> For MSVC, one respondent said:
>
>   "I am aware of a significant divergence between the LLVM community
>   and MSVC here; in general LLVM uses "undefined behaviour" to mean
>   "we can miscompile the program and get better benchmarks", whereas
>   MSVC regards "undefined behaviour" as "we might have a security
>   vulnerability so this is a compile error / build break".  First,
>   there is reading an uninitialized variable (i.e. something which
>   does not necessarily have a memory location); that should always be
>   a compile error.  Period.  Second, there is reading a partially
>   initialised struct (i.e. reading some memory whose contents are only
>   partly defined).  That should give a compile error/warning or static
>   analysis warning if detectable.  If not detectable it should give
>   the actual contents of the memory (be stable).  I am strongly with
>   the MSVC folks on this one - if the compiler can tell at compile
>   time that anything is undefined then it should error out.  Security
>   problems are a real problem for the whole industry and should not be
>   included deliberately by compilers."
>
>
> For each compiler we ask which of these four semantics it provides.
>
> It looks as if several compiler writers are saying (b), while a
> significant number of programmers are relying on (d).
>
> Our "mainstream C" semantics should support both (b) and (d).
>
>
>
> ### [3/15] Can one use pointer arithmetic between separately allocated
> C objects?
>
> **If you calculate an offset between two separately allocated C memory
> objects (e.g. malloc'd regions or global or local variables) by
> pointer subtraction, can you make a usable pointer to the second by
> adding the offset to the address of the first?**
>
> Most respondents expect this to work, and a significant number know of
> real code that relies on it, with many concrete examples.  So far we
> see no solid reason to disallow it for "mainstream" C implementations.
> The main reason to disallow it seems to have been segmented
> architectures, especially 8086.  There are still some embedded
> architectures with distinct address spaces but it seems that
> "mainstream" C is not concerned with this, and those cases could be
> identified as a language dialect or implementation-defined choice.
>
> For GCC, one respondent writes the following, but doesn't give a reason:
>
> - This is not safe in practice even if the alignment is sufficient
>   (and if the alignment of the type is less than its size, obviously
>   such a subtraction can't possibly work even with a naive compiler).
>
> It looks reasonable to identify two language dialects, one (our
> "mainstream C") in which pointer arithmetic between separately
> allocated C objects is permitted and one (principally for segmented
> architectures) in which it is not.
>
>
>
> ### [4/15] Is pointer equality sensitive to their original allocation
> sites?
>
> **For two pointers derived from the addresses of two separate
> allocations, will equality testing (with ==) of them just compare
> their runtime values, or might it take their original allocations into
> account and assume that they do not alias, even if they happen to have
> the same runtime value? (for current mainstream compilers)**
>
> The responses are roughly bimodal: many believe "it will just compare
> the runtime values", while a similar number believe that the
> comparison might take the allocation provenance into account.
> Of the former, 41 "know of real code that relies on it".
>
> In practice we see that GCC does sometimes take allocation provenance
> into account, with the result of a comparison (in an n+1 case,
> comparing &p+1 and &q) sometimes varying depending on whether the
> compiler can see the provenance, e.g. on whether it's done in the same
> compilation unit as the allocation.  We don't see any reason to forbid
> that, especially as this n+1 case seems unlikely to arise in practice,
> though it does complicate the semantics, effectively requiring a
> nondeterministic choice at each comparison of whether to take
> provenance into account.  But for comparisons between pointers formed
> by more radical pointer arithmetic from pointers originally from
> different allocations, as in Q3, it's not so clear.
>
> The best "mainstream C" semantics here seems to be to make a
> nondeterministic choice at each comparison of whether to take
> provenance into account or just compare the runtime pointer value.  In
> the vast majority of cases the two will coincide.
>
>
>
> ### [5/15] Can pointer values be copied indirectly?
>
> **Can you make a usable copy of a pointer by copying its
> representation bytes with code that indirectly computes the identity
> function on them, e.g. writing the pointer value to a file and then
> reading it back, and using compression or encryption on the way?**
>
> This seems unequivocably "yes", both in usage and for current
> compilers, at least in simple cases, with direct data-flow from
> original to computed pointer, both GCC and Clang support this.  But
> for computation via control-flow, it's not so clear.
>
> It looks as if a reasonable "mainstream C" semantics should allow
> indirect pointer copying at least whenever there's a data-flow
> provenance path, perhaps not when there's only a control-flow
> provenance path.  It should allow pointers to be marshalled to and
> read back in, and the simplest way of doing that is to allow any
> pointer value to be read in, with the compiler making no
> aliasing/provenance assumptions on such, and with the semantics
> checking the numeric pointer values points to a suitable live object
> only when and if it is dereferenced.
>
>
>
> ###[6/15] Pointer comparison at different types
>
> **Can one do == comparison between pointers to objects of different
> types (e.g. pointers to int, float, and different struct types)?**
>
> The question should have been clearer about whether the pointers are
> first cast to void* or char*.  With those casts, the responses seem
> clear that it should be allowed, modulo now-unusual architectures with
> segmented memory or where the pointer representations are different.
>
> Then there's a question, which we would hope applies only in the case
> without those casts, about whether -fstrict-aliasing will treat
> comparisons with type mismatches as nonequal, e.g.
>
> - "Depends on strict-aliasing flags? I think LLVM TBAA might optimise
>   this sort of check away?"
>
> - "There are a lot of examples of this, in particular in libc, or
>   possibly implementations of vtables."
>
>
>
>
> ### [7/15] Pointer comparison across different allocations
>
> **Can one do < comparison between pointers to separately allocated
> objects?**
>
> This seems to be widely used for lock ordering and collections.
> As for Q3, there's a potential issue for segmented memory systems
> (where the implementation might only compare the offset) which seems
> not to be relevant for current "mainstream" C.
>
> Apart from that, there doesn't seem to be any reason from compiler
> implementation to forbid it.
>
>
>
> ###[8/15] Pointer values after lifetime end
>
> **Can you inspect (e.g. by comparing with ==) the value of a pointer
> to an object after the object itself has been free'd or its scope has
> ended?**
>
> The responses mostly say that this will work (modulo debugging
> environments), with various use cases, and there is no clear reason
> why compilers do not support it.
>
> Can we either establish that current mainstream compilers will support
> this or identify more specifically where and how they will fail to do
> so?
>
>
> ### [9/15] Pointer arithmetic
>
> **Can you (transiently) construct an out-of-bounds pointer value (e.g.
> before the beginning of an array, or more than one-past its end) by
> pointer arithmetic, so long as later arithmetic makes it in-bounds
> before it is used to access memory?**
>
> It seems clear that this is often assumed to work. But on the other
> hand, compilers may sometimes assume otherwise:
>
> - "This is not safe; compilers may optimise based on pointers being
>   within bounds.  In some cases, it's possible such code might not
>   even link, depending on the offsets allowed in any relocations that
>   get used in the object files."
>
> - "The situation has not gotten friendlier to old-school pointer
>   manipulations since https://lwn.net/Articles/278137/ was written in
>   2008.  [This is a case where GCC optimised away a comparison
>   involving an out-of-bounds pointer] The pattern could still be found
>   in code exposed to malicious interlocutors in 2013:
>   https://access.redhat.com/security/cve/CVE-2013-5607"
>
> - "Pretty sure this one I've seen buggy code optimised away by real
>   compilers."
>
> Here the prevalence of transiently out-of-bounds pointer values in
> real code suggests it's worth seriously asking the cost of disabling
> whatever compiler optimisation is done based on this, to provide a
> simple predictable semantics.
>
>
>
> ### [10/15] Pointer casts
>
> **Given two structure types that have the same initial members, can
> you use a pointer of one type to access the intial members of a value
> of the other?**
>
> It's clear that this is used very commonly.  On the other hand, with
> strict aliasing:
>
> - "This is something that GCC tends to actually kill in practice (if
>   strict aliasing is on); I've had to fix bugs that were caused by
>   it."
>
> and w.r.t. GCC:
>
> - "This is not safe in practice (unless a union is visibly
>   used as described in 6.5.2.3#6)."
>
> though the latter doesn't say why, or whether that's specific to
> strict-aliasing.
>
> At least with -no-strict-aliasing, it seems this should be guaranteed
> to work.
>
>
>
> ### [11/15] Using unsigned char arrays
>
> **Can an unsigned character array be used (in the same way as a
> malloc’d region) to hold values of other types?**
>
> Here again it's clear that it's very often relied on for statically
> allocated (non-malloc'd) character arrays, and it should work, with
> due care about alignment. But the ISO standard disallows it, and we
> also see:
>
> - [wrt GCC] "No, this is not safe (if it's visible to the compiler
>   that the memory in question has unsigned char as its declared
>   type)."
>
> though the latter doesn't say why.   It is a violation of the
> strict-aliasing text of the ISO standard.
>
> With -no-strict-aliasing it seems clear it that it should be allowed.
>
>
>
> ### [12/15] Null pointers from non-constant expressions
>
> **Can you make a null pointer by casting from an expression that isn't
> a constant but that evaluates to 0?**
>
> This is very often assumed to work.  The only exception seems to be
> some (unidentified) embedded systems.
>
> Our "mainstream C" semantics should permit it.
>
>
>
> ### [13/15] Null pointer representations
>
> **Can null pointers be assumed to be represented with 0?**
>
> Basically an unequivocal "yes" for mainstream systems.  Again there a
> potential exception for segmented memory, but not one relevant for
> "mainstream" current practice.
>
>
>
> ### [14/15] Overlarge representation reads
>
> **Can one read the byte representation of a struct as aligned words
> without regard for the fact that its extent might not include all of
> the last word?**
>
> This is sometimes used in practice and believed to work, modulo
> alignment, page-boundary alignment, and valgrind/MSan/etc.
> Our "mainstream C" semantics could either forbid this entirely
> (slightly limiting the scope of the semantics) or could allow it, for
> sufficiently aligned cases, if some switch is set.
>
> For the moment we choose the former.
>
> ### [15/15] Union type punning
>
> **When is type punning - writing one union member and then reading it
> as a different member, thereby reinterpreting its representation bytes
> - guaranteed to work (without confusing the compiler analysis and
> optimisation passes)?**
>
> There's widespread doubt, disagreement and confusion here, e.g.:
>
> - "always"
> - "Never"
> - "According to the standard never; in practice always."
>
> Here the minimal thing it seems we should support is, broadly
> following the GCC documentation, type punning via a union whose
> definition is in scope and which is accessed via l-values that
> manifestly involve the union type.
>
> Or, in the -no-strict-aliasing case, one could allow it everywhere.
>
>
> POSTAMBLE RESPONSES
> ===================
>
> ### Other differences
>
> ** If you know of other areas where the C used in practice differs
> from that of the ISO standard, or where compiler optimisation limits
> the behaviour you can rely on, please list them. **
>
> There were many comments here which are hard to summarise. Many
> mention integer overflow and the behaviour of shift operators.
>
>
>
>
> ### Acknowledgements
>
> We would like to thank all those who responded to the survey, those
> who distributed it, and especially those members of the Cambridge
> Systems Research Group who helped us tune earlier versions.  This work
> is funded by the EPSRC REMS (Rigorous Engineering for Mainstream
> Systems) Programme Grant, EP/K008528/1.
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150626/5dd055d5/attachment.html>