[LLVMdev] C as used/implemented in practice: analysis of responses

Fri Jun 26 01:42:12 PDT 2015

As part of a project to clarify what behaviour of C implementations is
actually relied upon in modern practice, and what behaviour is
guaranteed by current mainstream implementations, we recently
distributed a survey of 15 questions about C, https://goo.gl/AZXH3S.

We were asking what C is in current mainstream practice: the behaviour
that programmers assume they can rely on, the behaviour provided by
mainstream compilers, and the idioms used in existing code, especially
systems code.  We were *not* asking what the ISO C standard permits,
which is often more restrictive, or about obsolete or obscure hardware
or compilers.  We focussed on the behaviour of memory and pointers.

We've had around 300 responses, including many compiler and OS
developers, and the results are summarised below, or on the web at
http://www.cl.cam.ac.uk/~pes20/cerberus (which also has more details).
For many questions the outcome seems clear, but for some, especially
1, 2, 9, 10, and 11, major open questions about current compiler
behaviour remain; we'd greatly appreciate informed comments on those
from the relevant compiler developers (or other experts).

If you can answer these, please reply either below or by mailing the
Cerberus mailing list:

  cl-cerberus at lists.cam.ac.uk

  https://lists.cam.ac.uk/mailman/listinfo/cl-cerberus

many thanks,
Kayvan Memarian and Peter Sewell  (University of Cambridge)

What is C in practice? (Cerberus survey):  Conclusions
======================================================================

##           Kayvan Memarian and Peter Sewell
###                University of Cambridge, 2015-06-21

----------------------------------------------------------------------

In April-June 2015 we distributed a web survey to investigate what C
is, in current mainstream practice: the behaviour that programmers
assume they can rely on, the behaviour provided by mainstream
compilers, and the idioms used in existing code, especially systems
code.  We were *not* asking what the ISO C standard permits, which is
often more restrictive, or about obsolete or obscure hardware or
compilers.  We focussed on the behaviour of memory and pointers.
This is a step towards an unambiguous and mathematically precise definition
of the consensus *de facto* standard (to the extent that such
exists): the C that is actually used by systems programmers and
implemented by mainstream compilers.

This note summarises our conclusions. For many questions the outcome
seems clear, but for some, especially 1, 2, 9, 10, and 11, major open
questions about current compiler behaviour remain; we'd greatly
appreciate further comments from the relevant compiler developers or
other experts.

More details of the results are available from
[http://www.cl.cam.ac.uk/~pes20/cerberus](http://www.cl.cam.ac.uk/~pes20/cerberus).

### Responses

Aiming for a modest-scale but technically expert audience, we
distributed the survey at the University of Cambridge systems research
group, at EuroLLVM 2015, via John Regehr's blog, and via various
mailing lists: gcc, llvmdev, cfe-dev, libc-alpha, xorg, a FreeBSD
list, xen-devel, a Google C users list, and a Google C compilers list.
In all there were around 300 responses, including around 100 printed
pages of textual comments.  Most report expertise in C systems
programming and significant numbers report expertise in compiler
internals and in the C standard.

MAIN QUESTION RESPONSES
-----------------------

###[1/15] How predictable are reads from padding bytes?

**If you zero all bytes of a struct and then write some of its
members, do reads of the padding return zero?  (e.g. for a bytewise
CAS or hash of the struct, or to know that no security-relevant data
has leaked into them.)**

It remains unclear what behaviour compilers currently provide (or
should provide) for this.  We see four main alternatives:

a) Structure copies might copy padding, but structure member writes
never touch padding.

b) Structure member writes might write zeros over subsequent padding.

c) Structure member writes might write arbitrary values over subsequent
padding, with reads seeing stable results.

d) Padding bytes are regarded as always holding unspecified values,
irrespective of any byte writes to them, and so reads of them might
return arbitrary and unstable values (in which case the compiler could
arbitrarily write to padding at any point, as that would be masked by
this).

On the one hand:

- In some circumstances it seems important to provide systems
   programmers with a mechanism to ensure that no information is
   leaked via padding. Rewriting structure definitions to make all
   padding into explicit fields may not be practicable, especially if
   one wants to do so in a platform-independent way, and so option (d)
   is not compatible with this.  Option (c) makes it possible but
   awkward to prevent leakage, as there padding must be re-zero'd
   after member writes.

- In some circumstances programmers may rely on predictable padding
   values, at least in the absence of structure member writes,
   e.g. for memcmp, hashing, or compare-and-swap of struct values.
   Again (d) is not compatible with this, and (a) or (b) are
   preferable.  But it's not clear whether any of those usages are
   common or essential.

- For Clang, one respondent suggests that by the time the optimisation
   passes operate, padding has been replaced by explicit fields, so
   neither over-wide writes or permanently-undefined-value behaviour
   will occur.

- For MSVC, one respondent suggests the compiler provides (a).

On the other hand, others suggest plausible optimisations, e.g.,
"to apply SRA (scalar replacement of aggregates), replacing the
memset with a sequence of member assignments (discarding assignments
to padding) in order to do so".  This could require (d) to make the
existing compiler behaviour admissible, but it's unclear
to us whether it actually does at present.

Note also that for structs stored to malloc'd regions, option (d) is
at odds with the idea that malloc'd regions can be reused, as a write
of a new value (of a new type) to a malloc'd region might start with
writes to what were padding bytes, so perhaps we could only really
have this semantics for the other storage-duration kinds.

Conclusion:

For each compiler (GCC, Clang, MSVC, ICC, ...), we ask which of these
it provides on mainstream platforms?  If the answer is not (a), to
what extent is it feasible to provide compiler flags that force the
behaviour to be stronger?

Our "mainstream C" semantics should provide switchable options for
each of (a), (b), (c), and (d).

### [2/15] Uninitialised values

**Is reading an uninitialised variable or struct member (with a
current mainstream compiler):**

**(This might either be due to a bug or be intentional, e.g. when
copying a partially initialised struct, or to output, hash, or set
some bits of a value that may have been partially initialised.)**

a) undefined behaviour (meaning that the compiler is free to
arbitrarily miscompile the program, with or without a warning),

b) going to make the result of any expression involving that value
unpredictable,

c) going to give an arbitrary and unstable value (maybe with a
different value if you read again), or

d) going to give an arbitrary but stable value (with the same value if
you read again).

Here also it remains unclear what compilers current provide and what
it should provide.  The survey responses are dominated by the
(a) "undefined behaviour" and (d) "arbitrary but stable" options.

It's not clear whether people are actually depending on the latter,
beyond the case of copying a partially initialised struct, which it
seems must be supported, and comparing against a partially initialised
struct, which it seems is done sometimes.  Many respondents mention
historical uses to attempt to get entropy, but that seems now widely
regarded as crazy.  There is a legitimate general argument that the
more determinacy that can be provided the better, for debugging.

But it seems clear from the text responses that GCC, Clang, and MSVC
do not at present exploit the licence the ISO standard gives (in
defining this to be undefined behaviour) to arbitrarily miscompile
code, option (a).  Clang seems to be the most aggressive, propagating
undef in many cases, though one respondent said "LLVM is moving
towards treating this as UB in the cases where the standards allow it
to do so".  But there are special cases where LLVM is a bit stronger
(cf the undef docs); it's unclear why they think those are useful.
For GCC, one respondent said

  "Going to give arbitrary, unstable values (that is, the variable
  assigned from the uninitialised variable itself acts as
  uninitialised and having no consistent value).  (Quite possibly
  subsequent transformations will have the effect of undefined
  behavior.)  Inconsistency of observed values is an inevitable
  consequence of transformations PHI (undefined, X) -> X (useful in
  practice for programs that don't actually use uninitialised
  variables, but where the compiler can't see that)."

For MSVC, one respondent said:

  "I am aware of a significant divergence between the LLVM community
  and MSVC here; in general LLVM uses "undefined behaviour" to mean
  "we can miscompile the program and get better benchmarks", whereas
  MSVC regards "undefined behaviour" as "we might have a security
  vulnerability so this is a compile error / build break".  First,
  there is reading an uninitialized variable (i.e. something which
  does not necessarily have a memory location); that should always be
  a compile error.  Period.  Second, there is reading a partially
  initialised struct (i.e. reading some memory whose contents are only
  partly defined).  That should give a compile error/warning or static
  analysis warning if detectable.  If not detectable it should give
  the actual contents of the memory (be stable).  I am strongly with
  the MSVC folks on this one - if the compiler can tell at compile
  time that anything is undefined then it should error out.  Security
  problems are a real problem for the whole industry and should not be
  included deliberately by compilers."

For each compiler we ask which of these four semantics it provides.

It looks as if several compiler writers are saying (b), while a
significant number of programmers are relying on (d).

Our "mainstream C" semantics should support both (b) and (d).

### [3/15] Can one use pointer arithmetic between separately allocated
C objects?

**If you calculate an offset between two separately allocated C memory
objects (e.g. malloc'd regions or global or local variables) by
pointer subtraction, can you make a usable pointer to the second by
adding the offset to the address of the first?**

Most respondents expect this to work, and a significant number know of
real code that relies on it, with many concrete examples.  So far we
see no solid reason to disallow it for "mainstream" C implementations.
The main reason to disallow it seems to have been segmented
architectures, especially 8086.  There are still some embedded
architectures with distinct address spaces but it seems that
"mainstream" C is not concerned with this, and those cases could be
identified as a language dialect or implementation-defined choice.

For GCC, one respondent writes the following, but doesn't give a reason:

- This is not safe in practice even if the alignment is sufficient
  (and if the alignment of the type is less than its size, obviously
  such a subtraction can't possibly work even with a naive compiler).

It looks reasonable to identify two language dialects, one (our
"mainstream C") in which pointer arithmetic between separately
allocated C objects is permitted and one (principally for segmented
architectures) in which it is not.

### [4/15] Is pointer equality sensitive to their original allocation sites?

**For two pointers derived from the addresses of two separate
allocations, will equality testing (with ==) of them just compare
their runtime values, or might it take their original allocations into
account and assume that they do not alias, even if they happen to have
the same runtime value? (for current mainstream compilers)**

The responses are roughly bimodal: many believe "it will just compare
the runtime values", while a similar number believe that the
comparison might take the allocation provenance into account.
Of the former, 41 "know of real code that relies on it".

In practice we see that GCC does sometimes take allocation provenance
into account, with the result of a comparison (in an n+1 case,
comparing &p+1 and &q) sometimes varying depending on whether the
compiler can see the provenance, e.g. on whether it's done in the same
compilation unit as the allocation.  We don't see any reason to forbid
that, especially as this n+1 case seems unlikely to arise in practice,
though it does complicate the semantics, effectively requiring a
nondeterministic choice at each comparison of whether to take
provenance into account.  But for comparisons between pointers formed
by more radical pointer arithmetic from pointers originally from
different allocations, as in Q3, it's not so clear.

The best "mainstream C" semantics here seems to be to make a
nondeterministic choice at each comparison of whether to take
provenance into account or just compare the runtime pointer value.  In
the vast majority of cases the two will coincide.

### [5/15] Can pointer values be copied indirectly?

**Can you make a usable copy of a pointer by copying its
representation bytes with code that indirectly computes the identity
function on them, e.g. writing the pointer value to a file and then
reading it back, and using compression or encryption on the way?**

This seems unequivocably "yes", both in usage and for current
compilers, at least in simple cases, with direct data-flow from
original to computed pointer, both GCC and Clang support this.  But
for computation via control-flow, it's not so clear.

It looks as if a reasonable "mainstream C" semantics should allow
indirect pointer copying at least whenever there's a data-flow
provenance path, perhaps not when there's only a control-flow
provenance path.  It should allow pointers to be marshalled to and
read back in, and the simplest way of doing that is to allow any
pointer value to be read in, with the compiler making no
aliasing/provenance assumptions on such, and with the semantics
checking the numeric pointer values points to a suitable live object
only when and if it is dereferenced.

###[6/15] Pointer comparison at different types

**Can one do == comparison between pointers to objects of different
types (e.g. pointers to int, float, and different struct types)?**

The question should have been clearer about whether the pointers are
first cast to void* or char*.  With those casts, the responses seem
clear that it should be allowed, modulo now-unusual architectures with
segmented memory or where the pointer representations are different.

Then there's a question, which we would hope applies only in the case
without those casts, about whether -fstrict-aliasing will treat
comparisons with type mismatches as nonequal, e.g.

- "Depends on strict-aliasing flags? I think LLVM TBAA might optimise
  this sort of check away?"

- "There are a lot of examples of this, in particular in libc, or
  possibly implementations of vtables."

### [7/15] Pointer comparison across different allocations

**Can one do < comparison between pointers to separately allocated objects?**

This seems to be widely used for lock ordering and collections.
As for Q3, there's a potential issue for segmented memory systems
(where the implementation might only compare the offset) which seems
not to be relevant for current "mainstream" C.

Apart from that, there doesn't seem to be any reason from compiler
implementation to forbid it.

###[8/15] Pointer values after lifetime end

**Can you inspect (e.g. by comparing with ==) the value of a pointer
to an object after the object itself has been free'd or its scope has
ended?**

The responses mostly say that this will work (modulo debugging
environments), with various use cases, and there is no clear reason
why compilers do not support it.

Can we either establish that current mainstream compilers will support
this or identify more specifically where and how they will fail to do
so?

### [9/15] Pointer arithmetic

**Can you (transiently) construct an out-of-bounds pointer value (e.g.
before the beginning of an array, or more than one-past its end) by
pointer arithmetic, so long as later arithmetic makes it in-bounds
before it is used to access memory?**

It seems clear that this is often assumed to work. But on the other
hand, compilers may sometimes assume otherwise:

- "This is not safe; compilers may optimise based on pointers being
  within bounds.  In some cases, it's possible such code might not
  even link, depending on the offsets allowed in any relocations that
  get used in the object files."

- "The situation has not gotten friendlier to old-school pointer
  manipulations since https://lwn.net/Articles/278137/ was written in
  2008.  [This is a case where GCC optimised away a comparison
  involving an out-of-bounds pointer] The pattern could still be found
  in code exposed to malicious interlocutors in 2013:
  https://access.redhat.com/security/cve/CVE-2013-5607"

- "Pretty sure this one I've seen buggy code optimised away by real
  compilers."

Here the prevalence of transiently out-of-bounds pointer values in
real code suggests it's worth seriously asking the cost of disabling
whatever compiler optimisation is done based on this, to provide a
simple predictable semantics.

### [10/15] Pointer casts

**Given two structure types that have the same initial members, can
you use a pointer of one type to access the intial members of a value
of the other?**

It's clear that this is used very commonly.  On the other hand, with
strict aliasing:

- "This is something that GCC tends to actually kill in practice (if
  strict aliasing is on); I've had to fix bugs that were caused by
  it."

and w.r.t. GCC:

- "This is not safe in practice (unless a union is visibly
  used as described in 6.5.2.3#6)."

though the latter doesn't say why, or whether that's specific to
strict-aliasing.

At least with -no-strict-aliasing, it seems this should be guaranteed
to work.

### [11/15] Using unsigned char arrays

**Can an unsigned character array be used (in the same way as a
malloc’d region) to hold values of other types?**

Here again it's clear that it's very often relied on for statically
allocated (non-malloc'd) character arrays, and it should work, with
due care about alignment. But the ISO standard disallows it, and we
also see:

- [wrt GCC] "No, this is not safe (if it's visible to the compiler
  that the memory in question has unsigned char as its declared
  type)."

though the latter doesn't say why.   It is a violation of the
strict-aliasing text of the ISO standard.

With -no-strict-aliasing it seems clear it that it should be allowed.

### [12/15] Null pointers from non-constant expressions

**Can you make a null pointer by casting from an expression that isn't
a constant but that evaluates to 0?**

This is very often assumed to work.  The only exception seems to be
some (unidentified) embedded systems.

Our "mainstream C" semantics should permit it.

### [13/15] Null pointer representations

**Can null pointers be assumed to be represented with 0?**

Basically an unequivocal "yes" for mainstream systems.  Again there a
potential exception for segmented memory, but not one relevant for
"mainstream" current practice.

### [14/15] Overlarge representation reads

**Can one read the byte representation of a struct as aligned words
without regard for the fact that its extent might not include all of
the last word?**

This is sometimes used in practice and believed to work, modulo
alignment, page-boundary alignment, and valgrind/MSan/etc.
Our "mainstream C" semantics could either forbid this entirely
(slightly limiting the scope of the semantics) or could allow it, for
sufficiently aligned cases, if some switch is set.

For the moment we choose the former.

### [15/15] Union type punning

**When is type punning - writing one union member and then reading it
as a different member, thereby reinterpreting its representation bytes
- guaranteed to work (without confusing the compiler analysis and
optimisation passes)?**

There's widespread doubt, disagreement and confusion here, e.g.:

- "always"
- "Never"
- "According to the standard never; in practice always."

Here the minimal thing it seems we should support is, broadly
following the GCC documentation, type punning via a union whose
definition is in scope and which is accessed via l-values that
manifestly involve the union type.

Or, in the -no-strict-aliasing case, one could allow it everywhere.

POSTAMBLE RESPONSES
===================

### Other differences

** If you know of other areas where the C used in practice differs
from that of the ISO standard, or where compiler optimisation limits
the behaviour you can rely on, please list them. **

There were many comments here which are hard to summarise. Many
mention integer overflow and the behaviour of shift operators.

### Acknowledgements

We would like to thank all those who responded to the survey, those
who distributed it, and especially those members of the Cambridge
Systems Research Group who helped us tune earlier versions.  This work
is funded by the EPSRC REMS (Rigorous Engineering for Mainstream
Systems) Programme Grant, EP/K008528/1.