[cfe-dev] Language extension: attribute((bounds))/attribute((bounded))

Wed Sep 23 19:46:46 PDT 2009

Hi Edwin,

I find this proposal interesting!  I have a few comments inline, but I  
also wanted to draw attention to Microsoft's "Standard Annotation  
Language" (or SAL), which were originally derived to extend buffer  
overflow checking by the compiler (and other tools):

http://msdn.microsoft.com/en-us/library/ms235402(VS.80).aspx

While I'm not suggesting that we definitely implement SAL in Clang, I  
think it is worth at least considering some of the ideas there, as the  
encompass not only the annotation you propose but a more general set  
of annotations that can be used for security checking.  Moreover,  
there is some value having a common set of buffer overflow annotations  
that work across various platforms and compilers.

Comments inline.

On Sep 23, 2009, at 11:59 AM, Török Edwin wrote:

> Hi,
>
> I started working on adding a new attribute to Clang, to annotate
> pointers with bounds information,
> before going further I'd like to hear your opinion whether this is
> acceptable for clang (or if we can
> modify this proposal for it to be acceptable).
>
> Given a pointer field in a struct, or a pointer parameter, the  
> proposed
> __attribute__ would tell
> the compiler which field/parameter holds the bounds of the buffer  
> pointer.

Is this extent in bytes, or in the number of items?

>
> This is not a completely new attribute, in fact OpenBSD has had
> something similar for GCC,
> but only for function parameters [1].
>
> Some possible use cases for this attribute:
> - emit a warning when a buffer pointer is passed to a function if the
> length argument is larger than the
> actual length of the buffer
> - use this attribute in the clang static analyzer to do bounds-checks
> - use this attribute to do static analysis in LLVM
> - use clang to compile a subset of C, with this attribute, where the
> bounds of all pointers are known,
> and reject programs that are not part of this subset

I agree very much that the attribute would be useful in all of these  
contexts.

>
> OpenBSD's attribute gives warnings only when the parameters are
> constant, but I'd like to take it further,
> and also warn/error when the length is stored in a struct, later read
> and passed as parameter to a function.

This kind of symbolic reasoning could be done in the clang static  
analyzer.

> For now I implemented the sema+cg part for the bounds attribute, small
> example attached at end of this mail [2]
>
> Since my attribute will do more/other than the OpenBSD attribute, to
> avoid clashes, I named my attribute 'bounds' (suggestions welcome),  
> but
> I intend to support BSD's __bounded__ attribute too!
> Also the __bounded__ attribute uses indexes, rather than names,  
> which I
> consider error prone.

These attribute names are so similar that it would inevitably create  
conflicts.  SAL uses '_ecount' and '_bcount' (element and byte count  
respectively), so you could use something like 'buffer_length' and the  
like.  It's only a few more characters and explicitly states what the  
attribute does.

> Proposed semantics:
> struct foo {
> unsigned n;
> unsigned *x __attribute__((bounds(n)));
> };
>
> This will declare that x is an array of 'n' elements, where 'n' is the
> field of the same struct.
> NULL pointers will have size 0 (and so will buffers of size 0 which  
> may
> or may not be null).

In terms of the parsing rules, would you allow 'n' to be declared  
after the variable it bounds?  e.g.:

struct foo {
unsigned *x __attribute__((bounds(n)));
unsigned n;
};

Here 'n' is declared after 'x'.

The same question applies to function parameters.

> It will be enforced at 2 places:
> - storing values: 'n' must always be accurate and reflect the current
> size of 'x' (which means store order matters), with one exception: if
> the 'struct foo' variable, or pointer to it is only visible in current
> function, and doesn't escape, it is allowed to store in any order.

I understand the motivation behind these restrictions, but my initial  
reaction is that they would be overly onerous for many contexts.  For  
example, consider:

void bar(struct foo *p) {
  p->x = 0;
  p->n = 0;
}

Wouldn't this be deemed illegal since 'p->x' can potentially be  
referenced elsewhere?  Is the concern mainly that 'p->x' might be  
referenced in another thread?

> - dereferencing x: the index must always be in the range [0, n)
> - check that 'n' really is the size of the allocation
> - if the size is unknown, it will warn/emit error too

Can you be a little more specific by what you mean by "unknown"?

It seems to me that we need to distinguish between runtime and static  
checking.  With static/compiler analysis, there may be many cases  
where we cannot prove the index is within the range [0, n), but  
obviously can prove that it is within this range at runtime.

> Note that these checks don't have to be done in clang necesarely, they
> can be done in LLVM (in fact I'd prefer, it is easier
> to do dataflow analysis there).

The analysis will certainly be easier at the LLVM IR level, although  
reporting the diagnostics in an intelligible manner (for static  
checking) will probably lend itself more to doing it in Clang.  This  
is a common tradeoff.

>
> This attribute should work for pointers to constant size arrays,  VLAs
> [3], pointers coming from malloc (and malloc-like wrappers).

Could you provide an example of using malloc() with these attributes?   
I'm assuming something like:

   p->x = malloc(n);
   p->n = n;

>
> For function parameters the semantics is as described in [1], except
> checks will be done for non-constant sized buffers too.
>
> The emitted LLVM code will use metadata to capture this information,  
> see
> [2].
>
> [1]
> http://www.openbsd.org/cgi-bin/man.cgi?query=gcc-local&apropos=0&sektion=0&manpath=OpenBSD+Current&arch=i386&format=html

Food for thought.  Suppose I have 'struct foo' as defined above, and  
do the following:

void baz(unsigned **x_ref);

void test() {
   struct foo s;
   s.x = malloc(10);
   s.n = 10;
   baz(&s.x);
}

Should this be flagged with a warning by the compiler, as the  
invariant between 's.x' and 's.n' might be violated by the call to  
'baz'?  If I had:

void test_with_assign() {
   struct foo s;
   s.x = malloc(10);
   s.n = 10;
   baz(&s.x);
   s.n = 20;
}

How should I go about enforcing the invariant "storing values: 'n'  
must always be accurate and reflect the current size of 'x'", either  
statically or dynamically, as 'x' may have changed after the call to  
'baz' and we have no further extent information?  I guess what I'm  
saying is that without knowledge of the implementation of 'baz' this  
couldn't be enforced by static checking at all (unless we make  
pessimistic assumptions or require more annotations).  I'm not certain  
what information you would track at runtime to ensure these invariants.

Cheers,
Ted

[cfe-dev] Language extension: __attribute__((bounds))/__attribute__((__bounded__))

[cfe-dev] Language extension: attribute((bounds))/attribute((bounded))