[cfe-dev] Language extension: attribute((bounds))/attribute((bounded))

Thu Sep 24 00:45:33 PDT 2009

On 2009-09-24 05:46, Ted Kremenek wrote:
> Hi Edwin,
>
> I find this proposal interesting!

Hi Ted,

Thanks for the patience of reading through my (long) mail.

> I have a few comments inline, but I also wanted to draw attention to
> Microsoft's "Standard Annotation Language" (or SAL), which were
> originally derived to extend buffer overflow checking by the compiler
> (and other tools):
>
> http://msdn.microsoft.com/en-us/library/ms235402(VS.80).aspx
>
> While I'm not suggesting that we definitely implement SAL in Clang, I
> think it is worth at least considering some of the ideas there, as the
> encompass not only the annotation you propose but a more general set
> of annotations that can be used for security checking.  Moreover,
> there is some value having a common set of buffer overflow annotations
> that work across various platforms and compilers.

It seems the SAL language doesn't use attributes, but reserved keywords,
it would be trivial to implement a sal.h header once we have the attributes:
#define _ecount(variable) __attribute__((ebounds(variable)))
....

Adding attributes is easier than adding new keywords to clang, isn't it?

>
> Comments inline.
>
> On Sep 23, 2009, at 11:59 AM, Török Edwin wrote:
>
>> Hi,
>>
>> I started working on adding a new attribute to Clang, to annotate
>> pointers with bounds information,
>> before going further I'd like to hear your opinion whether this is
>> acceptable for clang (or if we can
>> modify this proposal for it to be acceptable).
>>
>> Given a pointer field in a struct, or a pointer parameter, the proposed
>> __attribute__ would tell
>> the compiler which field/parameter holds the bounds of the buffer
>> pointer.
>
> Is this extent in bytes, or in the number of items?

Number of items, because the number of bytes may be target dependent.
However we could have another attribute that is number of bytes.

>
>>
>> This is not a completely new attribute, in fact OpenBSD has had
>> something similar for GCC,
>> but only for function parameters [1].
>>
>> Some possible use cases for this attribute:
>> - emit a warning when a buffer pointer is passed to a function if the
>> length argument is larger than the
>> actual length of the buffer
>> - use this attribute in the clang static analyzer to do bounds-checks
>> - use this attribute to do static analysis in LLVM
>> - use clang to compile a subset of C, with this attribute, where the
>> bounds of all pointers are known,
>> and reject programs that are not part of this subset
>
> I agree very much that the attribute would be useful in all of these
> contexts.
>
>>
>> OpenBSD's attribute gives warnings only when the parameters are
>> constant, but I'd like to take it further,
>> and also warn/error when the length is stored in a struct, later read
>> and passed as parameter to a function.
>
> This kind of symbolic reasoning could be done in the clang static
> analyzer.
>
>> For now I implemented the sema+cg part for the bounds attribute, small
>> example attached at end of this mail [2]
>>
>> Since my attribute will do more/other than the OpenBSD attribute, to
>> avoid clashes, I named my attribute 'bounds' (suggestions welcome), but
>> I intend to support BSD's __bounded__ attribute too!
>> Also the __bounded__ attribute uses indexes, rather than names, which I
>> consider error prone.
>
> These attribute names are so similar that it would inevitably create
> conflicts.  SAL uses '_ecount' and '_bcount' (element and byte count
> respectively), so you could use something like 'buffer_length' and the
> like.  It's only a few more characters and explicitly states what the
> attribute does.

I don't care too much about how the attribute is called, so whatever is
easier to understand works for me. buffer_length or buffer_element_count
would do just fine.

>
>> Proposed semantics:
>> struct foo {
>> unsigned n;
>> unsigned *x __attribute__((bounds(n)));
>> };
>>
>> This will declare that x is an array of 'n' elements, where 'n' is the
>> field of the same struct.
>> NULL pointers will have size 0 (and so will buffers of size 0 which may
>> or may not be null).
>
> In terms of the parsing rules, would you allow 'n' to be declared
> after the variable it bounds?  e.g.:
>
> struct foo {
> unsigned *x __attribute__((bounds(n)));
> unsigned n;
> };
>
> Here 'n' is declared after 'x'.
>
> The same question applies to function parameters.

I intend to allow it, currently it doesn't work because I haven't
figured out how to do it yet:
ProcessDeclAttribute is called before 'n' is parsed in your example, and
d->getDeclContext()->lookup() doesn't find it.
Maybe I should add the name to the AST, and look it up later?

>
>> It will be enforced at 2 places:
>> - storing values: 'n' must always be accurate and reflect the current
>> size of 'x' (which means store order matters), with one exception: if
>> the 'struct foo' variable, or pointer to it is only visible in current
>> function, and doesn't escape, it is allowed to store in any order.
>
> I understand the motivation behind these restrictions, but my initial
> reaction is that they would be overly onerous for many contexts.  For
> example, consider:
>
> void bar(struct foo *p) {
>  p->x = 0;
>  p->n = 0;
> }
>
> Wouldn't this be deemed illegal since 'p->x' can potentially be
> referenced elsewhere?  Is the concern mainly that 'p->x' might be
> referenced in another thread?

Yes, another thread is one concern, but not what I was referrring to: it
may read an out-of-date value. But then another thread should use a lock
before accessing common data, and proving that it does, or doesn't is
out of scope for my analysis.

Rather my concern was another function might be called:
p->x=0;
foobar(p);
p->n=0;

Or:
p->x=0;
foobar(z);
p->n = 0;

And mayalias(p, z), or mustalias(p, z).

For the purposes of this annotation, it is enough if no other function
is called between setting the two fields. Of course this must include
any stores that write to a pointer that mayalias those fields,
which may lead to false positives.

>
>> - dereferencing x: the index must always be in the range [0, n)
>> - check that 'n' really is the size of the allocation
>> - if the size is unknown, it will warn/emit error too
>
> Can you be a little more specific by what you mean by "unknown"?

p->n = n;
p->x = foo(n);

Unless we see the body of foo(), or unless it has the malloc-like
attribute (which btw should be validated that it really allocs as many
elements as requested), or a bounds attribute on the return value, we
can't assume anything about the pointer it returns.
In this case clang/llvm should emit a warning/error, just as it would if
the wrong value would be stored for n: we don't know the size of the
pointer.

There is also a problem when a function's body is not known, but it has
annotations.  In this case we can do two things:
 - if the annotations come from a system header, assume they are
correct, and that whoever wrote them knew what they were doing
 - if the annotations come from a user header, we could either trust it
(but what if the function's body was compiled without including the header
that defines the annotations?), or give a warning that we can't verify
its body. Again this could be a compiler flag.

Can the clang analyzer check properties accross translation units, and
across library boundaries?
Maybe it could compile to bitcode, and try to link the bitcodes, the
LLVM linker could verify that you don't link functions which have
contradicting metadata annotations,
or missing annotations.

Either way, trusting user annotations on a function without a body is
not good, it is very easy to mistakenly fool the analyzer into thinking
the program is safe, when it is not.

>
> It seems to me that we need to distinguish between runtime and static
> checking.  With static/compiler analysis, there may be many cases
> where we cannot prove the index is within the range [0, n), but
> obviously can prove that it is within this range at runtime.

I agree that a compile time check can't prove everything, so I think
there should be a flag: give warning/error only when compiler can
statically prove
you violate the annotation, or give warning/error when compile-time
check can't prove everything, or insert runtime checks.
Inserting runtime checks is not trivial, and not on my plan.

As a start I'd go for implementing the first part, because that is what
I currently need (accept only programs that pass all compile-time
checks, and add
annotations to guide the compile-time checks).

Also giving warnings is enough, I can use -Werror if I want them to be
errors.

>
>> Note that these checks don't have to be done in clang necesarely, they
>> can be done in LLVM (in fact I'd prefer, it is easier
>> to do dataflow analysis there).
>
> The analysis will certainly be easier at the LLVM IR level, although
> reporting the diagnostics in an intelligible manner (for static
> checking) will probably lend itself more to doing it in Clang.  This
> is a common tradeoff.

In either case I'll need this info at the IR level (even if checking is
done in clang). Having it at the IR level also allows other languages to
use this annotation.
Although I can't give so nicely formatted html reports from LLVM IR, I
can give a source:file:line:errormessage-style reports from there, given
that  I have debug info,
and I run the analysis early enough that I still have all relevant debug
info.

>
>>
>> This attribute should work for pointers to constant size arrays,  VLAs
>> [3], pointers coming from malloc (and malloc-like wrappers).
>
> Could you provide an example of using malloc() with these attributes? 
> I'm assuming something like:
>
>   p->x = malloc(n);
>   p->n = n;

Yes, except it should have a nullpointer check, something like:
p->x = malloc(n);
if (!p->x)
    return;
p->n = n;

Or:
p->n = n;
p->x = malloc(n)
if (!p->x) {
  p->n = 0;
 ...
}

>
>>
>> For function parameters the semantics is as described in [1], except
>> checks will be done for non-constant sized buffers too.
>>
>> The emitted LLVM code will use metadata to capture this information, see
>> [2].
>>
>> [1]
>> http://www.openbsd.org/cgi-bin/man.cgi?query=gcc-local&apropos=0&sektion=0&manpath=OpenBSD+Current&arch=i386&format=html
>>
>
> Food for thought.  Suppose I have 'struct foo' as defined above, and
> do the following:
>
> void baz(unsigned **x_ref);
>
> void test() {
>   struct foo s;
>   s.x = malloc(10);
>   s.n = 10;
>   baz(&s.x);
> }
>
> Should this be flagged with a warning by the compiler, as the
> invariant between 's.x' and 's.n' might be violated by the call to
> 'baz'?  If I had:
>
> void test_with_assign() {
>   struct foo s;
>   s.x = malloc(10);
>   s.n = 10;
>   baz(&s.x);
>   s.n = 20;
> }
>
> How should I go about enforcing the invariant "storing values: 'n'
> must always be accurate and reflect the current size of 'x'", either
> statically or dynamically, as 'x' may have changed after the call to
> 'baz' and we have no further extent information?  I guess what I'm
> saying is that without knowledge of the implementation of 'baz' this
> couldn't be enforced by static checking at all (unless we make
> pessimistic assumptions or require more annotations).  I'm not certain
> what information you would track at runtime to ensure these invariants.

It should give a warning that bounds info may be inaccurate both at the
call to baz, and at s.n=20;
Enforcing it at runtime is not among my goals as stated above, but I
think someone may implement a mudflap-like approach for it.

To sum up my immediate goals with this attribute:
- allow annotating system headers (which includes clang's own
lib/Headers headers)
- allow annotating user functions/structs, but the compiler must check
that these annotations are accurate (i.e. the function really never
accesses the buffer outside its bounds,
or it really returns a buffer that has at least as many elements as it
claims)
- emit warnings if the program can't be statically proven to always obey
the attributes
- emit warnings if the program can't be statically proven to obey
bounds, given information from this attribute

Long term goals would be:
- insert runtime checks where that is easy to do (i.e. we have an
attribute on a parameter and an index, but can't statically prove
anything about the index: runtime check index against parameter),
and emit warning otherwise
- do cross-translation-unit checks

Could be done, but not on my TODOlist:
- do runtime checks for all pointers, this could use a mudflap-like or
softbound-like approach

Best regards,
--Edwin

[cfe-dev] Language extension: __attribute__((bounds))/__attribute__((__bounded__))

[cfe-dev] Language extension: attribute((bounds))/attribute((bounded))