[cfe-dev] C sequence-point analysis

Lukas Hellebrandt kamikazecz at gmail.com
Mon Nov 18 05:17:35 PST 2013


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Hi, Richard,

Thanks for confirming my thoughts! That's what I meant, make a set of
rules using Clang and then check them by LLVM.

As I said, I am relatively new to Clang and have never used LLVM
before. So, what's the best way to do this? Do I make a Clang plugin?
I can do that (I already did a part of the analyzer, however, without
any alias analysis; and I never did any change to the AST). And if
yes, how do I send some (meta)data to LLVM? The only way I can think
of is to save them to a file. And then, do I run a LLVM pass? Just
asking if I understood correctly. My problem is now more about using
Clang together with LLVM (as you confirmed my approach was a good
idea, thanks for that again!).

*****************************
Lukas Hellebrandt
kamikazecz at gmail.com
*****************************

On 11/15/2013 09:46 PM, Richard Smith wrote:
> On Fri, Nov 15, 2013 at 9:21 AM, Lukas Hellebrandt
> <kamikazecz at gmail.com <mailto:kamikazecz at gmail.com>> wrote:
> 
> Hi all,
> 
> I'm trying to write a tool for detecting undefined behavior in C 
> regarding sequence points and side effects.
> 
> I'm not sure whether it should be a Clang plugin or LLVM run or 
> something completely different (I'm really new to both Clang and
> LLVM) and that's what I need advice with.
> 
> For my work, I need to use both AST and alias analysis
> 
> Clang plugin: +relatively easy to use +access to AST with all the
> needed info EXCEPT alias analysis (right?) -no alias analysis, I'd
> need to write one myself
> 
> LLVM run: +built-in alias analysis (I'd like to use it, writing my
> own alias analysis is not really what my work is all about) -I do
> NOT have access to AST -I don't know it at all (but I'm ready to
> learn it if it shows up to be the best option)
> 
> The big PROBLEM is: a behavior that is undefined in C (and which
> Clang has access to) might be (and in my case WILL be) well defined
> in LLVM (for example, i=i++; is undefined in C but in LLVM code it
> will be already well defined and the result will depend on Clang
> behavior).
> 
> So I thought I could use both, in Clang create a list of rules,
> for example "On line L, there is an undefined behavior if X aliases
> with Y" and then SOMEHOW dig this info from LLVM run.
> 
> Is this a good idea? Is there a way (other than output to file
> from Clang and then read it in LLVM) to give this set of rules to
> LLVM? I'd also be glad for any other idea, not necessarily
> including LLVM and Clang, to solve this problem.
> 
> 
> I've thought about this a bit from the point of view of adding a
> check to -fsanitize=undefined. We already detect cases where it's
> "locally" obvious that the expression will have undefined behavior,
> for instance, we'll warn on your example of "i = i++;".
> 
> From the ubsan perspective, I think the thing to do here is to form
> a list of pairs of subexpressions within the full-expression where,
> if the two subexpressions refer to the same object, we will have
> unsequenced accesses (in the presence of ?:, &&, and ||, we'll need
> to be careful to only check these if the relevant subexpression was
> actually evaluated). Then, perhaps at the end of the
> full-expression, we emit a check that no such pair evaluated to the
> same address.
> 
> Building the set of pairs is a little tricky, but the code
> implementing -Wunsequenced can probably be reused to form this
> set.
> 
> That said, we have a representational weakness in LLVM here: we
> have no way to express that the language semantics allow us to
> freely reorder two memory operations. If we could emit somehow
> (perhaps as metadata) the information that we have about
> full-expressions, we could use that as input to an LLVM alias
> analysis -- then we could use the same representation both to drive
> a check and to drive optimizations. In the long term, that's
> probably the superior approach.
> 
> 
> (Not-fully-baked ideas follow...) We could imagine this working 
> something like the following. Given:
> 
> (a++, b) = c ? d++ : e
> 
> We divide it into separately-sequenced portions:
> 
> full expression: seq 0 a++: seq 1 for load, seq 2 for store b: seq
> 3 c: seq 4 d++: seq 5 for load, seq 6 for store e: seq 7 
> assignment: seq 8
> 
> We know that:
> 
> seq 2 and seq 3 are sequenced after seq 1 seq 4 and seq 5 are
> sequenced after seq 3 seq 6 is sequenced after seq 5 seq 8 is
> sequenced after seq 3, 5, and 7 seq 1-8 are nested within seq 0 (so
> are by default not sequenced with other things inside seq 0)
> 
> We could then express this as metadata as:
> 
> %a = load @a, !seq !seq1 %a.1 = add %a, 1 store @a, %a.1, !seq
> !seq2 ; ...
> 
> !seq0 = !{} !seq1 = !{seq0} ; parent is 0 !seq2 = !{seq0, !{seq1}}
> ; parent is 0, sequenced after 1 !seq3 = !{seq0, !{seq1}} ; parent
> is 0, sequenced after 1 ; ...
> 
> ... with the semantics that memory operations A and B are known not
> to alias if at least one is a load, and they have the same parent,
> and neither is (transitively) sequenced before the other.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iF4EAREIAAYFAlKKE28ACgkQHFHs/Czs0u61uAD+M+vsA0kFpGS9YztJkuN9Jkqb
8a8q4RfFdCJiyku75QwBAKrshaFAfM98UMpymLrEbJn6q60UK9hw9wkny6u0M17S
=GLxY
-----END PGP SIGNATURE-----



More information about the cfe-dev mailing list