[cfe-dev] C sequence-point analysis

Mon Nov 18 11:30:08 PST 2013

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Your last message confused me a lot. Are you suggesting inserting
run-time checks into the AST? What about the whole metadata-thing from
your original message? I thought those were metadata for LLVM so I
could check the must-not-alias rules defined in them.

Run-time check is not my goal, I am trying to write a static check (of
course, false positives will happen). I understand your warning about
changes in the compiler, but anyway, is it possible to check this
statically (using the set of rules I am talking about, and maybe the
same rules you wrote about, but as I say, I'm confused about it)?

A short example so I make it clear:

Let us have this input:

int i = 42;
int *j = &i;
i = (*j)++;

In Clang, from the AST, I can make the following set (with only one
member in this example) of rules:

On the row 3, there is undefined behavior IF i may alias with j.

Now, can I "somehow" send this to LLVM and in LLVM, check whether i
and j do or do not alias?

*****************************
Lukas Hellebrandt
kamikazecz at gmail.com
*****************************

On 11/18/2013 07:54 PM, Richard Smith wrote:
> On Mon, Nov 18, 2013 at 5:17 AM, Lukas Hellebrandt
> <kamikazecz at gmail.com <mailto:kamikazecz at gmail.com>> wrote:
> 
> Hi, Richard,
> 
> Thanks for confirming my thoughts! That's what I meant, make a set
> of rules using Clang and then check them by LLVM.
> 
> As I said, I am relatively new to Clang and have never used LLVM 
> before. So, what's the best way to do this? Do I make a Clang
> plugin? I can do that (I already did a part of the analyzer,
> however, without any alias analysis; and I never did any change to
> the AST). And if yes, how do I send some (meta)data to LLVM? The
> only way I can think of is to save them to a file. And then, do I
> run a LLVM pass? Just asking if I understood correctly. My problem
> is now more about using Clang together with LLVM (as you confirmed
> my approach was a good idea, thanks for that again!).
> 
> 
>> To clarify, I'm suggesting that you teach Clang's IR generation
>> to emit code to catch these bugs when the compiled program is
>> run. That shouldn't require any alias analysis or similar at the
>> point where you insert the checks (if you wanted to be a bit
>> cleverer, you could emit metadata from Clang and use an LLVM pass
>> to emit the checks, and avoid emitting checks for cases where
>> alias analysis could prove they're unnecessary, but the optimizer
>> should remove the redundant checks anyway, so such cleverness
>> should hopefully be unnecessary).
> 
>> We have deliberately avoided having any checks in Clang that
>> operate by inspecting the IR post-optimization, since such checks
>> tend to be very unstable across changes to the compiler, and that
>> generates pain for users of the checks (a seemingly-unrelated
>> change to the code could cause the optimizer to make a different
>> decision and expose a problem, leading to an unexpected build
>> break for a -Werror build).
> 
> 
> ***************************** Lukas Hellebrandt 
> kamikazecz at gmail.com <mailto:kamikazecz at gmail.com> 
> *****************************
> 
> On 11/15/2013 09:46 PM, Richard Smith wrote:
>> On Fri, Nov 15, 2013 at 9:21 AM, Lukas Hellebrandt 
>> <kamikazecz at gmail.com <mailto:kamikazecz at gmail.com>
> <mailto:kamikazecz at gmail.com <mailto:kamikazecz at gmail.com>>>
> wrote:
> 
>> Hi all,
> 
>> I'm trying to write a tool for detecting undefined behavior in C 
>> regarding sequence points and side effects.
> 
>> I'm not sure whether it should be a Clang plugin or LLVM run or 
>> something completely different (I'm really new to both Clang and 
>> LLVM) and that's what I need advice with.
> 
>> For my work, I need to use both AST and alias analysis
> 
>> Clang plugin: +relatively easy to use +access to AST with all
>> the needed info EXCEPT alias analysis (right?) -no alias
>> analysis, I'd need to write one myself
> 
>> LLVM run: +built-in alias analysis (I'd like to use it, writing
>> my own alias analysis is not really what my work is all about) -I
>> do NOT have access to AST -I don't know it at all (but I'm ready
>> to learn it if it shows up to be the best option)
> 
>> The big PROBLEM is: a behavior that is undefined in C (and which 
>> Clang has access to) might be (and in my case WILL be) well
>> defined in LLVM (for example, i=i++; is undefined in C but in
>> LLVM code it will be already well defined and the result will
>> depend on Clang behavior).
> 
>> So I thought I could use both, in Clang create a list of rules, 
>> for example "On line L, there is an undefined behavior if X
>> aliases with Y" and then SOMEHOW dig this info from LLVM run.
> 
>> Is this a good idea? Is there a way (other than output to file 
>> from Clang and then read it in LLVM) to give this set of rules
>> to LLVM? I'd also be glad for any other idea, not necessarily 
>> including LLVM and Clang, to solve this problem.
> 
> 
>> I've thought about this a bit from the point of view of adding a 
>> check to -fsanitize=undefined. We already detect cases where
>> it's "locally" obvious that the expression will have undefined
>> behavior, for instance, we'll warn on your example of "i =
>> i++;".
> 
>> From the ubsan perspective, I think the thing to do here is to
>> form a list of pairs of subexpressions within the full-expression
>> where, if the two subexpressions refer to the same object, we
>> will have unsequenced accesses (in the presence of ?:, &&, and
>> ||, we'll need to be careful to only check these if the relevant
>> subexpression was actually evaluated). Then, perhaps at the end
>> of the full-expression, we emit a check that no such pair
>> evaluated to the same address.
> 
>> Building the set of pairs is a little tricky, but the code 
>> implementing -Wunsequenced can probably be reused to form this 
>> set.
> 
>> That said, we have a representational weakness in LLVM here: we 
>> have no way to express that the language semantics allow us to 
>> freely reorder two memory operations. If we could emit somehow 
>> (perhaps as metadata) the information that we have about 
>> full-expressions, we could use that as input to an LLVM alias 
>> analysis -- then we could use the same representation both to
>> drive a check and to drive optimizations. In the long term,
>> that's probably the superior approach.
> 
> 
>> (Not-fully-baked ideas follow...) We could imagine this working 
>> something like the following. Given:
> 
>> (a++, b) = c ? d++ : e
> 
>> We divide it into separately-sequenced portions:
> 
>> full expression: seq 0 a++: seq 1 for load, seq 2 for store b:
>> seq 3 c: seq 4 d++: seq 5 for load, seq 6 for store e: seq 7 
>> assignment: seq 8
> 
>> We know that:
> 
>> seq 2 and seq 3 are sequenced after seq 1 seq 4 and seq 5 are 
>> sequenced after seq 3 seq 6 is sequenced after seq 5 seq 8 is 
>> sequenced after seq 3, 5, and 7 seq 1-8 are nested within seq 0
>> (so are by default not sequenced with other things inside seq 0)
> 
>> We could then express this as metadata as:
> 
>> %a = load @a, !seq !seq1 %a.1 = add %a, 1 store @a, %a.1, !seq 
>> !seq2 ; ...
> 
>> !seq0 = !{} !seq1 = !{seq0} ; parent is 0 !seq2 = !{seq0,
>> !{seq1}} ; parent is 0, sequenced after 1 !seq3 = !{seq0,
>> !{seq1}} ; parent is 0, sequenced after 1 ; ...
> 
>> ... with the semantics that memory operations A and B are known
>> not to alias if at least one is a load, and they have the same
>> parent, and neither is (transitively) sequenced before the
>> other.
> 
> 
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iF4EAREIAAYFAlKKar8ACgkQHFHs/Czs0u426gD9E6hbIZqqb8Xum5uHEFeEr8kn
i0Y+hztYM2BsZJCWjbUA/iIz16gwIEjwubq18mBAH9flRpNNsjpQuCB+tvk/6mfw
=LFYl
-----END PGP SIGNATURE-----