<div dir="ltr">On Mon, Nov 18, 2013 at 11:30 AM, Lukas Hellebrandt <span dir="ltr"><<a href="mailto:kamikazecz@gmail.com" target="_blank">kamikazecz@gmail.com</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">-----BEGIN PGP SIGNED MESSAGE-----<br>
Hash: SHA256<br>
<br>
</div>Your last message confused me a lot. Are you suggesting inserting<br>
run-time checks into the AST? What about the whole metadata-thing from<br>
your original message? I thought those were metadata for LLVM so I<br>
could check the must-not-alias rules defined in them.<br></blockquote><div><br></div><div>The idea I was expressing there was that emitting the sequencing rules as metadata would allow LLVM to optimize better by using them as a source of aliasing information.</div>
<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Run-time check is not my goal, I am trying to write a static check (of<br>
course, false positives will happen). I understand your warning about<br>
changes in the compiler, but anyway, is it possible to check this<br>
statically (using the set of rules I am talking about, and maybe the<br>
same rules you wrote about, but as I say, I'm confused about it)?<br>
<br>
A short example so I make it clear:<br>
<br>
Let us have this input:<br>
<br>
int i = 42;<br>
int *j = &i;<br>
i = (*j)++;<br>
<br>
In Clang, from the AST, I can make the following set (with only one<br>
member in this example) of rules:<br>
<br>
On the row 3, there is undefined behavior IF i may alias with j.<br>
<br>
Now, can I "somehow" send this to LLVM and in LLVM, check whether i<br>
and j do or do not alias?</blockquote><div><br></div><div>Yes, this is possible. If you want to write a check that can be pushed into upstream Clang, I would not advise doing it this way, because (as mentioned) we have gone to lengths to avoid such checks. Instead, I'd suggest writing a static analysis checker and/or a sanitizer (insert instrumentation to detect problems when the compiled code is run).</div>
<div><br></div><div>If you don't care about pushing this upstream, this is something you can reasonably check in a custom LLVM pass, with some custom Clang code to emit metadata to track the sequencing rules. You would attach the metadata to some of the instructions emitted by Clang (attaching to loads and stores probably makes most sense here), then write a custom LLVM pass that looks at the metadata and determines whether unsequenced modifications have happened through aliasing pointers.</div>
<div><br></div><div>I would expect you'd get significantly better results (fewer false negatives) through runtime instrumentation, though.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="im">
*****************************<br>
Lukas Hellebrandt<br>
<a href="mailto:kamikazecz@gmail.com">kamikazecz@gmail.com</a><br>
*****************************<br>
<br>
</div><div class="im">On 11/18/2013 07:54 PM, Richard Smith wrote:<br>
> On Mon, Nov 18, 2013 at 5:17 AM, Lukas Hellebrandt<br>
</div><div><div class="h5">> <<a href="mailto:kamikazecz@gmail.com">kamikazecz@gmail.com</a> <mailto:<a href="mailto:kamikazecz@gmail.com">kamikazecz@gmail.com</a>>> wrote:<br>
><br>
> Hi, Richard,<br>
><br>
> Thanks for confirming my thoughts! That's what I meant, make a set<br>
> of rules using Clang and then check them by LLVM.<br>
><br>
> As I said, I am relatively new to Clang and have never used LLVM<br>
> before. So, what's the best way to do this? Do I make a Clang<br>
> plugin? I can do that (I already did a part of the analyzer,<br>
> however, without any alias analysis; and I never did any change to<br>
> the AST). And if yes, how do I send some (meta)data to LLVM? The<br>
> only way I can think of is to save them to a file. And then, do I<br>
> run a LLVM pass? Just asking if I understood correctly. My problem<br>
> is now more about using Clang together with LLVM (as you confirmed<br>
> my approach was a good idea, thanks for that again!).<br>
><br>
><br>
>> To clarify, I'm suggesting that you teach Clang's IR generation<br>
>> to emit code to catch these bugs when the compiled program is<br>
>> run. That shouldn't require any alias analysis or similar at the<br>
>> point where you insert the checks (if you wanted to be a bit<br>
>> cleverer, you could emit metadata from Clang and use an LLVM pass<br>
>> to emit the checks, and avoid emitting checks for cases where<br>
>> alias analysis could prove they're unnecessary, but the optimizer<br>
>> should remove the redundant checks anyway, so such cleverness<br>
>> should hopefully be unnecessary).<br>
><br>
>> We have deliberately avoided having any checks in Clang that<br>
>> operate by inspecting the IR post-optimization, since such checks<br>
>> tend to be very unstable across changes to the compiler, and that<br>
>> generates pain for users of the checks (a seemingly-unrelated<br>
>> change to the code could cause the optimizer to make a different<br>
>> decision and expose a problem, leading to an unexpected build<br>
>> break for a -Werror build).<br>
><br>
><br>
> ***************************** Lukas Hellebrandt<br>
</div></div>> <a href="mailto:kamikazecz@gmail.com">kamikazecz@gmail.com</a> <mailto:<a href="mailto:kamikazecz@gmail.com">kamikazecz@gmail.com</a>><br>
<div class="im">> *****************************<br>
><br>
> On 11/15/2013 09:46 PM, Richard Smith wrote:<br>
>> On Fri, Nov 15, 2013 at 9:21 AM, Lukas Hellebrandt<br>
>> <<a href="mailto:kamikazecz@gmail.com">kamikazecz@gmail.com</a> <mailto:<a href="mailto:kamikazecz@gmail.com">kamikazecz@gmail.com</a>><br>
</div>> <mailto:<a href="mailto:kamikazecz@gmail.com">kamikazecz@gmail.com</a> <mailto:<a href="mailto:kamikazecz@gmail.com">kamikazecz@gmail.com</a>>>><br>
<div><div class="h5">> wrote:<br>
><br>
>> Hi all,<br>
><br>
>> I'm trying to write a tool for detecting undefined behavior in C<br>
>> regarding sequence points and side effects.<br>
><br>
>> I'm not sure whether it should be a Clang plugin or LLVM run or<br>
>> something completely different (I'm really new to both Clang and<br>
>> LLVM) and that's what I need advice with.<br>
><br>
>> For my work, I need to use both AST and alias analysis<br>
><br>
>> Clang plugin: +relatively easy to use +access to AST with all<br>
>> the needed info EXCEPT alias analysis (right?) -no alias<br>
>> analysis, I'd need to write one myself<br>
><br>
>> LLVM run: +built-in alias analysis (I'd like to use it, writing<br>
>> my own alias analysis is not really what my work is all about) -I<br>
>> do NOT have access to AST -I don't know it at all (but I'm ready<br>
>> to learn it if it shows up to be the best option)<br>
><br>
>> The big PROBLEM is: a behavior that is undefined in C (and which<br>
>> Clang has access to) might be (and in my case WILL be) well<br>
>> defined in LLVM (for example, i=i++; is undefined in C but in<br>
>> LLVM code it will be already well defined and the result will<br>
>> depend on Clang behavior).<br>
><br>
>> So I thought I could use both, in Clang create a list of rules,<br>
>> for example "On line L, there is an undefined behavior if X<br>
>> aliases with Y" and then SOMEHOW dig this info from LLVM run.<br>
><br>
>> Is this a good idea? Is there a way (other than output to file<br>
>> from Clang and then read it in LLVM) to give this set of rules<br>
>> to LLVM? I'd also be glad for any other idea, not necessarily<br>
>> including LLVM and Clang, to solve this problem.<br>
><br>
><br>
>> I've thought about this a bit from the point of view of adding a<br>
>> check to -fsanitize=undefined. We already detect cases where<br>
>> it's "locally" obvious that the expression will have undefined<br>
>> behavior, for instance, we'll warn on your example of "i =<br>
>> i++;".<br>
><br>
>> From the ubsan perspective, I think the thing to do here is to<br>
>> form a list of pairs of subexpressions within the full-expression<br>
>> where, if the two subexpressions refer to the same object, we<br>
>> will have unsequenced accesses (in the presence of ?:, &&, and<br>
>> ||, we'll need to be careful to only check these if the relevant<br>
>> subexpression was actually evaluated). Then, perhaps at the end<br>
>> of the full-expression, we emit a check that no such pair<br>
>> evaluated to the same address.<br>
><br>
>> Building the set of pairs is a little tricky, but the code<br>
>> implementing -Wunsequenced can probably be reused to form this<br>
>> set.<br>
><br>
>> That said, we have a representational weakness in LLVM here: we<br>
>> have no way to express that the language semantics allow us to<br>
>> freely reorder two memory operations. If we could emit somehow<br>
>> (perhaps as metadata) the information that we have about<br>
>> full-expressions, we could use that as input to an LLVM alias<br>
>> analysis -- then we could use the same representation both to<br>
>> drive a check and to drive optimizations. In the long term,<br>
>> that's probably the superior approach.<br>
><br>
><br>
>> (Not-fully-baked ideas follow...) We could imagine this working<br>
>> something like the following. Given:<br>
><br>
>> (a++, b) = c ? d++ : e<br>
><br>
>> We divide it into separately-sequenced portions:<br>
><br>
>> full expression: seq 0 a++: seq 1 for load, seq 2 for store b:<br>
>> seq 3 c: seq 4 d++: seq 5 for load, seq 6 for store e: seq 7<br>
>> assignment: seq 8<br>
><br>
>> We know that:<br>
><br>
>> seq 2 and seq 3 are sequenced after seq 1 seq 4 and seq 5 are<br>
>> sequenced after seq 3 seq 6 is sequenced after seq 5 seq 8 is<br>
>> sequenced after seq 3, 5, and 7 seq 1-8 are nested within seq 0<br>
>> (so are by default not sequenced with other things inside seq 0)<br>
><br>
>> We could then express this as metadata as:<br>
><br>
>> %a = load @a, !seq !seq1 %a.1 = add %a, 1 store @a, %a.1, !seq<br>
>> !seq2 ; ...<br>
><br>
>> !seq0 = !{} !seq1 = !{seq0} ; parent is 0 !seq2 = !{seq0,<br>
>> !{seq1}} ; parent is 0, sequenced after 1 !seq3 = !{seq0,<br>
>> !{seq1}} ; parent is 0, sequenced after 1 ; ...<br>
><br>
>> ... with the semantics that memory operations A and B are known<br>
>> not to alias if at least one is a load, and they have the same<br>
>> parent, and neither is (transitively) sequenced before the<br>
>> other.<br>
><br>
><br>
-----BEGIN PGP SIGNATURE-----<br>
Version: GnuPG v1.4.13 (GNU/Linux)<br>
Comment: Using GnuPG with Thunderbird - <a href="http://www.enigmail.net/" target="_blank">http://www.enigmail.net/</a><br>
<br>
</div></div>iF4EAREIAAYFAlKKar8ACgkQHFHs/Czs0u426gD9E6hbIZqqb8Xum5uHEFeEr8kn<br>
i0Y+hztYM2BsZJCWjbUA/iIz16gwIEjwubq18mBAH9flRpNNsjpQuCB+tvk/6mfw<br>
=LFYl<br>
-----END PGP SIGNATURE-----<br>
</blockquote></div><br></div></div>