[llvm-commits] MemorySanitizer, first patch

Tue Jun 26 09:20:36 PDT 2012

John,

On Tue, Jun 26, 2012 at 7:45 PM, John Criswell <criswell at illinois.edu>wrote:

>  On 6/26/12 4:18 AM, Kostya Serebryany wrote:
>
> Updated the patch to use the new interface for TLS
> (GlobalVariable::GeneralDynamicTLSModel).
> Comments?
>
>
> First, there's the non-technical question of whether this belongs in LLVM
> mainline or not in its current state.  Do you have users of this tool in
> the same way that you had users of ASan?  What platforms have you tested
> this tool on?  Do you get false positives when you link with native code
> libraries, and if so, is this false positive rate low?  I get the
> impression that this tool isn't as mature as TSan or ASan,
>

Your impression is correct -- msan is not yet usable. It works only on
linux-x86_64, only on small libc-free tests.
It has been heavily tested (on SPEC and other large programs), but w/o
actually poisoning the shadow.

> so I'm wondering whether it's really ready to go in yet.
>

Having these 500 LOC in trunk will simplify further development a lot.
(we will also need a few lines of driver changes).
I can't tell whether this justifies the commit to trunk now (it certainly
does for me).

>
> Second, on the technical side, I think dealing with the external code
> problem has to be addressed before the code can be committed; I suspect the
> tool will return too many false positives if you don't address it (if you
> have experiments indicating otherwise, please let us know).  I'm assuming
> from the comment at the beginning of the pass's implementation that it's
> still unaddressed.
>

We are working on this, in fact experimenting in a few different directions
(DynamoRIO, PIN, static instrumentation, even full recompile).

> I think instrumenting all library code is unrealistic; you'd have to have
> every library on your system compiled with the sanitizer (at least for
> large GUI applications).
>

It may still be doable in some restricted environments, but yes, I agree in
general.

> I think there's a few other approaches that you can take:
>
> 1) Assume external code works properly.  To do this, you first ensure that
> memory allocated by external code is always marked as initialized.  Second,
> when passing memory objects from internal, instrumented code to external
> library code, you mark the memory as initialized (since you're assuming the
> external code will initialize it).
>

That doesn't work.
What if I pass my pointer to a library function.
What has to be un-poisoned (marked as inited)?
The contents of the heap/stack/bss block to which this pointer belongs?
How about pointers that reside in this block? Do we need to go deeper?

>
> Of course, you can do things like special casing the C library to reduce
> false negatives, and you can instruct users to compile library code with
> the compiler to find more bugs.
>
> 2) Use something like SAFECode's CompleteChecks pass to determine which
> checks are on memory objects handled completely within the analyzed portion
> of the program and which checks can access memory handled by external
> code.  Checks on memory objects handled by external code always pass (or
> just generate a warning instead of an error, or something like that).  This
> approach requires a good points-to analysis algorithm.  SAFECode uses DSA
> which is unification-based.  You can either use that or, if you prefer, use
> a subset based algorithm like Anderson's.  I think there's one
> implementation of Anderson's for LLVM out there, but I don't know how
> robust it is.
>

I'll have to look at this more, but I don't (yet) see how it can help.
Again, suppose

struct A { int stuff; }
struct B { A *a; }
struct C { A *a, B* b};
...
struct Z { A* a, B *b, C *c, ... Y *y;}

...
Z *z = new Z;
library_function(z);

library_function can write to any part of Z, but we can't even traverse all
of it.
Can we do anything?

>
> 3) Use dynamic binary instrumentation to handle external code.
>
> The first option is probably the easiest to implement.  The second option
> requires a points-to analysis that can go into LLVM mainline.  The third
> option is a fair amount of work but probably easier than using points-to
> analysis and will probably yield a better debugging tool.
>

We *are* working on #3 (see above).

>
> Third, other technical nit-picks:
>
> 1) IMHO, in general, every method and function should have a comment
> explaining what it does and the meaning of its inputs and return values.
> Descriptions should use complete sentences.
>

Agree, I'll add comments.

>
> 2) You may want to consider using a class derived from InstVisitor instead
> of writing large if/then/else or switch code.
>

I Didn't know about InstVisitor.
However, I am not sure if that would be simpler. Will have a closer look.

>
> 3) If I understand correctly, a "shadow instruction" propagates
> information about initialization status through a computation (so, for
> example, if you have a c = add a, b instruction, there's a shadow
> instruction that computes c's initialization status based on a and b's
> initialization status).
>

Correct.

> It looks like you add shadow instructions for all instructions and then
> "fill in" the operands to the shadow instructions by iterating over the
> instructions a second time.  Is this correct?
>

Yes.

> If so, there's an alternative approach: if you fetch the dominator tree
> for the function and visit instructions in dominator tree order, then
> you'll visit all definitions before all uses.  That will alleviate the need
> to make two passes over all the instructions.  The only tricky part is phi
> nodes, but for those, you add shadow instructions (phi nodes with undef
> operands) to them first and then visit all the other instructions in
> dominator tree order; you fill in the operands of the shadow phi nodes at
> the end.
>
> I'm not sure which approach is faster (I don't think LLVM's dominator tree
> pass is the most efficient implementation), but it's something to think
> about.
>

I actually did think about this. I did not experiment with using
the dominator tree, but I think that the current way is at least as
efficient.
Another point is that this will make the implementation simpler -- I'll
give it a try.

Thanks!

--kcc

>
> -- John T.
>
>
>
>
>  On Fri, Jun 22, 2012 at 8:26 PM, Kostya Serebryany <kcc at google.com>wrote:
>
>> Any comments? Objections?
>> May I commit with post-commit review?
>>
>>  --kcc
>>
>>
>>  On Tue, Jun 19, 2012 at 11:33 AM, Kostya Serebryany <kcc at google.com>wrote:
>>
>>> Hello,
>>>
>>>  Please review the following patch which adds MemorySanitizer
>>> instrumentation pass.
>>> For the detailed description see
>>> http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-June/050821.html
>>> The patch contains instrumentation pass itself and basic LLVM-ish tests.
>>> Clang flag (-fmemory-sanitizer), the run-time library and more tests
>>> will follow.
>>>
>>>  Code review URL: http://codereview.appspot.com/6298091/
>>>
>>>  Thanks,
>>>
>>>  --kcc
>>>
>>
>>
>
>
> _______________________________________________
> llvm-commits mailing listllvm-commits at cs.uiuc.eduhttp://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20120626/c1fd73d4/attachment.html>