[llvm-commits] MemorySanitizer, first patch

John Criswell criswell at illinois.edu
Wed Jun 27 08:30:42 PDT 2012


On 6/26/12 11:20 AM, Kostya Serebryany wrote:
> John,
>
> On Tue, Jun 26, 2012 at 7:45 PM, John Criswell <criswell at illinois.edu 
> <mailto:criswell at illinois.edu>> wrote:
>
>     On 6/26/12 4:18 AM, Kostya Serebryany wrote:
>>     Updated the patch to use the new interface for TLS
>>     (GlobalVariable::GeneralDynamicTLSModel).
>>     Comments?
>
>     First, there's the non-technical question of whether this belongs
>     in LLVM mainline or not in its current state.  Do you have users
>     of this tool in the same way that you had users of ASan?  What
>     platforms have you tested this tool on?  Do you get false
>     positives when you link with native code libraries, and if so, is
>     this false positive rate low?  I get the impression that this tool
>     isn't as mature as TSan or ASan,
>
>
> Your impression is correct -- msan is not yet usable. It works only on 
> linux-x86_64, only on small libc-free tests.
> It has been heavily tested (on SPEC and other large programs), but w/o 
> actually poisoning the shadow.
>
>     so I'm wondering whether it's really ready to go in yet.
>
>
> Having these 500 LOC in trunk will simplify further development a lot.
> (we will also need a few lines of driver changes).
> I can't tell whether this justifies the commit to trunk now (it 
> certainly does for me).

I think it would be better to fix the external code issue before trying 
to commit to LLVM trunk; it's the only obstacle I see to having a 
working tool.  Once that is fixed, you can get feedback on the tool, 
and you'll also know what the ramifications of your approach to 
handling external library code are.  As it stands, I think this new 
tool is at too early a stage to add to LLVM.

>
>     Second, on the technical side, I think dealing with the external
>     code problem has to be addressed before the code can be committed;
>     I suspect the tool will return too many false positives if you
>     don't address it (if you have experiments indicating otherwise,
>     please let us know).  I'm assuming from the comment at the
>     beginning of the pass's implementation that it's still unaddressed.
>
>
> We are working on this, in fact experimenting in a few different 
> directions (DynamoRIO, PIN, static instrumentation, even full recompile).

Okay.

>     I think instrumenting all library code is unrealistic; you'd have
>     to have every library on your system compiled with the sanitizer
>     (at least for large GUI applications).
>
>
> It may still be doable in some restricted environments, but yes, I 
> agree in general.
>
>     I think there's a few other approaches that you can take:
>
>     1) Assume external code works properly.  To do this, you first
>     ensure that memory allocated by external code is always marked as
>     initialized.  Second, when passing memory objects from internal,
>     instrumented code to external library code, you mark the memory as
>     initialized (since you're assuming the external code will
>     initialize it).
>
>
> That doesn't work.
> What if I pass my pointer to a library function.
> What has to be un-poisoned (marked as inited)?
> The contents of the heap/stack/bss block to which this pointer belongs?

Yes, that's what I was thinking.

> How about pointers that reside in this block? Do we need to go deeper?

Ah, that's a problem.  You might be able to do this conservatively by 
treating values that point into valid memory objects as pointers (like 
conservative garbage collection), but it'd be expensive and probably not 
worth the trouble.
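
If you did want to try it, a rough sketch of the conservative walk might
look like the code below.  The block registry, the byte-granular shadow
set, and all of the names are invented for illustration; none of this is
part of msan or its runtime.

    #include <cstdint>
    #include <cstring>
    #include <map>
    #include <set>

    // Hypothetical allocation registry: block start address -> size.
    static std::map<uintptr_t, size_t> g_blocks;
    // Hypothetical byte-granular shadow: addresses considered uninitialized.
    static std::set<uintptr_t> g_poisoned;

    // Find the block containing p, if any (as a conservative GC would).
    static bool find_block(uintptr_t p, uintptr_t &start, size_t &size) {
      std::map<uintptr_t, size_t>::iterator it = g_blocks.upper_bound(p);
      if (it == g_blocks.begin()) return false;
      --it;
      if (p >= it->first + it->second) return false;
      start = it->first;
      size = it->second;
      return true;
    }

    // Mark a block initialized, then scan it for values that look like
    // pointers into other known blocks and recurse on those blocks.
    static void unpoison_reachable(uintptr_t start, size_t size,
                                   std::set<uintptr_t> &visited) {
      if (!visited.insert(start).second) return;   // already handled
      for (uintptr_t a = start; a < start + size; ++a)
        g_poisoned.erase(a);
      for (uintptr_t a = start; a + sizeof(void*) <= start + size;
           a += sizeof(void*)) {
        uintptr_t v;
        std::memcpy(&v, reinterpret_cast<void*>(a), sizeof(v));
        uintptr_t bstart; size_t bsize;
        if (find_block(v, bstart, bsize))
          unpoison_reachable(bstart, bsize, visited);
      }
    }

Even in this toy form it has to scan every pointer-sized word of every
reachable block on each library call, which is why I doubt it's worth it.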

>
>     Of course, you can do things like special casing the C library to
>     reduce false negatives, and you can instruct users to compile
>     library code with the compiler to find more bugs.
>
>     2) Use something like SAFECode's CompleteChecks pass to determine
>     which checks are on memory objects handled completely within the
>     analyzed portion of the program and which checks can access memory
>     handled by external code.  Checks on memory objects handled by
>     external code always pass (or just generate a warning instead of
>     an error, or something like that).  This approach requires a good
>     points-to analysis algorithm.  SAFECode uses DSA which is
>     unification-based.  You can either use that or, if you prefer, use
>     a subset-based algorithm like Andersen's.  I think there's one
>     implementation of Andersen's for LLVM out there, but I don't know
>     how robust it is.
>
>
> I'll have to look at this more, but I don't (yet) see how it can help.
> Again, suppose
>
> struct A { int stuff; };
> struct B { A *a; };
> struct C { A *a; B *b; };
> ...
> struct Z { A *a; B *b; C *c; ... Y *y; };
>
> ...
> Z *z = new Z;
> library_function(z);
>
> library_function can write to any part of Z, but we can't even 
> traverse all of it.
> Can we do anything?

No, you can't really say anything about the object to which z points or 
the objects reachable from the pointer z.  Any checks on z, z->a, etc. 
would have to conservatively pass.

This approach doesn't help you check for initialization errors with 
library code; it helps prevent library code from triggering false 
positives.  It ensures that all of your checks are performed on memory 
objects handled by internal code.

For memory safety, relaxing the safety guarantees for memory objects 
passed to library code still results in a usable system that catches 
memory safety errors.  I don't know if the same is true for 
uninitialized value detection; you'll probably have to experiment and see.
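
To make the idea concrete, the check-insertion logic would boil down to
something like the sketch below.  PointsToAnalysis and
mayEscapeToExternalCode() are invented placeholders for whatever
interface the chosen analysis (DSA, Andersen's, or similar) actually
provides; GetUnderlyingObject() is the LLVM helper that strips GEPs and
casts (header paths here are from a recent tree).

    #include "llvm/Analysis/ValueTracking.h"   // GetUnderlyingObject()
    #include "llvm/IR/Value.h"
    using namespace llvm;

    // Stand-in for a real points-to analysis (DSA, Andersen's, ...).
    struct PointsToAnalysis {
      bool mayEscapeToExternalCode(const Value *Obj) const;
    };

    // Decide whether a check on Ptr should report a hard error or should
    // conservatively pass (or merely warn).
    static bool shouldEmitHardCheck(Value *Ptr, const PointsToAnalysis &PTA) {
      Value *Obj = GetUnderlyingObject(Ptr);   // strip GEPs, bitcasts, etc.
      // If uninstrumented code may reach this object, its shadow can't be
      // trusted, so don't report a hard error on it.
      return !PTA.mayEscapeToExternalCode(Obj);
    }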


>
>     3) Use dynamic binary instrumentation to handle external code.
>
>     The first option is probably the easiest to implement. The second
>     option requires a points-to analysis that can go into LLVM
>     mainline.  The third option is a fair amount of work but probably
>     easier than using points-to analysis and will probably yield a
>     better debugging tool.
>
>
> We *are* working on #3 (see above).

I think using dynamic binary instrumentation is probably the best option 
of the three.  Just take my opinion with the caveat that I haven't run 
any experiments on uninitialized value usage.

>
>     Third, other technical nit-picks:
>
>     1) IMHO, in general, every method and function should have a
>     comment explaining what it does and the meaning of its inputs and
>     return values.  Descriptions should use complete sentences.
>
>
> Agree, I'll add comments.
>
>
>     2) You may want to consider using a class derived from InstVisitor
>     instead of writing large if/then/else or switch code.
>
>
> I didn't know about InstVisitor.
> However, I am not sure if that would be simpler. Will have a closer look.

InstVisitor is faster than using dyn_cast<>() and is equivalent to using 
a switch on the instruction opcode (which your code does in some 
places).  It may make your code more readable.
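
For what it's worth, the skeleton would look roughly like the sketch
below (the header has moved around between LLVM versions; it is
llvm/IR/InstVisitor.h in recent trees).  The visit bodies are
placeholders, not your patch's actual shadow logic.

    #include "llvm/IR/InstVisitor.h"
    #include "llvm/IR/Instructions.h"
    using namespace llvm;

    struct ShadowVisitor : public InstVisitor<ShadowVisitor> {
      void visitLoadInst(LoadInst &I)   { /* load the shadow for I's address */ }
      void visitStoreInst(StoreInst &I) { /* store the shadow of the stored value */ }
      void visitBinaryOperator(BinaryOperator &I) { /* e.g. OR the operand shadows */ }
      void visitInstruction(Instruction &I) { /* fallback for anything unhandled */ }
    };

    // Usage: ShadowVisitor().visit(F) dispatches on each instruction's
    // opcode, just like a hand-written switch, but keeps handlers separate.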

>
>
>     3) If I understand correctly, a "shadow instruction" propagates
>     information about initialization status through a computation (so,
>     for example, if you have a c = add a, b instruction, there's a
>     shadow instruction that computes c's initialization status based
>     on a and b's initialization status).
>
>
> Correct.
>
>     It looks like you add shadow instructions for all instructions and
>     then "fill in" the operands to the shadow instructions by
>     iterating over the instructions a second time.  Is this correct?
>
>
> Yes.

How do you handle passing taint (i.e., initialization status) between 
actual parameters and formal parameters in a function call? Do you store 
the information to memory and then read it in the callee?
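
Just to make the question concrete, one plausible scheme (written as
C-style pseudocode rather than IR, with invented names; I'm not claiming
this is what your patch does) is a per-thread array of parameter shadow
slots:

    #include <cstdint>

    // One shadow slot per parameter; written by the caller, read by the
    // callee.  Purely illustrative.
    static thread_local uint32_t shadow_param_tls[8];

    // Instrumented version of:  int add(int a, int b) { return a + b; }
    // The return value's shadow is handed back via ret_shadow here; a real
    // scheme might use another TLS slot instead.
    static int add_instrumented(int a, int b, uint32_t &ret_shadow) {
      uint32_t sa = shadow_param_tls[0];   // formal parameters pick up the
      uint32_t sb = shadow_param_tls[1];   // shadows their caller stored
      ret_shadow = sa | sb;                // c = add a, b  =>  Sc = Sa | Sb
      return a + b;
    }

    // Caller side: store the actual parameters' shadows before the call.
    static int caller(int x, uint32_t sx, int y, uint32_t sy, uint32_t &sc) {
      shadow_param_tls[0] = sx;
      shadow_param_tls[1] = sy;
      return add_instrumented(x, y, sc);
    }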

>
>
>     If so, there's an alternative approach: if you fetch the dominator
>     tree for the function and visit instructions in dominator tree
>     order, then you'll visit all definitions before all uses.  That
>     will alleviate the need to make two passes over all the
>     instructions.  The only tricky part is phi nodes, but for those,
>     you add shadow instructions (phi nodes with undef operands) to
>     them first and then visit all the other instructions in dominator
>     tree order; you fill in the operands of the shadow phi nodes at
>     the end.
>
>     I'm not sure which approach is faster (I don't think LLVM's
>     dominator tree pass is the most efficient implementation), but
>     it's something to think about.
>
>
> I actually did think about this. I did not experiment with using 
> the dominator tree, but I think that the current way is at least as 
> efficient.

It's really a question of whether computing the dominator tree is faster 
or slower than iterating over all instructions twice (or whether the 
dominator tree has already been computed by some previously run 
transform).  I suppose the dominator tree might be better if you're not 
going to instrument all instructions.
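
For reference, the traversal I had in mind looks roughly like this
(written against a recent LLVM tree, so header paths and iteration style
may differ from what you're using):

    #include "llvm/ADT/DepthFirstIterator.h"
    #include "llvm/IR/Dominators.h"
    #include "llvm/IR/Instructions.h"
    using namespace llvm;

    static void visitShadowInDomOrder(DominatorTree &DT) {
      // A preorder walk of the dominator tree reaches every definition
      // before any of its non-phi uses, so the operands' shadows already
      // exist when each shadow instruction is created.
      for (df_iterator<DomTreeNode *> DI = df_begin(DT.getRootNode()),
                                      DE = df_end(DT.getRootNode());
           DI != DE; ++DI) {
        BasicBlock *BB = (*DI)->getBlock();
        for (Instruction &I : *BB) {
          if (isa<PHINode>(I))
            continue;   // shadow phis get undef operands up front and are
                        // filled in after the walk
          // ... create the shadow instruction for I here ...
        }
      }
    }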

-- John T.
