Fresh numbers after several performance fixes in msan run-time. <div>The average slowdown of msan is 3x (ranges between 2x and 4x). </div><div>When running under msan+DynamoRio (msandr) the average slowdown is 3.6x</div><div>

<br></div><div><div># benchmark  native-time(seconds) msan-slowdown(times) msan+dr-slowdown(times)</div><div>400.perlbench   379     3.71    4.75</div><div><div>401.bzip2       483     3.02    2.93</div><div>403.gcc         312     2.91    4.22</div>

<div>429.mcf         313     2.97    2.97</div><div>445.gobmk       429     3.32    6.41</div><div>456.hmmer       606     3.69    2.55</div><div>458.sjeng       491     3.98    6.39</div><div>462.libquantum  446     2.72    2.18</div>

<div>464.h264ref     573     3.61    5.13</div><div>471.omnetpp     317     2.16    2.91</div><div>473.astar       412     2.26    2.32</div><div>483.xalancbmk   241     3.11    4.99</div><div>433.milc        402     2.25    2.31</div>

<div>444.namd        374     2.53    2.22</div><div>447.dealII      311     2.82    3.39</div><div>450.soplex      232     2.60    2.80</div><div>453.povray      197     3.35    5.28</div><div>470.lbm         321     2.10    2.28</div>

<div>482.sphinx3     496     4.00    3.35</div><div><br></div><div><div>Funny fact: sometimes the DynamoRio tool makes things faster due to "trace optimization". We could probably fix this in the LLVM pass. </div>

<div><br></div><div>Thanks, </div><div>--kcc </div></div><div><br></div><div><br></div><br><div class="gmail_quote">On Thu, Jul 19, 2012 at 7:42 PM, Evgeniy Stepanov <span dir="ltr"><<a href="mailto:eugeni.stepanov@gmail.com" target="_blank">eugeni.stepanov@gmail.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">We've solved the issue of external code with the help of dynamic<br>

binary instrumentation. A simple (~500LOK) DynamoRio client zero-fills<br>

shadow memory for all stores and return values in uninstrumented code,<br>

thus avoiding false positives.<br>

<br>

I'm attaching the new patch to LLVM with a slightly improved<br>

instrumentation pass. Repository at<br>

<a href="http://code.google.com/p/memory-sanitizer/source/browse/" target="_blank">http://code.google.com/p/memory-sanitizer/source/browse/</a> contains the<br>

runtime library and the DynamoRio client.<br>

<br>

In its current state, the tool can handle all C and C++ tests in SPEC<br>

CPU2006, finding at least one use of uninitialized data that's also<br>

detected by Memcheck (Valgrind), and one or two false positives coming<br>

from our inexact handling of certain instructions.<br>

<br>

See the performance numbers. "MSan time" was measured by running<br>

without the dynamic tool, initializing shadow to 0x0 instead of 0xFF<br>

to avoid reporting any errors. "MSan/DR time" is from an honest run of<br>

an instrumented benchmark under the DynamoRio-based tool.<br>

<br>

#    test,     native time,   MSan time,   MSan slowdown,    MSan/DR<br>

time,    MSan/DR slowdown<br>

  400.perlbench,        22.40,       116.00,         5.18<br>

138.00,         6.16<br>

      401.bzip2,        41.00,       155.00,         3.78<br>

149.00,         3.63<br>

        403.gcc,         0.79,         3.52,         4.48<br>

10.50,        13.38<br>

        429.mcf,        13.80,        37.00,         2.68<br>

46.70,         3.38<br>

      445.gobmk,        83.10,       328.00,         3.95<br>

440.00,         5.29<br>

      456.hmmer,        59.70,       172.00,         2.88<br>

145.00,         2.43<br>

      458.sjeng,       105.00,       305.00,         2.90<br>

391.00,         3.72<br>

 462.libquantum,         1.66,         4.37,         2.63<br>

3.88,         2.34<br>

    464.h264ref,        71.30,       267.00,         3.74<br>

358.00,         5.02<br>

    471.omnetpp,        42.20,       246.00,         5.83<br>

436.00,        10.33<br>

      473.astar,        82.40,       243.00,         2.95<br>

285.00,         3.46<br>

  483.xalancbmk,        48.40,       644.00,        13.31<br>

1164.00,        24.05<br>

       433.milc,        14.00,        37.60,         2.69<br>

40.80,         2.91<br>

       444.namd,        10.20,        33.20,         3.25<br>

34.20,         3.35<br>

     453.povray,         6.50,        32.20,         4.95<br>

48.80,         7.51<br>

        470.lbm,        31.00,       107.00,         3.45<br>

138.00,         4.45<br>

    482.sphinx3,         7.66,        23.90,         3.12<br>

26.80,         3.50<br>

<div class="HOEnZb"><div class="h5"><br>

<br>

On Wed, Jun 27, 2012 at 10:50 PM, Kostya Serebryany <<a href="mailto:kcc@google.com">kcc@google.com</a>> wrote:<br>

><br>

><br>

> On Wed, Jun 27, 2012 at 7:30 PM, John Criswell <<a href="mailto:criswell@illinois.edu">criswell@illinois.edu</a>><br>

> wrote:<br>

>><br>

>> On 6/26/12 11:20 AM, Kostya Serebryany wrote:<br>

>><br>

>> John,<br>

>><br>

>> On Tue, Jun 26, 2012 at 7:45 PM, John Criswell <<a href="mailto:criswell@illinois.edu">criswell@illinois.edu</a>><br>

>> wrote:<br>

>>><br>

>>> On 6/26/12 4:18 AM, Kostya Serebryany wrote:<br>

>>><br>

>>> Updated the patch to use the new interface for TLS<br>

>>> (GlobalVariable::GeneralDynamicTLSModel).<br>

>>> Comments?<br>

>>><br>

>>><br>

>>> First, there's the non-technical question of whether this belongs in LLVM<br>

>>> mainline or not in its current state.  Do you have users of this tool in the<br>

>>> same way that you had users of ASan?  What platforms have you tested this<br>

>>> tool on?  Do you get false positives when you link with native code<br>

>>> libraries, and if so, is this false positive rate low?  I get the impression<br>

>>> that this tool isn't as mature as TSan or ASan,<br>

>><br>

>><br>

>> Your impression is correct -- msan is not yet usable. It works only on<br>

>> linux-x86_64, only on small libc-free tests.<br>

>> It has been heavily tested (on SPEC and other large programs), but w/o<br>

>> actually poisoning the shadow.<br>

>><br>

>>><br>

>>> so I'm wondering whether it's really ready to go in yet.<br>

>><br>

>><br>

>> Having these 500 LOC in trunk will simplify further development a lot.<br>

>> (we will also need a few lines of driver changes).<br>

>> I can't tell whether this justifies the commit to trunk now (it certainly<br>

>> does for me).<br>

>><br>

>><br>

>> I think it would be better to fix the external code issue first before<br>

>> trying to commit to LLVM trunk; it's the only obstacle I see that prevents<br>

>> you from having a working tool.  That way, you can have a working tool and<br>

>> get feedback on it; you'll also know what the ramifications are for your<br>

>> approach to handling external library code.  As it is right now, I think<br>

>> this new tool is still in too early a stage to add to LLVM.<br>

><br>

><br>

> Having this code in trunk will simplify and speedup the development of the<br>

> dynamic component(s).<br>

> It may also inspire some research in the area of LLVM-based binary<br>

> instrumentation.<br>

> But I'll agree on what the community says.<br>

><br>

><br>

>><br>

>><br>

>><br>

>><br>

>>><br>

>>><br>

>>> Second, on the technical side, I think dealing with the external code<br>

>>> problem has to be addressed before the code can be committed; I suspect the<br>

>>> tool will return too many false positives if you don't address it (if you<br>

>>> have experiments indicating otherwise, please let us know).  I'm assuming<br>

>>> from the comment at the beginning of the pass's implementation that it's<br>

>>> still unaddressed.<br>

>><br>

>><br>

>> We are working on this, in fact experimenting in a few different<br>

>> directions (DynamoRIO, PIN, static instrumentation, even full recompile).<br>

>><br>

>><br>

>> Okay.<br>

>><br>

>><br>

>><br>

>>><br>

>>> I think instrumenting all library code is unrealistic; you'd have to have<br>

>>> every library on your system compiled with the sanitizer (at least for large<br>

>>> GUI applications).<br>

>><br>

>><br>

>> It may still be doable in some restricted environments, but yes, I agree<br>

>> in general.<br>

>><br>

>>><br>

>>> I think there's a few other approaches that you can take:<br>

>>><br>

>>> 1) Assume external code works properly.  To do this, you first ensure<br>

>>> that memory allocated by external code is always marked as initialized.<br>

>>> Second, when passing memory objects from internal, instrumented code to<br>

>>> external library code, you mark the memory as initialized (since you're<br>

>>> assuming the external code will initialize it).<br>

>><br>

>><br>

>> That doesn't work.<br>

>> What if I pass my pointer to a library function.<br>

>> What has to be un-poisoned (marked as inited)?<br>

>> The contents of the heap/stack/bss block to which this pointer belongs?<br>

>><br>

>><br>

>> Yes, that's what I was thinking.<br>

>><br>

>><br>

>> How about pointers that reside in this block? Do we need to go deeper?<br>

>><br>

>><br>

>> Ah, that's a problem.  You might be able to do this conservatively by<br>

>> treating values that point into valid memory objects as pointers (like<br>

>> conservative garbage collection), but it'd be expensive and probably not<br>

>> worth the trouble.<br>

><br>

><br>

> It'll be prohibitively expensive, if at all possible.<br>

> Even with all the hacks we put into the allocator code, checking whether a<br>

> given integer is a currently addressable pointer is not free. For stack<br>

> buffers that's even more expensive.<br>

> Some pointers may point slightly off the buffer (e.g. to -1'th element).<br>

><br>

>><br>

>><br>

>><br>

>><br>

>>><br>

>>><br>

>>> Of course, you can do things like special casing the C library to reduce<br>

>>> false negatives, and you can instruct users to compile library code with the<br>

>>> compiler to find more bugs.<br>

>>><br>

>>> 2) Use something like SAFECode's CompleteChecks pass to determine which<br>

>>> checks are on memory objects handled completely within the analyzed portion<br>

>>> of the program and which checks can access memory handled by external code.<br>

>>> Checks on memory objects handled by external code always pass (or just<br>

>>> generate a warning instead of an error, or something like that).  This<br>

>>> approach requires a good points-to analysis algorithm.  SAFECode uses DSA<br>

>>> which is unification-based.  You can either use that or, if you prefer, use<br>

>>> a subset based algorithm like Anderson's.  I think there's one<br>

>>> implementation of Anderson's for LLVM out there, but I don't know how robust<br>

>>> it is.<br>

>><br>

>><br>

>> I'll have to look at this more, but I don't (yet) see how it can help.<br>

>> Again, suppose<br>

>><br>

>> struct A { int stuff; }<br>

>> struct B { A *a; }<br>

>> struct C { A *a, B* b};<br>

>> ...<br>

>> struct Z { A* a, B *b, C *c, ... Y *y;}<br>

>><br>

>> ...<br>

>> Z *z = new Z;<br>

>> library_function(z);<br>

>><br>

>> library_function can write to any part of Z, but we can't even traverse<br>

>> all of it.<br>

>> Can we do anything?<br>

>><br>

>><br>

>> No, you can't really say anything about the object to which Z points or<br>

>> the objects reachable from the pointer Z.  Any checks on z, z->a, etc. would<br>

>> have to conservatively pass.<br>

><br>

><br>

><br>

>><br>

>> This approach doesn't help you check for initialization errors with<br>

>> library code; it helps prevent library code from triggering false positives.<br>

>> It ensures that all of your checks are performed on memory objects handled<br>

>> by internal code.<br>

>><br>

>> For memory safety, relaxing the safety guarantees for memory objects<br>

>> passed to library code still results in a usable system that catches memory<br>

>> safety errors.  I don't know if the same is true for uninitialized value<br>

>> detection; you'll probably have to experiment and see.<br>

><br>

><br>

> That's not true for uninitialized value detection.<br>

><br>

> both asan and tsan know nothing about the code in the libraries and work<br>

> just fine.<br>

> They miss some percentage of bugs (based on experience with valgrind our<br>

> estimate is 5%-10% of bugs).<br>

> Adding a dynamic component to asan/tsan will add that coverage, and we are<br>

> going to do it.<br>

><br>

> For msan the story is completely different.<br>

> You either cover everything including system calls or don't go there.<br>

> Any memory write which the tool does not observe followed by a read in the<br>

> instrumented code leads to a false report.<br>

><br>

><br>

>><br>

>><br>

>><br>

>><br>

>><br>

>>><br>

>>><br>

>>> 3) Use dynamic binary instrumentation to handle external code.<br>

>>><br>

>>> The first option is probably the easiest to implement.  The second option<br>

>>> requires a points-to analysis that can go into LLVM mainline.  The third<br>

>>> option is a fair amount of work but probably easier than using points-to<br>

>>> analysis and will probably yield a better debugging tool.<br>

>><br>

>><br>

>> We *are* working on #3 (see above).<br>

>><br>

>><br>

>> I think using dynamic binary translation is probably the best option out<br>

>> of the three.  Just take my opinion with the caveat that I haven't run any<br>

>> experiments on uninitialized value usage.<br>

><br>

><br>

> It's actually the worst option, but the only one I know how to implement.<br>

> I'd really prefer a fully static approach (static binary translation, or<br>

> full recompile) but I don't see any suitable technology. :(<br>

><br>

>>><br>

>>><br>

>>> Third, other technical nit-picks:<br>

>>><br>

>>> 1) IMHO, in general, every method and function should have a comment<br>

>>> explaining what it does and the meaning of its inputs and return values.<br>

>>> Descriptions should use complete sentences.<br>

>><br>

>><br>

>> Agree, I'll add comments.<br>

>><br>

>>><br>

>>><br>

>>> 2) You may want to consider using a class derived from InstVisitor<br>

>>> instead of writing large if/then/else or switch code.<br>

>><br>

>><br>

>> I Didn't know about InstVisitor.<br>

>> However, I am not sure if that would be simpler. Will have a closer look.<br>

>><br>

>><br>

>> InstVisitor is faster than using dyn_cast<>() and is the same as using a<br>

>> switch on the instruction opcode (which you code does in some places).  It<br>

>> may make your code more readable.<br>

><br>

><br>

> I am convinced. The new patch uses InstVisitor.<br>

><br>

>><br>

>><br>

>><br>

>><br>

>>><br>

>>><br>

>>> 3) If I understand correctly, a "shadow instruction" propagates<br>

>>> information about initialization status through a computation (so, for<br>

>>> example, if you have a c = add a, b instruction, there's a shadow<br>

>>> instruction that computes c's initialization status based on a and b's<br>

>>> initialization status).<br>

>><br>

>><br>

>> Correct.<br>

>><br>

>>><br>

>>> It looks like you add shadow instructions for all instructions and then<br>

>>> "fill in" the operands to the shadow instructions by iterating over the<br>

>>> instructions a second time.  Is this correct?<br>

>><br>

>><br>

>> Yes.<br>

>><br>

>><br>

>> How do you handle passing taint (i.e., initialization status) between<br>

>> actual parameters and formal parameters in a function call?  Do you store<br>

>> the information to memory and then read it in the callee?<br>

><br>

><br>

> Yes, the shadow is passed via a dedicated thread-local memory slot.<br>

><br>

>><br>

>><br>

>><br>

>><br>

>>><br>

>>> If so, there's an alternative approach: if you fetch the dominator tree<br>

>>> for the function and visit instructions in dominator tree order, then you'll<br>

>>> visit all definitions before all uses.  That will alleviate the need to make<br>

>>> two passes over all the instructions.  The only tricky part is phi nodes,<br>

>>> but for those, you add shadow instructions (phi nodes with undef operands)<br>

>>> to them first and then visit all the other instructions in dominator tree<br>

>>> order; you fill in the operands of the shadow phi nodes at the end.<br>

>>><br>

>>> I'm not sure which approach is faster (I don't think LLVM's dominator<br>

>>> tree pass is the most efficient implementation), but it's something to think<br>

>>> about.<br>

>><br>

>><br>

>> I actually did think about this. I did not experiment with using the<br>

>> dominator tree, but I think that the current way is at least as efficient.<br>

>><br>

>><br>

>> It's really a question of whether computing the dominator tree is faster<br>

>> or slower than iterating over all instructions twice (or whether dominator<br>

>> tree has already been computed for some previously run transform).  I<br>

>> suppose dominator tree might be better if you're not going to instrument all<br>

>> instructions.<br>

><br>

><br>

> Unless I am mistaken depth-first-order is enough.<br>

> The new patch uses that.<br>

><br>

><br>

> Thanks!<br>

><br>

> --kcc<br>

><br>

</div></div><div class="HOEnZb"><div class="h5">> _______________________________________________<br>

> llvm-commits mailing list<br>

> <a href="mailto:llvm-commits@cs.uiuc.edu">llvm-commits@cs.uiuc.edu</a><br>

> <a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits</a><br>

><br>

</div></div></blockquote></div><br></div></div>