<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 6/26/12 11:20 AM, Kostya Serebryany
wrote:<br>
</div>
<blockquote
cite="mid:CAN=P9pgDOpm6DR1gN+2QOHyhThxscehZSdm7gvFaHQG95Z0c+w@mail.gmail.com"
type="cite">
<meta http-equiv="Content-Type" content="text/html;
charset=ISO-8859-1">
<div style="font-family: arial,helvetica,sans-serif;"><font
size="2">John, <br>
<br>
<div class="gmail_quote">On Tue, Jun 26, 2012 at 7:45 PM, John
Criswell <span dir="ltr"><<a moz-do-not-send="true"
href="mailto:criswell@illinois.edu" target="_blank">criswell@illinois.edu</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
<div class="im">
<div>On 6/26/12 4:18 AM, Kostya Serebryany wrote:<br>
</div>
<blockquote type="cite">
<div style="font-family:arial,helvetica,sans-serif"><font
size="2">Updated the patch to use the new
interface for TLS
(GlobalVariable::GeneralDynamicTLSModel).
<div>Comments? <br>
</div>
</font></div>
</blockquote>
<br>
</div>
First, there's the non-technical question of whether
this belongs in LLVM mainline or not in its current
state. Do you have users of this tool in the same way
that you had users of ASan? What platforms have you
tested this tool on? Do you get false positives when
you link with native code libraries, and if so, is this
false positive rate low? I get the impression that this
tool isn't as mature as TSan or ASan, </div>
</blockquote>
<div><br>
</div>
<div>
<div>Your impression is correct -- msan is not yet
usable. It works only on linux-x86_64, only on small
libc-free tests. </div>
<div>It has been heavily tested (on SPEC and other large
programs), but w/o actually poisoning the shadow.</div>
<div> </div>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">so I'm wondering
whether it's really ready to go in yet.<br>
</div>
</blockquote>
<div><br>
</div>
<div>Having these 500 LOC in trunk will simplify further
development a lot. </div>
<div>(we will also need a few lines of driver changes). </div>
<div>I can't tell whether this justifies the commit to trunk
now (it certainly does for me). <br>
</div>
</div>
</font></div>
</blockquote>
<br>
I think it would be better to fix the external code issue first
before trying to commit to LLVM trunk; it's the only obstacle I see
that prevents you from having a working tool. That way, you can
have a working tool and get feedback on it; you'll also know what
the ramifications are for your approach to handling external library
code. As it is right now, I think this new tool is still in too
early a stage to add to LLVM.<br>
<br>
<blockquote
cite="mid:CAN=P9pgDOpm6DR1gN+2QOHyhThxscehZSdm7gvFaHQG95Z0c+w@mail.gmail.com"
type="cite">
<div style="font-family: arial,helvetica,sans-serif;"><font
size="2">
<div class="gmail_quote">
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> <br>
Second, on the technical side, I think dealing with the
external code problem has to be addressed before the
code can be committed; I suspect the tool will return
too many false positives if you don't address it (if you
have experiments indicating otherwise, please let us
know). I'm assuming from the comment at the beginning
of the pass's implementation that it's still
unaddressed.<br>
</div>
</blockquote>
<div><br>
</div>
<div>We are working on this, in fact experimenting in a few
different directions (DynamoRIO, PIN, static
instrumentation, even full recompile).</div>
</div>
</font></div>
</blockquote>
<br>
Okay.<br>
<br>
<font size="2"></font>
<blockquote
cite="mid:CAN=P9pgDOpm6DR1gN+2QOHyhThxscehZSdm7gvFaHQG95Z0c+w@mail.gmail.com"
type="cite">
<div style="font-family: arial,helvetica,sans-serif;"><font
size="2">
<div class="gmail_quote">
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> I think
instrumenting all library code is unrealistic; you'd
have to have every library on your system compiled with
the sanitizer (at least for large GUI applications). </div>
</blockquote>
<div><br>
</div>
<div>It may still be doable in some restricted environments,
but yes, I agree in general. </div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> I think there's a
few other approaches that you can take:<br>
<br>
1) Assume external code works properly. To do this, you
first ensure that memory allocated by external code is
always marked as initialized. Second, when passing
memory objects from internal, instrumented code to
external library code, you mark the memory as
initialized (since you're assuming the external code
will initialize it).<br>
</div>
</blockquote>
<div><br>
</div>
<div>That doesn't work. </div>
<div>What if I pass my pointer to a library function. </div>
<div>What has to be un-poisoned (marked as inited)? </div>
<div>The contents of the heap/stack/bss block to which this
pointer belongs? <br>
</div>
</div>
</font></div>
</blockquote>
<br>
Yes, that's what I was thinking.<br>
<br>
<blockquote
cite="mid:CAN=P9pgDOpm6DR1gN+2QOHyhThxscehZSdm7gvFaHQG95Z0c+w@mail.gmail.com"
type="cite">
<div style="font-family: arial,helvetica,sans-serif;"><font
size="2">
<div class="gmail_quote">
<div>How about pointers that reside in this block? Do we
need to go deeper? <br>
</div>
</div>
</font></div>
</blockquote>
<br>
Ah, that's a problem. You might be able to do this conservatively
by treating values that point into valid memory objects as pointers
(like conservative garbage collection), but it'd be expensive and
probably not worth the trouble.<br>
<br>
<blockquote
cite="mid:CAN=P9pgDOpm6DR1gN+2QOHyhThxscehZSdm7gvFaHQG95Z0c+w@mail.gmail.com"
type="cite">
<div style="font-family: arial,helvetica,sans-serif;"><font
size="2">
<div class="gmail_quote">
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> <br>
Of course, you can do things like special casing the C
library to reduce false negatives, and you can instruct
users to compile library code with the compiler to find
more bugs.<br>
<br>
2) Use something like SAFECode's CompleteChecks pass to
determine which checks are on memory objects handled
completely within the analyzed portion of the program
and which checks can access memory handled by external
code. Checks on memory objects handled by external code
always pass (or just generate a warning instead of an
error, or something like that). This approach requires
a good points-to analysis algorithm. SAFECode uses DSA
which is unification-based. You can either use that or,
if you prefer, use a subset based algorithm like
Anderson's. I think there's one implementation of
Anderson's for LLVM out there, but I don't know how
robust it is.<br>
</div>
</blockquote>
<div><br>
</div>
<div>I'll have to look at this more, but I don't (yet) see
how it can help. </div>
<div>Again, suppose</div>
<div><br>
</div>
<div>struct A { int stuff; }</div>
<div>struct B { A *a; }</div>
<div>struct C { A *a, B* b};</div>
<div>...</div>
<div>struct Z { A* a, B *b, C *c, ... Y *y;}</div>
<div><br>
</div>
<div>...</div>
<div>Z *z = new Z;</div>
<div>library_function(z);</div>
<div><br>
</div>
<div>library_function can write to any part of Z, but we
can't even traverse all of it. </div>
<div>Can we do anything? <br>
</div>
</div>
</font></div>
</blockquote>
<br>
No, you can't really say anything about the object to which Z points
or the objects reachable from the pointer Z. Any checks on z,
z->a, etc. would have to conservatively pass.<br>
<br>
This approach doesn't help you check for initialization errors with
library code; it helps prevent library code from triggering false
positives. It ensures that all of your checks are performed on
memory objects handled by internal code.<br>
<br>
For memory safety, relaxing the safety guarantees for memory objects
passed to library code still results in a usable system that catches
memory safety errors. I don't know if the same is true for
uninitialized value detection; you'll probably have to experiment
and see.<br>
<br>
<br>
<blockquote
cite="mid:CAN=P9pgDOpm6DR1gN+2QOHyhThxscehZSdm7gvFaHQG95Z0c+w@mail.gmail.com"
type="cite">
<div style="font-family: arial,helvetica,sans-serif;"><font
size="2">
<div class="gmail_quote">
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> <br>
3) Use dynamic binary instrumentation to handle external
code.<br>
<br>
The first option is probably the easiest to implement.
The second option requires a points-to analysis that can
go into LLVM mainline. The third option is a fair
amount of work but probably easier than using points-to
analysis and will probably yield a better debugging
tool.<br>
</div>
</blockquote>
<div><br>
</div>
<div>We *are* working on #3 (see above). <br>
</div>
</div>
</font></div>
</blockquote>
<br>
I think using dynamic binary translation is probably the best option
out of the three. Just take my opinion with the caveat that I
haven't run any experiments on uninitialized value usage.<br>
<br>
<blockquote
cite="mid:CAN=P9pgDOpm6DR1gN+2QOHyhThxscehZSdm7gvFaHQG95Z0c+w@mail.gmail.com"
type="cite">
<div style="font-family: arial,helvetica,sans-serif;"><font
size="2">
<div class="gmail_quote">
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> <br>
Third, other technical nit-picks:<br>
<br>
1) IMHO, in general, every method and function should
have a comment explaining what it does and the meaning
of its inputs and return values. Descriptions should
use complete sentences.<br>
</div>
</blockquote>
<div><br>
</div>
<div>Agree, I'll add comments. </div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> <br>
2) You may want to consider using a class derived from
InstVisitor instead of writing large if/then/else or
switch code.<br>
</div>
</blockquote>
<div><br>
</div>
<div>I Didn't know about InstVisitor. </div>
<div>However, I am not sure if that would be simpler. Will
have a closer look. <br>
</div>
</div>
</font></div>
</blockquote>
<br>
InstVisitor is faster than using dyn_cast<>() and is the same
as using a switch on the instruction opcode (which you code does in
some places). It may make your code more readable.<br>
<br>
<blockquote
cite="mid:CAN=P9pgDOpm6DR1gN+2QOHyhThxscehZSdm7gvFaHQG95Z0c+w@mail.gmail.com"
type="cite">
<div style="font-family: arial,helvetica,sans-serif;"><font
size="2">
<div class="gmail_quote">
<div><br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> <br>
3) If I understand correctly, a "shadow instruction"
propagates information about initialization status
through a computation (so, for example, if you have a c
= add a, b instruction, there's a shadow instruction
that computes c's initialization status based on a and
b's initialization status). </div>
</blockquote>
<div><br>
</div>
<div>Correct. </div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
It looks like you add shadow instructions for all
instructions and then "fill in" the operands to the
shadow instructions by iterating over the instructions a
second time. Is this correct?<br>
</div>
</blockquote>
<div><br>
</div>
<div>Yes. <br>
</div>
</div>
</font></div>
</blockquote>
<br>
How do you handle passing taint (i.e., initialization status)
between actual parameters and formal parameters in a function call?
Do you store the information to memory and then read it in the
callee?<br>
<br>
<blockquote
cite="mid:CAN=P9pgDOpm6DR1gN+2QOHyhThxscehZSdm7gvFaHQG95Z0c+w@mail.gmail.com"
type="cite">
<div style="font-family: arial,helvetica,sans-serif;"><font
size="2">
<div class="gmail_quote">
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> <br>
If so, there's an alternative approach: if you fetch the
dominator tree for the function and visit instructions
in dominator tree order, then you'll visit all
definitions before all uses. That will alleviate the
need to make two passes over all the instructions. The
only tricky part is phi nodes, but for those, you add
shadow instructions (phi nodes with undef operands) to
them first and then visit all the other instructions in
dominator tree order; you fill in the operands of the
shadow phi nodes at the end.<br>
<br>
I'm not sure which approach is faster (I don't think
LLVM's dominator tree pass is the most efficient
implementation), but it's something to think about.<br>
</div>
</blockquote>
<div><br>
</div>
<div>I actually did think about this. I did not experiment
with using the dominator tree, but I think that the
current way is at least as efficient.</div>
</div>
</font></div>
</blockquote>
<br>
It's really a question of whether computing the dominator tree is
faster or slower than iterating over all instructions twice (or
whether dominator tree has already been computed for some previously
run transform). I suppose dominator tree might be better if you're
not going to instrument all instructions.<br>
<br>
-- John T.<br>
<br>
</body>
</html>