[LLVMdev] llvm DSA - reproduce the result in PLDI 07 paper

John Criswell jtcriswel at gmail.com
Thu May 7 07:19:19 PDT 2015


On 5/6/15 11:07 PM, Wan Zhiyuan wrote:
> Dear John,
> Thank you so much for your reply!
> Please kindly correct me if my understanding is wrong.
>
> From your reply, I learn that there could be four main reasons why
> DSA's precision is low:
> 1. type inference
> 2. influence of external code
> 3. bug in handling function pointers

To be clear, this is a fault in the algorithm itself (i.e., a completely 
correct implementation would still have the problem).  No one at the 
time was aware of this flaw in the algorithm; a master's student 
discovered it several years after the paper was published.

> 4. using the wrong DSA pass
>
> As you mentioned in your previous mail, LLVM changes a lot.
> So I was wondering if DSA's algorithm (presented in PLDI '07) can
> achieve comparable precision on top of LLVM 3.5+ after implementing
> the improvements.

That's an open question and dependent on a lot of things.

I also think you're asking the wrong question.  It sounds like you want 
to fix DSA because doing so will make it more precise, but you haven't 
discussed why you need that precision.

In my opinion, fixing DSA will be a lot of work (2-3 months at least).  
I would not spend time fixing DSA just to see if some synthetic 
precision results get better.  I would fix DSA only if I were writing an 
analysis or transformation that needed points-to, type-inference, and/or 
call graph results more accurate than what DSA currently provides 
*and* if fixing DSA were the cheapest way to get those more accurate 
results.

>
> Also, I have diffed the DSA code in poolalloc release_19 and release_32.
> It seems to me that the changes in DSA mainly come from the following
> aspects:
> 1. release_32 introduces the DataLayout to do some type inference;

Not really.  DataLayout used to be called TargetData.  Someone changed 
the name within the past year or so.

> 2. release_32 replaces some data structures with LLVM built-in data 
> structures;
> 3. release_32 handles different kinds of LLVM IR instructions, and
> handles them differently.
> 4. the inheritance of Analysis Passes has changed a bit.
>
> Therefore, I was wondering if the introduction of DataLayout and 
> changes in LLVM IR cause the imprecision as well.

You're looking at all the cosmetic changes needed to make the DSA code 
compile with modern LLVM.  None of these things is what makes DSA less 
precise.

When I say that changes in LLVM have caused DSA to become less precise, 
I am talking about two issues.  First, newer versions of the LLVM 
front-end generate LLVM IR that DSA does not analyze well. For example, 
after LLVM 1.9, the C front-ends started changing the way they generated 
code for vararg functions.  DSA treated the new code conservatively.
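
As a point of reference, here is a minimal (illustrative) vararg function 
of the kind involved; the change was in the LLVM IR that the front end 
emits for va_start/va_arg, not in the C source itself:

    #include <stdarg.h>

    /* Sums 'count' additional int arguments. */
    int sum(int count, ...) {
        va_list ap;
        int total = 0;
        va_start(ap, count);
        for (int i = 0; i < count; i++)
            total += va_arg(ap, int);
        va_end(ap);
        return total;
    }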

Second, some of the LLVM optimization passes started replacing 
well-typed GEP instructions with code that cast the pointer to a char *, 
did a GEP, and then cast the pointer back to its original type.  
Since DSA wasn't designed to analyze this idiom, it just thought that 
the pointer was being used like an untyped byte array. This made DSA's 
results a lot less precise because it dropped field sensitivity for 
these memory objects.
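
To make the idiom concrete, here is a rough C-level sketch of the 
pattern (made-up names; not code taken from LLVM or DSA):

    #include <stddef.h>

    struct node {
        int key;
        struct node *next;
    };

    /* Field access the way the front end emits it: a well-typed GEP
     * that selects the 'next' field of struct node. */
    struct node *get_next_typed(struct node *n) {
        return n->next;
    }

    /* C equivalent of the rewritten form: cast to char *, index by a
     * raw byte offset, cast back.  To DSA this looks like indexing
     * into an untyped byte array. */
    struct node *get_next_untyped(struct node *n) {
        char *p = (char *)n;
        return *(struct node **)(p + offsetof(struct node, next));
    }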

My suggestion to replace DSA's type-inference code with something 
similar to VSA is aimed at addressing the second issue in a way 
that avoids having to change DSA every time the LLVM optimizations 
change what they do.  In essence, by making DSA less reliant on the LLVM 
type information, the LLVM optimizations can do whatever they want with 
the type information, and DSA should still continue to get the same results.
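
A very rough sketch of that direction, recording inferred types by byte 
offset within each abstract memory object instead of trusting the LLVM 
type on the access (all names here are hypothetical, not from the DSA 
sources):

    /* The kinds of scalar types the analysis might infer. */
    enum inferred_type { TY_UNKNOWN, TY_INT32, TY_FLOAT, TY_POINTER };

    /* One inferred field of an abstract memory object. */
    struct field_entry {
        unsigned offset;          /* byte offset within the object */
        enum inferred_type type;  /* type observed at that offset  */
    };

    /* Offset-to-type map for a single abstract memory object. */
    struct memory_object {
        struct field_entry fields[16];  /* small fixed-size sketch */
        unsigned num_fields;
    };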

To summarize, DSA lost precision because it made assumptions about the 
LLVM IR it would be analyzing.  When LLVM changed in ways that broke 
those assumptions, DSA started generating less precise results.

Regards,

John Criswell

>
> Thank you!
>
> Regards,
>
> Zhiyuan Wan
>
>
> On Wed, May 6, 2015 at 10:35 AM, John Criswell <jtcriswel at gmail.com> wrote:
>
>     On 5/5/15 5:33 PM, Wan Zhiyuan wrote:
>>     Dear John,
>>     I intend to implement the improvements on DSA.
>
>     There be nasty dragons in DSA.  Don't say I didn't warn you.
>     :)
>
>     On a side note, I recently discovered that someone (I think Will
>     Dietz) updated the code to work with LLVM mainline, so you should
>     be able to update to a newer version of LLVM and use DSA.  I've
>     created a snapshot of the LLVM source tree and a modified version
>     of the poolalloc tree in https://github.com/jtcriswell/llvm-dsa
>     that you may find useful.  I think Will is also maintaining a
>     github tree that he updates regularly.
>
>>     After running DSA on SPEC, I found DSA gives low precision for
>>     mcf and bzip2.
>>     I have checked the most imprecise C files in mcf and found that
>>     the code seems to be a mixture of "PHI" and "GEP" instructions.
>>
>>     Could you please give me some hints about what the big picture of
>>     the improvement should be and how to start?
>
>     There may be many reasons why DSA's precision is low.  You'll need
>     to figure out what the reason is for each program.  In some cases,
>     it may be type inference. In other cases, it may be that DSA is
>     cognizant of the influence of external code (which causes it to
>     correctly give pessimistic results).  For programs using function
>     pointers, there's a bug in the algorithm in which DSNodes that
>     should be marked complete are not.  It could also be that you're
>     using the wrong DSA pass (e.g., using EQTD when TD will do).
>
>     Looking at type inference specifically, the problem, in a
>     nutshell, is that DSA's type inference should not rely upon LLVM's
>     type annotations.  It should just create a map from offsets to
>     types.  Some work on this has already been done (e.g., DSA can
>     figure out that casting a pointer to a structure into a pointer to
>     the type of the structure's first field still refers to offset 0). 
>     However, there are still places in which DSA is relying upon
>     LLVM's type annotations.
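
The offset-0 case mentioned there, sketched in C (illustrative only, 
not DSA code):

    struct pair {
        int first;
        double second;
    };

    int read_first(struct pair *p) {
        /* Same address, byte offset 0 within the same object... */
        int *q = (int *)p;
        /* ...so this is equivalent to p->first. */
        return *q;
    }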
>
>     One thing that I would like to look at is changing how DSA
>     analyzes arrays of structures.  Right now, DSA tries to infer a
>     structure type for a memory object and then tries to infer whether
>     the memory object is a singleton structure or an array of
>     structures (you can see this in DSA's interface: a map between
>     offsets and types, plus an 'A' flag indicating that the
>     object is an array).  I think this makes DSA needlessly complicated.
>
>     I think it would be better if DSA did what Value Set Analysis
>     (VSA) does.  VSA was designed to analyze untyped binary code.  For
>     each abstract memory object, it creates a map of 4-tuples to
>     types.  Each 4-tuple represents a formula ax+b as (a,b,c,d) in
>     which b is the offset, a is the stride, and x is constrained between
>     values c and d (c and d can be constants or +-infinity).  For
>     example, if you have an array of struct {int a ; char * b} on a
>     32-bit machine, DSA currently tries to figure out that there's an
>     array of structure elements in which there's an int at offset 0
>     and a char * at offset 4 within the structure.  VSA would instead
>     say that there's a memory object with an int at every offset 8x+0
>     and a char * at every offset 8x+4 (the stride being the 8-byte
>     element size).
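
To make those numbers concrete, here is a small C check of that layout 
(values as they come out on a 32-bit target; illustrative only):

    #include <stdio.h>
    #include <stddef.h>

    struct rec {
        int   a;
        char *b;
    };

    int main(void) {
        /* On a 32-bit target: sizeof(struct rec) == 8, offsetof(a) == 0,
         * offsetof(b) == 4.  In an array of struct rec, the ints sit at
         * offsets 8x+0 and the char * fields at 8x+4; as 4-tuples
         * (stride, offset, lower, upper), that is (8, 0, 0, n-1) and
         * (8, 4, 0, n-1) for an array of n elements. */
        printf("sizeof(struct rec) = %zu\n", sizeof(struct rec));
        printf("offsetof(a) = %zu, offsetof(b) = %zu\n",
               offsetof(struct rec, a), offsetof(struct rec, b));
        return 0;
    }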
>
>     The VSA approach should be agnostic to all sorts of weird casting,
>     embedded arrays, and embedded unions, though this is an educated
>     guess at present.
>
>     Regards,
>
>     John Criswell
>
>
>>
>>     Thank you!
>>
>>     Regards,
>>     Zhiyuan
>>
>>     On Mon, Apr 6, 2015 at 5:22 PM, John Criswell <jtcriswel at gmail.com> wrote:
>>
>>         Dear Zhiyuan,
>>
>>         In order to reproduce the results from the paper, you'll need
>>         to replicate a system from that era.  You'll need to use the
>>         same version of LLVM and DSA that the paper used.  I think
>>         that was LLVM 1.9 (the release_19 branch of LLVM and
>>         poolalloc), but I'm not sure.  You should check to see if the
>>         paper specifies the version.
>>
>>         As you'll be using a very old version of LLVM, it may be
>>         worth setting up a VM with a corresponding old version of
>>         Linux.  I suspect newer compilers will not compile a version
>>         of LLVM that is that old, so using an old version of Linux
>>         with an old version of GCC may be needed.  I think Fedora
>>         Core 2 is the OS we were using at the time.
>>
>>         To answer the question of why you can't use a modern version
>>         of LLVM and poolalloc, it's because LLVM has changed
>>         significantly.  DSA relies upon the type annotations provided
>>         in the LLVM IR to "bootstrap" its type inference (bootstrap
>>         is not quite the right word, but it's the closest one of
>>         which I could think).  As LLVM matured, transformations would
>>         ditch the type information (e.g., transforming typed GEPs
>>         into untyped GEPs on a byte array), making it more difficult
>>         for DSA to do type inference (and thereby improve field
>>         sensitivity).  Throw into the mix the fact that DSA is
>>         maintained by an academic research group, and the result is
>>         that modern DSA doesn't have the accuracy that the original
>>         DSA did.
>>
>>         The good news is that I think DSA can be fixed by making its
>>         type-inferencing code smarter.  The bad news is that it'd be
>>         a fair amount of work to do.  So far, no one has had
>>         sufficient desire/motivation to design and implement the
>>         improvements.
>>
>>         Regards,
>>
>>         John Criswell
>>
>>
>>
>>         On 4/6/15 4:56 PM, Wan Zhiyuan wrote:
>>>         Dear all,
>>>         I am trying to reproduce the "Percent May Alias" result
>>>         described in PLDI 07's paper "Making Context-Sensitive
>>>         Points-to Analysis with Heap Cloning Practical For The Real
>>>         World" (http://llvm.org/pubs/2007-06-10-PLDI-DSA.html).
>>>
>>>         However, my "Percent May Alias" for all the benchmarks is
>>>         much greater, especially "bzip2".
>>>
>>>         The DSA code I use is from
>>>         "https://github.com/smackers/smack".  I have diffed the code
>>>         between smack and poolalloc_32.  They are almost the same
>>>         except for the "#include" statements.
>>>
>>>         I was wondering whether I need to do some configuration to
>>>         make DSA work properly.
>>>
>>>         Thank you!
>>>
>>>         Zhiyuan
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
>


-- 
John Criswell
Assistant Professor
Department of Computer Science, University of Rochester
http://www.cs.rochester.edu/u/criswell
