[LLVMdev] llvm DSA - reproduce the result in PLDI 07 paper

Wan Zhiyuan wanzhiyuan at gmail.com
Wed May 6 20:07:28 PDT 2015


Dear John,
Thank you so much for your reply!
Please kindly correct me if my understanding is wrong.

From your reply, I understand that there are mainly four reasons why DSA's
precision may be low:
1. type inference
2. influence of external code
3. bug in handling function pointers
4. using the wrong DSA pass

As you mentioned in your previous mail, LLVM has changed a lot.
So I was wondering whether DSA's algorithm (presented in PLDI '07) could
achieve comparable precision on top of LLVM 3.5+ once the improvements are
implemented.

Also, I have diffed the DSA code between poolalloc release_19 and
release_32.  It seems to me that the changes in DSA mainly fall into the
following areas:
1. release_32 introduces DataLayout to do some type inference;
2. release_32 replaces some data structures with LLVM built-in data
structures;
3. release_32 handles different types of LLVM IR instructions, and handles
them in a different way;
4. the inheritance hierarchy of the analysis passes has changed a bit.

Therefore, I was wondering whether the introduction of DataLayout and the
changes in LLVM IR contribute to the imprecision as well.

Thank you!

Regards,

Zhiyuan Wan


On Wed, May 6, 2015 at 10:35 AM, John Criswell <jtcriswel at gmail.com> wrote:

>  On 5/5/15 5:33 PM, Wan Zhiyuan wrote:
>
> Dear John,
> I intend to implement the improvements on DSA.
>
>
> There be nasty dragons in DSA.  Don't say I didn't warn you.
> :)
>
> On a side note, I recently discovered that someone (I think Will Dietz)
> updated the code to work with LLVM mainline, so you should be able to
> update to a newer version of LLVM and use DSA.  I've created a snapshot of
> the LLVM source tree and a modified version of the poolalloc tree in
> https://github.com/jtcriswell/llvm-dsa that you may find useful.  I think
> Will is also maintaining a github tree that he updates regularly.
>
>  After running DSA on SPEC, I found DSA gives low precision for mcf and
> bzip2.
> I have checked the most imprecise C files in mcf and found that the code
> seems to be a mixture of "PHI" and "GEP" instructions.
>
>  Could you please give me some hints about what the big picture of the
> improvement should be and how to start?
>
>
> There may be many reasons why DSA's precision is low.  You'll need to
> figure out what the reason is for each program.  In some cases, it may be
> type inference.  In other cases, it may be that DSA is cognizant of the
> influence of external code (which causes it to correctly give pessimistic
> results).  For programs using function pointers, there's a bug in the
> algorithm in which DSNodes that should be marked complete are not.  It
> could also be that you're using the wrong DSA pass (e.g., using EQTD when
> TD will do).
>
> Looking at type inference specifically, the problem, in a nutshell, is
> that DSA's type inference should not rely upon LLVM's type annotations.  It
> should just create a map from offsets to types.  Some work on this has
> already been done (e.g., DSA can figure out that casting a pointer to a
> structure into a pointer to the type of the structure's first field still
> refers to offset 0).  However, there are still places in which DSA is relying
> upon LLVM's type annotations.
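> As a rough illustration (my own sketch, not DSA's actual code or API), such
> an offsets-to-types map for one abstract memory object might look like:

```python
# Hypothetical sketch of an offsets-to-types map for one abstract memory
# object, keyed purely by byte offset and independent of LLVM's type
# annotations.  All names here are illustrative, not DSA's real interface.

class MemoryObject:
    def __init__(self):
        self.types = {}  # byte offset -> set of type names observed there

    def add_access(self, offset, type_name):
        self.types.setdefault(offset, set()).add(type_name)

# struct S { int a; char *b; } on a 32-bit target:
obj = MemoryObject()
obj.add_access(0, "i32")   # access through ((struct S *)p)->a
obj.add_access(4, "i8*")   # access through ((struct S *)p)->b

# Casting 'struct S *' to 'int *' and dereferencing still lands on
# offset 0, so the map absorbs it without any new entry:
obj.add_access(0, "i32")
assert obj.types[0] == {"i32"}
assert obj.types[4] == {"i8*"}
```

> The point is that the map cares only about byte offsets, so offset-preserving
> casts cost nothing.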
>
> One thing that I would like to look at is changing how DSA analyzes arrays
> of structures.  Right now, DSA tries to infer a structure type for a memory
> object and then tries to infer whether the memory object is a singleton
> structure or an array of structures (you can see this in DSA's interface;
> you can see a map between offsets and types and an 'A' flag indicating that
> the object is an array).  I think this makes DSA needlessly complicated.
>
> I think it would be better if DSA did what Value Set Analysis (VSA) does.
> VSA was designed to analyze untyped binary code.  For each abstract memory
> object, it creates a map of 4-tuples to types.  Each 4-tuple represents a
> formula ax+b as (a, b, c, d), in which b is the offset, a is the stride, and
> x is constrained between the values c and d (c and d can be constants or
> +-infinity).  For example, if you have an array of struct {int a ; char *
> b} on a 32-bit machine, DSA currently tries to figure out that there's an
> array of structure elements in which there's an int at offset 0 and a char
> * at offset 4 within the structure.  VSA would instead say that there's a
> memory object with an int at every offset of the form 8x+0 and a char * at
> every offset of the form 8x+4.
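> To make the 4-tuple scheme concrete, here is a small sketch in my own
> notation (the function name and the enumeration cap are illustrative, not
> VSA's actual representation):

```python
# Sketch: a 4-tuple (a, b, c, d) describes the set of byte offsets
# {a*x + b : c <= x <= d}.  The 'limit' cap exists only to keep this
# demo finite; a real implementation would treat the bounds symbolically.

def offsets(a, b, c, d, limit):
    """Enumerate the offsets a*x + b for x in [c, d], below 'limit'."""
    x = c
    while x <= d and a * x + b < limit:
        yield a * x + b
        x += 1

# Array of struct { int a; char *b; } on a 32-bit machine: the struct is
# 8 bytes, so an int lives at 8x+0 and a char* at 8x+4.
int_offsets = list(offsets(8, 0, 0, 3, 32))
ptr_offsets = list(offsets(8, 4, 0, 3, 32))
```

> A real implementation would handle +-infinity bounds symbolically rather
> than enumerating offsets as this toy does.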
>
> The VSA approach should be agnostic to all sorts of weird casting,
> embedded arrays, and embedded unions, though this is an educated guess at
> present.
>
> Regards,
>
> John Criswell
>
>
>
>  Thank you!
>
>  Regards,
> Zhiyuan
>
> On Mon, Apr 6, 2015 at 5:22 PM, John Criswell <jtcriswel at gmail.com> wrote:
>
>>  Dear Zhiyuan,
>>
>> In order to reproduce the results from the paper, you'll need to
>> replicate a system from that era.  You'll need to use the same version of
>> LLVM and DSA that the paper used.  I think that was LLVM 1.9 (the
>> release_19 branch of LLVM and poolalloc), but I'm not sure.  You should
>> check to see if the paper specifies the version.
>>
>> As you'll be using a very old version of LLVM, it may be worth setting up
>> a VM with a corresponding old version of Linux.  I suspect newer compilers
>> will not compile a version of LLVM that is that old, so using an old
>> version of Linux with an old version of GCC may be needed.  I think Fedora
>> Core 2 is the OS we were using at the time.
>>
>> To answer the question of why you can't use a modern version of LLVM and
>> poolalloc, it's because LLVM has changed significantly.  DSA relies upon
>> the type annotations provided in the LLVM IR to "bootstrap" its type
>> inference (bootstrap is not quite the right word, but it's the closest one
>> of which I could think).  As LLVM matured, transformations would ditch the
>> type information (e.g., transforming typed GEPs into untyped GEPs on a
>> byte array), making it more difficult for DSA to do type inference (and
>> thereby improve field sensitivity).  Throw into the mix the fact that DSA
>> is maintained by an academic research group, and the result is that
>> modern DSA doesn't have the accuracy that the original DSA did.
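>> To illustrate the kind of information such lowering throws away, here is a
>> small sketch (my own example, not LLVM code): both forms below compute the
>> same address, but only the typed form exposes the struct layout:

```python
# Sketch of why lowering typed GEPs to byte arithmetic loses information.
# struct S { int a; char *b; } on a 32-bit target: sizeof(S) == 8,
# field 'b' sits at byte offset 4.
STRIDE, OFF_B = 8, 4

def typed_gep(i, field):
    # "Typed" form: index into an array of S, then select a named field.
    # The element type and field index carry structure an analysis can use.
    return i * STRIDE + (OFF_B if field == "b" else 0)

def untyped_gep(byte_offset):
    # "Untyped" form: the same address as one flat byte offset into i8*.
    # The struct layout is gone; the analysis must reconstruct it.
    return byte_offset

# Both compute the address of s[3].b, but only the first says so:
assert typed_gep(3, "b") == untyped_gep(3 * 8 + 4) == 28
```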
>>
>> The good news is that I think DSA can be fixed by making its
>> type-inferencing code smarter.  The bad news is that it'd be a fair amount
>> of work to do.  So far, no one has had sufficient desire/motivation to
>> design and implement the improvements.
>>
>> Regards,
>>
>> John Criswell
>>
>>
>>
>> On 4/6/15 4:56 PM, Wan Zhiyuan wrote:
>>
>>  Dear all,
>> I am trying to reproduce the "Percent May Alias" result described in PLDI
>> 07's paper "Making Context-Sensitive Points-to Analysis with Heap Cloning
>> Practical For The Real World" (
>> http://llvm.org/pubs/2007-06-10-PLDI-DSA.html).
>>
>>  However, my "Percent May Alias" for all the benchmarks is much greater,
>> especially "bzip2".
>>
>>  The DSA code I use is from "https://github.com/smackers/smack".  I have
>> diffed the code between smack and poolalloc release_32.  They are almost
>> the same except for the "#include" statements.
>>
>>  I was wondering whether I need to do some configuration to make DSA
>> work properly.
>>
>>  Thank you!
>>
>>  Zhiyuan
>>
>>
>>
>>
>>
>>  _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu
>> http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>
>>
>>
>> --
>> John Criswell
>> Assistant Professor
>> Department of Computer Science, University of Rochester
>> http://www.cs.rochester.edu/u/criswell
>>
>>
>
>
> --
> John Criswell
> Assistant Professor
> Department of Computer Science, University of Rochester
> http://www.cs.rochester.edu/u/criswell
>
>
