[LLVMdev] Google Summer of Code Idea

Fri Mar 7 19:16:08 PST 2008

FWIW, Ben Hardekopf asked me to post this for him on the subject

Regarding Whaley and Lam's BDD-based approach to context-sensitive
analysis, there is a paper by the same research group (Avots et al in
ICSE'05) that adapts their technique for C. The results are not very
encouraging -- their system doesn't scale nearly as well for C as it did
for Java. The primary difference is that unlike Java, in C top-level
variables can have their address taken and be passed between functions,
which requires the analysis to maintain 2 contexts for each points-to
relation: one for the pointer and one for the pointee.

The other problem with their method is that only the top-level variables
are treated context-sensitively; everything in the heap (e.g. for Java,
all object fields) is context-insensitive. I spoke with John Whaley about
this and he said they've tried to use a context-sensitive heap model, but
they couldn't get it to scale at all.

My personal feeling (backed up by results from some other research groups,
e.g. Chen et al in ISPAN'02 and Nystrom et al in PASTE'04) is that a
context-sensitive heap model is critical for getting the maximum benefit
from a context-sensitive pointer analysis, and I don't know that BDDs, at
least as currently used, are the right approach for this.

I was impressed by a completely different approach, described by Nystrom
et al in SAS'04 and Nystrom's PhD thesis, for doing context-sensitive
pointer analysis for C with a context-sensitive heap model. Like Whaley
and Lam, and unlike Lattner and Adve's Data Structure Analysis, their
technique is inclusion-based (i.e. a context-sensitive version of
Andersen's), but unlike Whaley and Lam it also uses a context-sensitive
heap model, and it's targeted at C. They managed to scale the analysis to
a couple hundred thousand lines of code. I actually have several ideas on
how to improve their analysis and make it more scalable, but I haven't had
time to implement them -- mainly because I haven't wanted to re-implement
their whole analysis in LLVM. I would definitely be interested in any
effort to do this, and I think it would have a lot of benefit. In fact, I
supervised another student who implemented a simplified version of this
analysis for a class project; while the code itself probably isn't
re-usable, the student made a number of good observations on the analysis
that I would be happy to share with anyone who is interested.

Thanks,

Ben