[cfe-dev] Dataflow analysis with LLVM/Clang

Thu Oct 2 04:05:23 PDT 2008

On Wed, Oct 1, 2008 at 10:19 PM, Ted Kremenek <kremenek at apple.com> wrote:
> I think Mike's comments are pretty much spot on.  To me this really amounts
> to listing out your requirements and what you are trying to accomplish.

Sorry guys, re-reading my email I agree I wasn't clear enough. To be
extremely honest, the supervisor of this project also didn't give much
details about the project. I doing this as an "research experience" on
the Technical University of Madrid, but actually I'm a MSc student at
University of Campinas (Brazil), so I just got here and just got in
touch with the project. A asked him some questions and now have more
details about what I need to do. I also don't want to go too OT, but
I'll try to explain a bit about what is the intended use of this info.

This analysis is going to be part of a project that tries to obtain a
deterministic behavior from multi-threaded replicas of a replicated
server (since determinism is one of the requirements for active
replication). Assuming all the requests arrive at the same order at
all the replicas (what can be achieved with a total-ordered
total-reliable multicast infrastructure) there is a scheduler, which
creates a new thread to handle each requisition and controls the
scheduling of the threads (it's implemented as an API with sleeps over
the thread API). This scheduler needs the info about about what is the
next conflictive action each thread will perform, so it can enforce an
order of scheduling between the conflictive actions, which is going to
be the same at all replicas. The info is going to be sent by each
thread through an API of the scheduler (something like
notify(lock_a)), and it needs to be sent as early as possible so the
scheduler can have info about the future actions of each thread.

>  From a high-level, it sounds like what you want to do is program
> transformation.  If the goal of the transformation is to change runtime
> behavior, then you can perform the transformation at either the LLVM IR
> level or by rewriting source code using Clang.  If the goal is to modify the
> original source code so that users now are working with an instrumented
> source file, then obviously this has to be done using Clang.

The final product has to be instrumented C source code, that can be
compiled with any C compiler. The insertions has to be easily
verifiable, and it seems to me that the easiest way to achieve that is
to make the insertions over the original C code. I've read that there
is a LLVM backend that translates LLVM to C, but I'm not sure how
similar to the original program is the code generated (I think it's
not intended to recall the original program).

> I'm going to assume that your goal is simply to modify runtime behavior.  If
> that is the case, my gut feeling is that it is better to do it at the LLVM
> IR level if you really don't require any specific knowledge about C.  The
> lowered representation of the LLVM IR marginalizes out details of the
> high-level language that may be really superfluous for your task; C is a
> "rich" language with many constructs, so your analysis would have to reason
> about many edge cases.  There are many other tradeoffs that we can go into
> if you are interested.

I can make assumptions to simplify the analysis, such as there are no
typedefs or macros on the program, and then try to relax this
constraints. The idea at first is to make a proof-of-concept, which
just handles pthread_mutex_lock() and pthread_mutex_unlock(). Then
latter I can try to make it deal with more generic C code.

> I think the other thing to keep in mind is how the concurrency primitives
> whose uses you are interested in monitoring are represented both in Clang's
> AST and the LLVM IR.  If you can easily identify when such primitives are
> used at the LLVM IR level, then doing your transformations there makes the
> most amount of sense to me (given the information I know about what you are
> trying to do).

The definition of a conflictive action needs to be flexible. At first,
I'll look for statements like pthread_mutex_lock() and
pthread_mutex_unlock(), but maybe for some reason I can have a program
where the conflictive action would be foo(), so I'll need to be able
to configure my tool (maybe through a config file) for which primitive
to look for. Maybe clang would be more flexible at this point, since I
imagine (and please correct me here if I'm wrong) that each statement
at clang's AST it's similar to it's form at the source code. If using
LLVM I imagine I'll have to figure out for each new type of
conflictive action that my appear it's representation on the LLVM IR.

> I'm not 100% certain how you wanted to use line information.  Certainly
> Clang has rich information about the locations of expressions within a
> source file, but LLVM IR can capture some debugging information that may be
> useful for constructing the line information you need (others can chime in
> here, since I'm not an expert on this topic).  It all depends on what you
> are trying to do.

The line info would be used to know exactly at what point of the
source file the code insertion should be made. Maybe it's enough to
insert statements on clang's AST, and I will not need to handle this
by hand. Or maybe I can do the analysis and figure out the
transformations over LLVM, and latter use line info to modify the
source code by hand.

Right now it seems to me that for what I need clang it's more
appropriated, but I still not completely sure. I would also like to
hear any comments about if this kind of work can be useful to
LLVM/Clang, because if it's similar to something you guys need, I can
try to write it in such a way that it can be extended or modified to
suits your needs.

-- 
João Paulo Rechi Vita
MSc Computer Science Student
Computer Engineer
IC / Unicamp
http://jprvita.wordpress.com/