[cfe-dev] MemRegions - how to (re)use them right?

Mon Oct 19 07:30:10 PDT 2009

Hi Ted,

it's been a while since I asked about the topic hence I appended (aka 
top-post) our conversation so far.
First, I think I need a basic flow-sensitive dataflow analysis only. 
That is, at least the code block of interest can be considered to be a 
basic block. However it might be of course embedded in for-stmts or the 
like. I looked at the graphs produced by GRExprEngine (cmd-line was 
clang-cc -analyzer-viz-egraph-graphviz -checker-cfref -analyze 
-analyzer-store=basic -analyzer-constraints=basic) but I found the 
output not very useful in the presence of loops - it seems that the 
analyzer performs some sort of simulation of the program. The graph 
contains the loop body several times (every time the index was counted 
up). In general I think GRExprEngine does too much for my purpose (at 
least at the moment).
Meanwhile I 'm able to further refine my requirements. Assume the 
following code:

void foo()
{
  for (/*...*/)  // a loop is always involved
  {
    lhs_1 = rhs_1;
    lhs_2 = rhs_2;
    //aso.
  }
}

At block level the for-body contains binary assignment expressions only. 
Now I just want to know, if and where lhs_x appears anywhere in rhs_y 
(with x < y, let's ignore x > y for the moment). Then I could expand 
rhs_y by replacing lhs_x with rhs_x, perhaps recursively. This of course 
isn't that easy as it sounds: you have to resolve things like *&a <-> a 
or a->b <-> (*a).b.
Thus all I need is a unique identifier for lhs_x and a function which 
for a given arbitrary rvalue-expression finds the referred lhs_x 
identifier (if one exists). Given these my initial example could be 
solved quite elegant:

A* ptr = &temp;                  // var ptr gets ID1
temp.a = /*expr*/;               // field var temp.a gets ID2
temp.b = /* another_expr */;     // field var temp.b gets ID3
*sink = temp.a;   // function detects ID2 for temp.a: expand to *sink = 
/*expr*/: done
*sink = ptr->b;   // function detects ID1 for ptr: expand to *sink = 
(&temp)->b:
                  // function detects ID3 for (&temp)->b: expand to 
*sink  = /* another_expr */: done

I think, that one feasible approach to do this is working on the AST 
level by unifiying expressions somehow (e.g. something like a unified 
string) and then just search for them in other unified expressions. 
However I'm afraid that this approach isn't very extensible. At least it 
means some work which possibly has been done.
OTOH I've seen the MemRegions which apparently are already designed so 
that they can act easily as these unique identifiers I mentioned above. 
Also the PostLoad node seems to designate sub-expressions for which a 
MemRegion exists. But the only clang part dealing with MemRegions so far 
is the GRExprEngine and as I said I'm note sure if it is what I need. Or 
it's me who is still not able to use GRExprEngine properly for my purpose.

What do you think? Are this or similar problems already solved, maybe by 
others?
Thanks for any help.

Best Regards
Olaf Krzikalla

Ted Kremenek schrieb:
> Hi Olaf,
>
> The short answer is "yes this can be done," and others have voiced 
> interest in doing taint analysis.
>
> At a high-level, do you want to do basic flow-sensitive dataflow 
> analysis or path-sensitive analysis?
>
> MemRegions are used by the path-sensitive dataflow engine 
> (GRExprEngine) to do a lazy abstraction of memory.  They are used in 
> combination with SVals (see SVals.h) to reason about the values of 
> expressions along a given path.  They aren't used just for 
> diagnostics, but are used to precisely reason about memory and memory 
> bindings (e.g., what value binds to a variable).
>
> If you wanted to do this using path-sensitive analysis, one way to do 
> this is to walk the ExplodedGraph produced by GRExprEngine.  The 
> ExplodedGraph represents the possible paths traced within a function, 
> and each node consists of a program location (e.g., an expression) and 
> a program state.  Specific ExplodedNodes have locations that represent 
> loads/stores.  You could "simply" walk the graph (using DFS) looking 
> for loads, and then trace where individual expression values are 
> passed to stores.  Each store consists of a location the value will be 
> stored and the value to be stored.  After a store, you keep traversing 
> the ExplodedGraph until you see a load from the same location, then 
> trace that value, etc.  There are many details to get this right, but 
> it could be done, and you wouldn't need to modify the analyzer at all.
>
> This option could be generally useful for several clients interested 
> in "taint" analysis.  Down the line we could build a simpler interface 
> to this information that layers on top of the ExplodedGraph, and have 
> the underlying machinery do all the hard work.  For example, the 
> interface would support the query that for a given Expr evaluated at 
> an ExplodedNode, what is the immediate set of preceding ExplodedNodes 
> (and corresponding Exprs) from which the value of that Expr is 
> derived.  This relation in itself is a graph that could be walked, but 
> it is at a little higher level than just walking the ExplodedGraph 
> directly.
>
> Another option is for us to build support for general "taint" tracking 
> into GRExprEngine and friends, with core transfer function logic doing 
> some of the taint propagation.  This might be useful anyway, but would 
> add more complexity not required by all clients.  It also might reduce 
> the amount of path caching done by the analyzer.
>
> On Sep 29, 2009, at 7:50 AM, Olaf Krzikalla wrote:
>
>> Hi @clang,
>>
>> I need to analyze some (rather simple) data flow at AST level. While I
>> could do it by my own I have the strange feeling that the
>> MemRegionManager already provides a lot of the means I need. However I
>> couldn't figure out how to use it. At the moment it seems to be used for
>> diagnostics only.
>> Here is a code example:
>>
>> struct A { int a, b };
>>
>> void foo(int* sink)
>> {
>>  A temp;
>>  A* ptr = &temp;
>>
>>  temp.a = /*expr*/;
>>  temp.b = /* another_expr */;
>>  *sink = temp.a;   // (1)
>>  *sink = ptr->b;   // (2)
>> }
>>
>> I want to know that at (1) actually the result of expr is written to
>> sink and that at (2) the result of another_expr is written to sink. Can
>> I somehow compute this using clang's static analyzer?
>>
>>
>> Best regards
>> Olaf Krzikalla