[PATCH] D116368: [clang][dataflow] Add transfer function for VarDecl statements

Fri Dec 31 09:41:28 PST 2021

xazax.hun added inline comments.

================
Comment at: clang/lib/Analysis/FlowSensitive/DataflowEnvironment.cpp:88
+    for (const FieldDecl *Field : Type->getAsRecordDecl()->fields()) {
+      FieldLocs.insert({Field, &createStorageLocation(Field->getType())});
+    }
----------------
sgatev wrote:
> xazax.hun wrote:
> > sgatev wrote:
> > > xazax.hun wrote:
> > > > Could this end up creating an overly large state? There might be objects with quite a lot fields but each function would only access a small subset of those. Alternatively, we could attempt to create the representations for fields on demand (this is the approach what the clang static analyzer is using). 
> > > That's a great point, however I don't think initialization on demand plays well with the dataflow algorithm. For example:
> > > 
> > > ```
> > > struct S {
> > >   int Val;
> > > };
> > > 
> > > void target(bool b) {
> > >   // basic block 1:
> > >   S Foo;
> > >   int Bar;
> > >   if (b) {
> > >     // basic block 2:
> > >     Bar = Foo.Val;
> > >   } else {
> > >     // basic block 3:
> > >     Bar = Foo.Val;
> > >   }
> > >   // basic block 4:
> > >   ...
> > > }
> > > ```
> > > In basic block 4 we should be able to infer that the value that is assigned to the storage location for `Bar` is unambiguous. However, since `Foo.Value` isn't created in basic block 1, this means that it's not inherited by basic block 2 and basic block 3. Each of these blocks will end up creating distinct values to assign to the storage location for `Foo.Value` and so in basic block 4 the value for `Bar` will end up being ambiguous.
> > > 
> > > Alternatively, we can do a pass over all statements in all basic blocks before running the analysis to identify which fields are used in the function. We can use this information here to initialize only parts that are being used.
> > > 
> > > What do you think?
> > I am not convinced about this example. I think it should not matter where and when we create the location for Foo.Val, it always should be equivalent to the others, i.e., they should have the same identity regardless their creation. Basically, I think the identity of such a location should be determined by the base object (Foo in this case), and the access path to the subobject. 
> > 
> > This would be a non-trivial change, so I'm ok with deferring this discussion to some follow-up PRs. But I'd love to see some alternatives explored because eagerly evaluating all fields used to bite me in the past in other static analysis tools performance-wise.
> Right. I agree with the comment about storage locations. The issue I see has to do with values that are assigned to the respective storage locations. In general, different values can be assigned to the same storage location in different basic blocks. I don't see a straightforward way to incorporate on-demand initialization while preserving the behavior with respect to values. Perhaps we can achieve this by storing references to parent environments in `Environment` instead of merging their `LocToVal` maps when creating environments for successor basic blocks, but this patch is already pretty big so I'd rather defer that. I do agree that this is an important problem we should solve so I added a FIXME to remind us.
> The issue I see has to do with values that are assigned to the respective storage locations. 

Oh, I see. Generally, `Foo.Val` could be the initial value (representing the lack of knowledge) in both basic blocks, so merging them would be fine. But we potentially want to assign symbols to the unknown values to be able to describe flow constraints and solve them. So the problem is, we want `Foo.Val` to evaluate to the same unknown symbol in both basic blocks, so we know the values at `Foo.Val` are actually the same in those branches. 

While it might not be straightforward, I think there are ways around this problem. As far as I understand, from the values' point of view the only problematic case is when we access a value for the first time. We could either derive the unknown symbols from the location to ensure that we initialize `Foo.Val` to the same symbol in both cases (so the identity of the unknown symbol would describe the fact that this symbols is the initial value for that location). Alternatively, we could have a separate mapping for the initial values and use it to look the right symbol up. The problematic case is the following:

```
struct S {
  int Val;
};

void invalidate(S&);

void target(bool b) {
  // basic block 1:
  S Foo;
  int Bar;
  if (b) {
    // basic block 2:
    Bar = Foo.Val;
  } else {
    // basic block 3:
    invalidate(S);
    Bar = Foo.Val;
  }
  // basic block 4:
  ...
}
```

Whatever we come up with, we should ensure that Bar will **not** have the same value in block 2 and 3 in this case.

> Alternatively, we can do a pass over all statements in all basic blocks before running the analysis to identify which fields are used in the function.

This is not an ideal solution. During the fixed-point iteration we might have more information, e.g. we could resolve some of the pointers/references. So in most of the cases creating `Location`s for fields on demand could be more precise than doing them upfront.

> The issue I see has to do with values that are assigned to the respective storage locations. 

I agree, we should not block this patch and this is better solved separately.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D116368/new/

https://reviews.llvm.org/D116368