[clang] [clang][dataflow] Treat comma operator correctly in `getResultObjectLocation()`. (PR #78427)

Fri Jan 19 00:21:58 PST 2024

martinboehme wrote:

> This change itself looks good to me, but I start to doubt whether we actually need `getResultObjectLocation` the way it is currently implemented. One of the red flags for me seeing some `getResultObjectLocation` calls in `UncheckedOptionalAccessModel.cpp`. I think most (all?) of those calls are redundant, the type of the expression already guarantees that we have the right location. Moreover, I think the propagation should happen internally in the built-in transfers, so authors of user code like `UncheckedOptionalAccessModel` should never need to think about these implementation details.
> 
> I think it might make sense to review all of the calls and after we removed the redundant ones, we might want to reconsider if it makes sense to restrict the use of this function somewhat.
> 
> Let me know if I am missing something.

The missing context here, I think, is that the current implementation of `getResultObjectLocation()` is deficient and I want to fix it (hopefully soon). This fix should be transparent to the existing callers of `getResultObjectLocation()` (but it does mean that we currently call `getResultObjectLocation()` in places where it looks as if we could do something simpler).

Let me expand. Currently, `getResultObjectLocation()` propagates the location from the "original record constructor" up towards any prvalue ancestors of the original record constructor. The location for the "original record constructor" is simply pulled out of thin air by the transfer function for that node.

But this is wrong. For example, consider the following code:

```cxx
  A a = some_condition()? A(1) : A(2, 3);
```

When either the `A(1)` or `A(2, 3)` constructor are called, the `this` pointer that is passed to them should be `&a`. But our current implementation doesn't do this. When visiting `A(1)` and `A(2, 3)`, it independently creates storage locations for them ("out of thin air") and tries to propagate them upwards (towards the root). This poses an unsolvable problem when processing the conditional operator; the "result object location" for the conditional operator should be identical to the result object location for both branches, but this is currently impossible to achieve. Instead, we currently don't really model the conditional operator at all -- we just invent a fresh location for it, and hence we claim that the conditional operator is an "original record constructor". See also [this comment](https://github.com/llvm/llvm-project/blob/c6a6547798ca641b985456997cdf986bb99b0707/clang/lib/Analysis/FlowSensitive/DataflowEnvironment.cpp#L751).

What we should instead be doing is propagating the result object location _downwards_ from the result object to any prvalues that might initialize it. This is described in a FIXME [here](https://github.com/llvm/llvm-project/blob/c6a6547798ca641b985456997cdf986bb99b0707/clang/include/clang/Analysis/FlowSensitive/DataflowEnvironment.h#L332). (By the way, this is what CodeGen does too -- not surprisingly.)

The current implementation is also terrible in other ways. We do actually get the following right:

```cxx
  A a = A(1);
```

If you query `getResultObjectLocation()` for `A(1)`, you'll find it gives you the same location that you get if we call `Environment::getStorageLocation()` for `a`. But the way we arrive here is a terrible hack: When processing the `VarDecl` for `a`, we retrieve the `RecordValue` for its initializer, look at its `RecordValue::getLoc()`, and "steal" this location as the location to use for the storage of the variable. But we can't actually always do this stealing; for example, when processing a member initializer, the location for the member variable has already been determined, and so we need to [perform a copy](https://github.com/llvm/llvm-project/blob/3b54337be54107a5f196bb98dfc858b10ec35b8a/clang/lib/Analysis/FlowSensitive/TypeErasedDataflowAnalysis.cpp#L419) that should not actually be happening.

A nice side effect of fixing `getResultObjectLocation()` to do the right thing is that it will make `RecordValue::getLoc()` obsolete -- and because this is the last remaining reason for `RecordValue` to exist at all, this will allow us to eliminate `RecordValue` entirely, something I've been working towards for a while (see for example [this discussion](https://reviews.llvm.org/D155204#inline-1503204)).

Getting back to your original point: Yes, many of the existing `getResultObjectLocation()` calls are actually redundant -- they could be replaced with code that retrieves the `RecordValue`, then obtains the location from `RecordValue::getLoc()`. But the `getResultObjectLocation()` calls future-proof the code -- they will allow the code to continue to work when we change `getResultObjectLocation()` to propagate locations from result objects down towards the "original record constructors", rather than upwards as we do today.

This has been pretty lengthy, but I hope it gives more context.

https://github.com/llvm/llvm-project/pull/78427