[PATCH] D150446: [analyzer] Check ArraySubscriptExprs in ArrayBoundCheckerV2

Artem Dergachev via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Thu May 25 18:31:27 PDT 2023


NoQ added a comment.

> By the way, I'm fed up with the hack that ElementRegion is used for three separate things ("real" array indexing, casts and pointer arithmetic). To fix this I'm thinking about introducing a subclass hierarchy where a base class `ElementLikeRegion` has three subclasses:
>
> - `ElementRegion` represents the smaller memory area of one element in an array,
> - `CastRegion` represents the same memory area, but with a different type, and
> - `OffsetRegion` represents the same memory area, but with a different starting point.
>
> Most old references to `ElementRegion` could be replaced by references to `ElementLikeRegion`, but functions like `stripCasts()` would be able to distinguish between the subclasses and do the intuitive thing.
>
> What do you think about this idea? Do you see any problem with it?
>
> (By the way I'm not satisfied with these quickly picked class names -- feel free to suggest better ones!)

My usual response to this is, "region types were a mistake". Every single time the program relies on a region's type, it should actually rely either on the AST type behind the program point, or on the dynamic type map if runtime types are of interest. I believe the ideal solution is to remove all non-base regions (`ElementRegion`, `FieldRegion`, `CXXBaseObjectRegion`, `CXXDerivedObjectRegion`) entirely, and instead represent all single-location values as (//base region//, //offset//) pairs (where //offset// can be an arbitrary `SVal` of an appropriate numeric type), and represent regions as (//base region//, //begin offset//, //end offset//) triples.

So for example the `SymbolRegionValue{FieldRegion{VarRegion}}` object `reg_$0<int var->field>` would be turned into a `SymbolRegionValue{VarRegion, Offset}` object `reg_$0<int var @+32>` where `32` is the bit offset of field `field` in the struct. We could preserve the end offset here as well, but the `int` part already tells us that it's `64` or something like that. Funny type punning should probably be handled by the surrounding expression instead; here we just need to capture what the memory read looked like when it produced that symbol.

Basically do whatever `RegionStore` already does right, everywhere else.

I have more thoughts/examples at https://github.com/llvm/llvm-project/issues/42709

I don't know whether my solution is easier to implement than yours. Mine yields a much simpler system in the end (whereas yours increases complexity), but in my case the changes can be pretty dramatic. Both solutions will be pretty hard, because you'll have to untangle a lot of code built specifically to deal with the existing artificial ambiguities. One good thing about my solution is that you'd at least be able to consult runtime behavior as the ultimate source of truth; region types aren't part of the runtime behavior, so there's no right or wrong way to handle them, and they'll always be a source of confusion. But if I were to guess, I'd say your solution is probably more realistic to implement on a short time frame, and you might be able to make the work somewhat incremental as well, by introducing the new regions in more and more cases.

> I'm replacing it with an implementation based on `check::PreStmt<ArraySubscriptExpr>`

I think this is a great call; there's no fundamental reason why this would be more precise, but this tightly ties the problem to source code (as opposed to low-level memory layout), which allows us to shape the entire analysis differently and communicate with the user more efficiently.

One straightforward implication is that this way we avoid dealing with multiplication/division when calculating offsets. This is particularly valuable when the offset is symbolic, where the existing decomposition into `RegionOffset` basically gives up. Of course we could fix `RegionOffset` to correctly compute a symbolic offset value instead, with all the multiplications and divisions over symbolic expressions; then the constraint solver would need to solve those multiplications/divisions. We could also improve the constraint solver to handle them better, but if we can avoid this entirely, why not? Then, after the solver gives us the answer, we have to explain that answer to the user. Again, this isn't unsolvable, but the explanation would be much more convoluted in terms of symbolic byte offsets than in terms of the actual source-level array index.

Now, symbolic offsets are actually extremely important, because I believe this is pretty much the only case where we can reach a good false positive rate. A typical loop over an array is modeled by assigning the loop counter the concrete values `0`, `1`, `2`, `3` and simulating the loop. This is the root cause of many false positives, because the assumption that the loop can definitely be executed exactly `0` (or exactly `1`, or exactly `2`, or exactly `3`) times is deeply incorrect. However, when the index is symbolic, it indicates that //the index is not a loop counter//, so these false positives aren't going to hurt us.

Finally, when it comes to reclaiming our chance of finding bugs in loops, I think you should try to take your patch to the next level: don't even warn at `check::PreStmt<ArraySubscriptExpr>`, warn at `check::PreStmt<ForStmt>`!! Where `check::PreStmt<ForStmt>` is not actually a thing we support yet, but you get the point; simulate it with `check::BranchCondition` or something. The point is to analyze the loop ahead of time, syntactically or in a flow-sensitive manner (just like the loop unrolling facility figures out whether a loop is worth unrolling, or how clang-tidy's `bugprone-infinite-loop` finds infinite loops), identify access patterns to the array in the loop, reach the conclusion that the array will be accessed by the loop up to a certain bound, compute the symbolic value of that bound, and verify that it's within array bounds //immediately//, before entering the loop! In this case it doesn't matter how imprecise our loop modeling is; you'll still be able to reduce the concrete loop-unrolling problem to a symbolic constraint problem, which we have a much better chance of solving correctly beyond toy examples.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150446/new/

https://reviews.llvm.org/D150446
