[clang] [analyzer] Enhance array bound checking for `ConstantArrayType` (PR #159357)

Fri Sep 19 00:13:00 PDT 2025

alejandro-alvarez-sonarsource wrote:

> @alejandro-alvarez-sonarsource @steakhal What do you think about the code duplication situation? What are your reasons for proposing this implementation?

I tend to be conservative about existing behavior because

1) It is working!
2) We have been relying on it for long, and even a small divergence could turn into hundreds, if not thousands, of new positives and negatives coming and going on existing projects (changes that unfortunately are not caught by lit tests)
3) Not to mention a slight change in assumptions can cascade to other CSA checkers (both up and downstream) adding and/or removing another big set of raised issues.

Of course this is not news for you!, but this is why I have preferred to keep the surface of the changes small (at the cost of duplication).

Trade-offs, though, and I agree that I have probably lean too far into duplication this time.

> If I understand correctly the "new" type-based bounds checking differs from the "old" region/extent-based logic in the following areas:
> 
> * it performs calculations with an index instead of a byte offset,
> * it uses a different logic to calculate the array size (which will be compared with the index),
> * in multidimensional arrays (and probably other similar situations?) it has a different (more strict) understanding about the beginning of the array,

Correct. It relies purely on the type, even if we know little or nothing about the memory region.

> * it reports slightly different messages (I admit that I didn't analyze the differences in detail).

TBH this started a side effect of using indexes (so now "index" is used instead of "offset" in some issues), because I wanted to stay close to existing reports. Some details like using "subarray" to be more specific came later since now we could do it.

> To achieve these differences the current version of the PR duplicates the following components of the code:
> 
> * the "main method" `performCheck` / `performCheckArrayTypeIndex`,
> * the helper class `StateUpdateReporter` / `StateIndexUpdateReporter`,
> * the various helper functions that produce the `Messages`.
> 
> Among these my rough impression is that the state update handler and message generation helpers are relatively straightforward and they could be unified by introducing a few well-placed additional parameters, boolean flags etc. (or even a BaseStateUpdateReporter class with two subclasses that implement slightly different behavior).

As a first approximation, this is where most of the duplication could be removed, yes.

> In the "main" methods I see that the "old" and "new" approaches begin with completely different logic for calculating the index/offset value, the base region (i.e. the "full array" to which the accessed location is compared) and the array size. However, after that I find it very unfortunate that lots of complex logic (the deeply nested `if`s that check underflow and then overflow, plus past-the-end expressions, taint etc.) is duplicated between the two variants. I hope that it would be possible to factor out a "check whether this index/offset is between zero and this extent value" method which could be shared between the "new" and "old" checks.

Makes sense.

> I see that there are some additional differences in this latter part of the "main" `performCheck` methods, but I feel that these are not _neccessary_ differences:
> 
> * I would guess that the new heuristics for better fake-FAM handling would be valuable in the old code as well (although `getDynamicExtent` has some flexible array member heuristics);
> * I don't immediately see why can the new variant omit my "isObviouslyNonnegative" hack (and if you do have a better solution, then I would be very happy to see it applied in the "old" check as well).

I didn't quite omit it (because it's great at suppressing FP). `getAsCleanArraySubscriptExpr` needs memory regions to be laid out in a particular way ("not modified by pointer arithmetic)", and since the new logic doesn't look at the memory region, but at the type as-is in the AST, I inlined the `E->getIdx()->getType()->isUnsignedIntegerOrEnumerationType()` check.

> Moreover I feel that we could and perhaps should reduce the amount of duplicated code by **removing some parts of the old logic** and keeping only the new approach:

I have to admit this didn't cross my mind.

> * Perhaps we don't need the complex old logic that handles multidimensional arrays by calculating the offset from the beginning of the whole array (by combining multiple `ElementRegion`s) -- IIUC that logic was introduced when this checker was still using `check::Location` and `check::Bind` callbacks that provided less information than the current callbacks.
> * Perhaps we could switch to using indices instead of byte offsets by default -- because this would produce simpler symbolic expressions that would be "more useful" for the rest of the analyzer (e.g. instead of assuming `4 * x < 40` we could assume `x < 10` which can also occur as e.g. an `if` condition).

Can we? How about cases such as this lit test?

```cpp
struct two_bytes convertedArray2(void) {
  // We report this with byte offsets because the offset is not divisible by the element size.
  struct two_bytes a = {0, 0};
  char *p = (char*)&a;
  return *((struct two_bytes*)(p + 7));
  // expected-warning at -1 {{Out of bound access to memory after the end of 'a'}}
  // expected-note at -2 {{Access of 'a' at byte offset 7, while it holds only 2 bytes}}
}
```

> 
> In this review I'll try to be as constructive as possible and I'm open to investing work into refactoring this checker to get an end result that provides the features that you need with as little code duplication as possible. (Edit: I can also pragmatically accept some code duplication if that yields the overall best code.)
> 
> Note that I'm also planning significant changes in this file: I'm planning to reimplement `alpha.security.ReturnPtrRange` as a separate `CheckerFrontend` of `ArrayBound` (which would become a checker family) because `ReturnPtrRange` and `ArrayBound` check the exactly same thing (is this location in bounds?) in two different situations (returned pointer value vs accessed location). (In fact, I started to work on this change two days ago, but I'll probably need to restart that work because this PR affects lots of code within `ArrayBoundChecker.cpp`.) On a longer time horizon, I'm also planning to use the `ArrayBound` logic as a "backend" for the `CStringOutOfBound` checker, so perhaps we might also think about separating some core logic into a library-like bounds checking source file.

It's quite a coincidence (or perhaps not!). I am also working in parallel in `PointerArithChecker` which is turning out to be very similar too (except it raises the moment the pointer is formed, and that one-after-the-end is still ok, as long as it is not derefenced). So indeed a "core bound checker" library sounds very interesting.

> In fact, I started to work on this change two days ago, but I'll probably need to restart that work because this PR affects lots of code within `ArrayBoundChecker.cpp`.

Sorry 😅  I am thinking, maybe you go ahead with your work on those changes, since you have already progressed. I don't think I have the bandwidth to move fast enough adapting this PR as not to block you for a few weeks. I am happy to revisit these changes and rebase (or even redo) them on top of yours.

And, by the way, I would also be happy to review those changes once you submit them.

https://github.com/llvm/llvm-project/pull/159357