[PATCH] D86874: [analyzer] Fix ArrayBoundCheckerV2 false positive regarding size_t indexer
Balázs Benics via Phabricator via cfe-commits
cfe-commits at lists.llvm.org
Thu Sep 3 07:52:22 PDT 2020
steakhal added a comment.
In D86874#inline-803844 <https://reviews.llvm.org/D86874#inline-803844>, @martong wrote:
> I really feel that this check would have a better place in the implementation of `eval`. This seems really counter-intuitive to do this stuff at the Checker's level. Is there any reason why we can't do this in `eval`?
> `evalBinOpNN` could return with Unknown, and the state should remain unchanged. Am I missing something?
Ah, sort of.
To make everything clear, I have to provide examples, building-blocks, reasoning and stuff, so please don't be mad if it's too long.
**I hope you have a warm cup of tea to read through this - seriously - you will need it!**
Diving in
---------
It is really important to make a clear distinction between evaluating an expression according to the semantics of the //meta-language// and of the //object-language//, because we might get different answers for the same expression.
In fact, `evalBinOpNN` evaluates expressions according to the semantics of the //object-language//.
In our context, the //meta-language// is //mathematics//, and the //object-language// is the semantics of the abstract machine of the C/C++ language.
In mathematics, there is no such thing as overflow or value ranges, unlike in C++.
Let's see an example:
Assuming that `x` is in range `[1,UINT_MAX]`.
`x + 1` will be in range `[2,UINT_MAX+1]` in the //meta-language//.
In the //object-language//, `x + 1` will be in range `[0,0]u[2,UINT_MAX]` or in `[2,UINT_MAX+1]`, depending on whether the type of `x` can represent the incremented value and whether overflow is well-defined.
Another important observation is that non-contiguous ranges (range sets) are not really useful.
Knowing that `x` is in range `[0,0]u[2,UINT_MAX]` doesn't help much if you want to prove that:
`x < 5` holds for all possible interpretations of `x`.
We cannot disprove it either.
So, overflow/underflow can really screw the value ranges, preventing us from evaluating expressions over relational operators.
That is technically accurate, but not satisfactory in all cases - such as in the `ArrayBoundCheckerV2` checker.
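Here is a minimal, self-contained illustration of that gap (ordinary C/C++ code, not checker code; the function name `demo` is made up for illustration):

  #include <climits>
  #include <cstdio>

  void demo(unsigned x) {            // precondition: 1 <= x
    unsigned y = x + 1;              // object-language: wraps for x == UINT_MAX,
                                     // so y is in [0,0]u[2,UINT_MAX]
    // Neither "y < 5" nor its negation holds for every allowed x:
    std::printf("x=%u  y=%u  (y < 5) = %d\n", x, y, y < 5u);
  }

  int main() {
    demo(1u);        // y == 2        -> y < 5 is true
    demo(100u);      // y == 101      -> y < 5 is false
    demo(UINT_MAX);  // y == 0 (wrap) -> y < 5 is true again
  }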
---
Before describing why this is so problematic in this checker, I should give an overview of how the checker works.
Overview of the logic of the ArrayBoundCheckerV2
------------------------------------------------
The checker checks `Location` accesses (aka. pointer dereferences).
We construct the `RegionRawOffsetV2` object of the access (which consists of the //base region// and the symbolic //byte offset// expression of the access).
Then we check, whether we access an element //preceding// the **first valid index** of the //base// memory region.
In other words, we check if the //byte offset// symbolic expression is **less than** 0:
- If YES: Report that we access an out-of-bounds element.
- If NO: Infer the range constraints on the symbol and add the constraint to make this array access valid.
- If MAYBE: Infer and constrain, just as you would do in the previous case.
Then we check whether we access an element //exceeding// the **last valid index** of the memory region.
In other words, we check if the //byte offset// symbolic expression is **greater than or equal to** the //extent// of the region (a toy model of this two-step decision logic follows right after this list):
- If YES: Report the bug...
- If NO: Infer & constrain...
- If MAYBE: Infer & constrain...
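Below is a self-contained toy model of that decision logic, using concrete intervals instead of symbolic states; all names (`Interval`, `Answer`, `checkAccess`) are made up for illustration and are not the checker's API:

  #include <algorithm>
  #include <cstdio>

  struct Interval { long long Min, Max; };   // possible byte offsets of the access
  enum class Answer { Yes, No, Maybe };

  // Is the byte offset below 'Bound'?  Yes: always, No: never, Maybe: sometimes.
  static Answer below(Interval Off, long long Bound) {
    if (Off.Max < Bound) return Answer::Yes;
    if (Off.Min >= Bound) return Answer::No;
    return Answer::Maybe;
  }

  static void checkAccess(Interval ByteOffset, long long ExtentInBytes) {
    // (1) lower bound: byte offset < 0 ?
    switch (below(ByteOffset, 0)) {
    case Answer::Yes:                                   // YES: report
      std::puts("BUG: access precedes the first valid index"); return;
    case Answer::No:
    case Answer::Maybe:                                 // NO / MAYBE: constrain
      ByteOffset.Min = std::max(ByteOffset.Min, 0LL); break;
    }
    // (2) upper bound: byte offset >= extent ?
    switch (below(ByteOffset, ExtentInBytes)) {
    case Answer::No:                                    // YES (offset >= extent): report
      std::puts("BUG: access exceeds the last valid index"); return;
    case Answer::Yes:
    case Answer::Maybe:                                 // NO / MAYBE: constrain
      ByteOffset.Max = std::min(ByteOffset.Max, ExtentInBytes - 1); break;
    }
    std::puts("access may be in bounds; constraints recorded");
  }

  int main() {
    checkAccess({-8, -4}, 40);   // definitely before the buffer -> report
    checkAccess({0, 120}, 40);   // maybe past the end -> constrain to [0, 39]
  }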
However, in the checker, we don't check `byte offset < 0` directly.
We //simplify// the left-hand side of the `byte offset < 0` inequality by subtracting/dividing both sides by the constant `C`, if the `byte offset` is a `SymIntExpr` of the form `SymExpr OP C`, where OP is the addition or multiplication operator.
We do this //simplification// recursively.
In the end, we will have a //symbolic root// expression (call it `RootSym`) and a constant `C'` on the right-hand side of the original relational operator.
So, after the //simplification// process we get the `RootSym < C'` inequality, which we check.
This //simplification// is the //infer// operation in the pseudo-code.
And the //constrain// step is where we assume that the negation of `RootSym < C'` holds.
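A tiny, self-contained model of this recursive folding (the `SymExprToy` type and `simplifyAgainst` are made up for illustration; the real checker works on `SymIntExpr`s):

  #include <cassert>

  struct SymExprToy {
    const SymExprToy *Child;  // nullptr => this is the root symbol
    char Op;                  // '+' or '*' with a constant
    long long Const;
  };

  // Rewrites "Expr < RHS" into "Root < RHS'" using plain (meta-language)
  // arithmetic, and returns RHS' -- the folded constant C'.
  long long simplifyAgainst(const SymExprToy *Expr, long long RHS,
                            const SymExprToy **RootOut) {
    while (Expr->Child) {
      if (Expr->Op == '+') RHS -= Expr->Const;  // (Sym + C) < RHS  =>  Sym < RHS - C
      else                 RHS /= Expr->Const;  // (Sym * C) < RHS  =>  Sym < RHS / C  (C > 0, exact here)
      Expr = Expr->Child;
    }
    *RootOut = Expr;
    return RHS;
  }

  int main() {
    // (((x + 1) * 12) + 24) < 0   ==>   x < -3   (meta-language arithmetic)
    SymExprToy X{nullptr, 0, 0};        // the root symbol "x"
    SymExprToy Add1{&X, '+', 1};
    SymExprToy Mul{&Add1, '*', 12};
    SymExprToy Add2{&Mul, '+', 24};
    const SymExprToy *Root = nullptr;
    long long C = simplifyAgainst(&Add2, 0, &Root);
    assert(Root == &X && C == -3);
  }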
**SPOILER**: This //simplification// process is only **valid** using the semantics of the //meta-language//.
ElementRegions
--------------
Pointer values, which describe a particular element of a memory region, can get quite complex.
They get even more complex when we reinterpret-cast a pointer to a different type.
In such cases, we might wrap the `SubRegion` symbol within an `ElementRegion` with the given target type and offset `0`.
E.g.:

  void foo(int x) {
    int buf[10];
    char *p = (char *)(buf + 2) + 3;
                       // ^-- buf + 2 is 2 * sizeof(int) bytes, which is 8.
    *p = 42;           // Not UB.
    int *q = (int *)p; // unaligned pointer!
    *q = 5;            // UB.
    char *r = (char *)(buf + x) + 3;
    *r = 66;           // Might be safe, if x has a value to make it so.
  }
RegionRawOffsetV2
-----------------
In the previous example, `p` would have the `&Element{buf,11 S64b,char}` `SymbolRegionVal` (`SVal` instance) associated with it.
And `q`: `&Element{Element{buf,11 S64b,char},0 S64b,int}`.
However, the `RegionRawOffsetV2` of both `p` and `q` would be the same: {BaseRegion: `buf`, ByteOffset: `11`}
Therefore, `RegionRawOffsetV2` gives us a unified //view// of the memory we are dealing with.
It is also useful when dealing with pointer aliasing, as in the example with `p` and `q`.
In other cases, where we index with a **symbol**, the `ByteOffset` might contain a complex //symbolic expression//, instead of a `ConcreteInt`.
Eg: the `RegionRawOffsetV2` of `buf[x+1]` will be {BaseRegion: `buf`, ByteOffset: `x+1`}.
It's important to note that the //symbolic expression// represented by the `ByteOffset` is now an expression in the //object-language//, where we have to deal with overflows, underflows, implicit conversions, promotions, and other nasty stuff according to the //abstract machine//.
//Note: I don't know what `RegionRawOffset` is, or what that does.//
ElementRegions, again
---------------------
ElementRegions might be nested, as in this example:
  void foo(int x) {
    int buf[10][3];
    int *p = &buf[1][2];
    clang_analyzer_dump(p);                  // &Element{Element{buf,1 S64b,int [3]},2 S64b,int}
    clang_analyzer_DumpRegionRawOffsetV2(p); // raw_offset_v2{buf,20 S64b}
    int *q = &buf[x+1][6];
    clang_analyzer_dump(q);                  // &Element{Element{buf,(reg_$0<int x>) + 1,int [3]},6 S64b,int}
    clang_analyzer_DumpRegionRawOffsetV2(q); // raw_offset_v2{buf,(((reg_$0<int x>) + 1) * 12) + 24}
  }
In the example, the subscript expressions are expressions in the //object-language//.
So, when we //touch// (evaluate, simplify, reorder, etc.) the `(reg_$0<int x>) + 1`, we should //touch// it just as the //abstract machine// would.
Calculation of the RegionRawOffsetV2
------------------------------------
Let's see how these subscript expressions are used during the calculation of the `RegionRawOffsetV2`:
For multi-dimension arrays we //fold// the index expressions of the nested `ElementRegion`s.
We are doing that by simply calculating the byte-offset of the nested region, and adding the current level's //index*sizeof(Type)//.
So, the `ByteOffset` that we build up will contain mostly multiplications by a constant OR additions of symbolic expressions (like the `x+1` in the example).
During this //folding// operation we use the semantics of the //object-language// in the code <https://github.com/llvm/llvm-project/blob/bda3dd0d986b33c3a327c0ee0eb8ba43aa140699/clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp#L335-L342>.
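For concreteness, here is the fold worked out for `q` above (assuming a 4-byte `int`; the `static_assert`s only document that assumption):

  // Folding &buf[x+1][6] where buf is int[10][3]:
  //   the int[3] level contributes  (x + 1) * sizeof(int[3])  ==  (x + 1) * 12 bytes
  //   the int    level contributes   6      * sizeof(int)     ==   24 bytes
  //   ByteOffset = ((x + 1) * 12) + 24
  static_assert(sizeof(int[3]) == 12, "example assumes a 4-byte int");
  static_assert(6 * sizeof(int) == 24, "example assumes a 4-byte int");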
The resulting `RegionRawOffsetV2` for `p` and `q` will be:

  p: {BaseRegion: `buf`, ByteOffset: `20 S64b`}
  q: {BaseRegion: `buf`, ByteOffset: `(((reg_$0<int x>) + 1) * 12) + 24`}

- `@a`: the `(reg_$0<int x>) + 1` part is an //object-language// expression.
- `@b`: the rest (the `* 12) + 24` part) should be a //meta-language// expression.
SPOILER: This distinction is really important here.
So, we built an expression that we should not evaluate in either of the two worlds alone.
We should not evaluate it using the semantics of the //object-language// since only the `@a` is of that world.
We should model overflow in `@a` if that expression is of //unsigned// type (or signed but with `-fwrapv`, etc.), according to the //abstract machine//.
BUT modeling the possibility of an overflow of the rest of the expression (`@b`) would be a flaw.
This reasoning is the same for the other way around.
---
In general, such //mixed// expressions make no sense at all.
Trying to evaluate one leads to esoteric bugs, just like the bug mentioned in the summary.
Simplify step, again
--------------------
Simplification of the inequality `(((reg_$0<int x>) + 1) * 12) + 24 < 0` (the trailing `* 12) + 24` being the `@b` part)...
Result: `reg_$0<int x> < -3 S64b`, where `reg_$0<int x>` is `RootSym` and `-3 S64b` is the folded constant `C'`.
`#1`: This step is supposed to fold **ONLY** the //meta-language// expression parts (`@b`) of the previously acquired `ByteOffset`.
`#2`: This step requires the expression being simplified to be a //meta-language// expression in order to reorder the constants (i.e. to fold them into the right-hand side's constant).
`#3`: This also requires the right-hand side to be of the //meta-language//.
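A concrete demonstration of why this folding is only sound in the meta-language (plain C/C++ code; the function name is made up):

  #include <climits>
  #include <cstdio>

  void fold_is_unsound(unsigned x) {
    // Object-language: "x + 10 < 5" is satisfiable -- the addition wraps for
    // x in [UINT_MAX-9, UINT_MAX-5].
    // Meta-language folding would rewrite it to "x < -5", which is never true
    // for an unsigned x.  Same expression, two different answers.
    if (x + 10u < 5u)
      std::printf("reachable: x=%u wraps x+10 to %u\n", x, x + 10u);
  }

  int main() { fold_is_unsound(UINT_MAX - 7u); }  // prints: reachable ...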
Cause of the bug
----------------
We don't respect the `#1` and the `#3` requirements of the //simplification// step.
We treat the complete expression under //simplification// as an expression of the //meta-language//.
I'm not changing that here, though I probably should.
Ideally, after //simplification// we should have this inequality: `x+1 < -2`
That way we would not fold any subexpression of the //object-language//, so the `#1` requirement would be preserved.
The right-hand side is an expression of the //meta-language//.
It makes sense to evaluate the `operator<` according to the semantics of the //meta-language//.
But the left-hand side is an expression of the //object-language//.
We need some sort of //logical// conversion here.
If `x+1` is //unsigned//, then there is no way that expression could result in a //negative// value, so it can never be less than `-2`.
Therefore, we proved that no out-of-bounds buffer access can happen //before// the first valid index (0).
Otherwise, we evaluate the inequality according to the //object-language//.
We call the `evalBinOpNN` to accomplish this query.
That function, however, should respect `-fwrapv`, overflow/underflow, and the other nasty things of C++, but I don't care how that is implemented.
The checker's job is done at this point.
Note that if the //inequality// is of the form `x - C < C'` after the //idealized simplification// step, we have to deal with underflow similarly.
I haven't done that in this patch though, since currently we just fold `C` into `C'`...
If the `LHS` is signed, ehm, I would have to think it through - but I'm too tired of writing this...
The proposed fix
----------------
I know that this doesn't fix the root of the problem, but we have to start somewhere :D
Check if the resulting //folded constant// (`C'`) is negative or not.
If it is //negative// and the //root expression// (`RootSym`) has an **unsigned** type, then I state that this inequality doesn't hold for any (unsigned) interpretation of `RootSym`.
So just move on and check the upper-bound...
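A toy version of that guard (the function and parameter names are made up, not the patch itself):

  // If the folded constant C' is negative and RootSym has an unsigned type,
  // "RootSym < C'" can never hold, so no underflow report should be emitted.
  bool lowerBoundPossiblyViolated(bool RootSymIsUnsigned, long long FoldedC) {
    if (RootSymIsUnsigned && FoldedC < 0)
      return false;  // infeasible: move on and check the upper bound
    return true;     // otherwise keep evaluating in the object-language
  }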
---
I'm not sure if this fixes all the false positives, but it certainly removes some of them.
I'm not an expert in //abstract interpretation// and in //formal verification//.
I hope you enjoyed reading this, now you should take a rest too!
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D86874/new/
https://reviews.llvm.org/D86874