[PATCH] D86874: [analyzer] Fix ArrayBoundCheckerV2 false positive regarding size_t indexer
Balázs Benics via Phabricator via cfe-commits
cfe-commits at lists.llvm.org
Thu Sep 3 07:52:22 PDT 2020
steakhal added a comment.
In D86874#inline-803844 <https://reviews.llvm.org/D86874#inline-803844>, @martong wrote:
> I really feel that this check would have a better place in the implementation of `eval`. This seems really counter-intuitive to do this stuff at the Checker's level. Is there any reason why we can't do this in `eval`?
> `evalBinOpNN` could return with Unknown, and the state should remain unchanged. Am I missing something?
Ah, sort of.
To make everything clear, I have to provide examples, building-blocks, reasoning and stuff, so please don't be mad if it's too long.
**I hope you have a warm cup of tea to read through this - seriously - you will need it!**
Diving in
---------
It is really important to make a clear distinction between evaluating an expression according to the semantics of the //meta-language// and of the //object-language//, because we might get different answers for the same expression.
In fact, `evalBinOpNN` evaluates expressions according to the semantics of the //object-language//.
In our context, the //meta-language// is //mathematics//, and the //object-language// is the semantics of the abstract machine of the C/C++ language.
In mathematics, there is no such thing as overflow or value ranges, unlike in C++.
Let's see an example:
Assuming that `x` is in range `[1,UINT_MAX]`.
`x + 1` will be in range `[2,UINT_MAX+1]` in the //meta-language//.
In the //object-language//, `x + 1` will be in range `[0,0]u[2,UINT_MAX]` or in `[2,UINT_MAX+1]`, depending on whether the type of `x` can represent the incremented value and whether overflow is well-defined.
Another important observation is that non-contiguous ranges (range sets) are not really useful.
Knowing that `x` is in range `[0,0]u[2,UINT_MAX]` doesn't help much if you want to prove that:
`x < 5` holds for all possible interpretations of `x`.
We cannot disprove it either.
So, overflow/underflow can really screw the value ranges, preventing us from evaluating expressions over relational operators.
That is technically accurate, but not satisfactory in all cases - such as in the `ArrayBoundCheckerV2` checker.
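Here is a minimal, self-contained illustration of that gap (ordinary C/C++ code, not checker code; the function name `demo` is made up for illustration):

  #include <climits>
  #include <cstdio>

  void demo(unsigned x) {            // precondition: 1 <= x
    unsigned y = x + 1;              // object-language: wraps for x == UINT_MAX,
                                     // so y is in [0,0]u[2,UINT_MAX]
    // Neither "y < 5" nor its negation holds for every allowed x:
    std::printf("x=%u  y=%u  (y < 5) = %d\n", x, y, y < 5u);
  }

  int main() {
    demo(1u);        // y == 2        -> y < 5 is true
    demo(100u);      // y == 101      -> y < 5 is false
    demo(UINT_MAX);  // y == 0 (wrap) -> y < 5 is true again
  }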
---
Before describing why this is so problematic in this checker, I should give an overview of how the checker works.
Overview of the logic of the ArrayBoundCheckerV2
------------------------------------------------
The checker checks `Location` accesses (aka. pointer dereferences).
We construct the `RegionRawOffsetV2` object of the access (which consists of the //base region// and the symbolic //byte offset// expression of the access).
Then we check, whether we access an element //preceding// the **first valid index** of the //base// memory region.
In other words, we check if the //byte offset// symbolic expression is **less than** 0:
- If YES: Report that we access an out-of-bounds element.
- If NO: Infer the range constraints on the symbol and add the constraint to make this array access valid.
- If MAYBE: Infer and constrain, just as you would do in the previous case.
Then we check whether we access an element //exceeding// the **last valid index** of the memory region.
In other words, we check if the //byte offset// symbolic expression is **greater than or equal to** the //extent// of the region (a toy model of this two-step decision logic follows right after this list):
- If YES: Report the bug...
- If NO: Infer & constrain...
- If MAYBE: Infer & constrain...
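Below is a self-contained toy model of that decision logic, using concrete intervals instead of symbolic states; all names (`Interval`, `Answer`, `checkAccess`) are made up for illustration and are not the checker's API:

  #include <algorithm>
  #include <cstdio>

  struct Interval { long long Min, Max; };   // possible byte offsets of the access
  enum class Answer { Yes, No, Maybe };

  // Is the byte offset below 'Bound'?  Yes: always, No: never, Maybe: sometimes.
  static Answer below(Interval Off, long long Bound) {
    if (Off.Max < Bound) return Answer::Yes;
    if (Off.Min >= Bound) return Answer::No;
    return Answer::Maybe;
  }

  static void checkAccess(Interval ByteOffset, long long ExtentInBytes) {
    // (1) lower bound: byte offset < 0 ?
    switch (below(ByteOffset, 0)) {
    case Answer::Yes:                                   // YES: report
      std::puts("BUG: access precedes the first valid index"); return;
    case Answer::No:
    case Answer::Maybe:                                 // NO / MAYBE: constrain
      ByteOffset.Min = std::max(ByteOffset.Min, 0LL); break;
    }
    // (2) upper bound: byte offset >= extent ?
    switch (below(ByteOffset, ExtentInBytes)) {
    case Answer::No:                                    // YES (offset >= extent): report
      std::puts("BUG: access exceeds the last valid index"); return;
    case Answer::Yes:
    case Answer::Maybe:                                 // NO / MAYBE: constrain
      ByteOffset.Max = std::min(ByteOffset.Max, ExtentInBytes - 1); break;
    }
    std::puts("access may be in bounds; constraints recorded");
  }

  int main() {
    checkAccess({-8, -4}, 40);   // definitely before the buffer -> report
    checkAccess({0, 120}, 40);   // maybe past the end -> constrain to [0, 39]
  }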
However, in the checker, we don't check `byte offset < 0` directly.
We //simplify// the left-hand side of the `byte offset < 0` inequality by subtracting/dividing both sides by the constant `C`, if the `byte offset` is a `SymIntExpr` of the form `SymExpr OP C`, where OP is the addition or multiplication operator.
We do this //simplification// recursively.
In the end, we will have a //symbolic root// expression (call it `RootSym`) and a constant `C'` on the right-hand side of the original relational operator.
So, after the //simplification// process we get the `RootSym < C'` inequality, which we check.
This //simplification// is the //infer// operation in the pseudo-code.
And the //constrain// step is where we assume that the negation of `RootSym < C'` holds.
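A tiny, self-contained model of this recursive folding (the `SymExprToy` type and `simplifyAgainst` are made up for illustration; the real checker works on `SymIntExpr`s):

  #include <cassert>

  struct SymExprToy {
    const SymExprToy *Child;  // nullptr => this is the root symbol
    char Op;                  // '+' or '*' with a constant
    long long Const;
  };

  // Rewrites "Expr < RHS" into "Root < RHS'" using plain (meta-language)
  // arithmetic, and returns RHS' -- the folded constant C'.
  long long simplifyAgainst(const SymExprToy *Expr, long long RHS,
                            const SymExprToy **RootOut) {
    while (Expr->Child) {
      if (Expr->Op == '+') RHS -= Expr->Const;  // (Sym + C) < RHS  =>  Sym < RHS - C
      else                 RHS /= Expr->Const;  // (Sym * C) < RHS  =>  Sym < RHS / C  (C > 0, exact here)
      Expr = Expr->Child;
    }
    *RootOut = Expr;
    return RHS;
  }

  int main() {
    // (((x + 1) * 12) + 24) < 0   ==>   x < -3   (meta-language arithmetic)
    SymExprToy X{nullptr, 0, 0};        // the root symbol "x"
    SymExprToy Add1{&X, '+', 1};
    SymExprToy Mul{&Add1, '*', 12};
    SymExprToy Add2{&Mul, '+', 24};
    const SymExprToy *Root = nullptr;
    long long C = simplifyAgainst(&Add2, 0, &Root);
    assert(Root == &X && C == -3);
  }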
**SPOILER**: This //simplification// process is only **valid** using the semantics of the //meta-language//.
ElementRegions
--------------
Pointer values, which describe a particular element of a memory region, can get quite complex.
They get even more complex when we reinterpret-cast a pointer to a different type.
In such cases, we might wrap the `SubRegion` symbol within an `ElementRegion` with the given target type and offset `0`.
E.g.:

  void foo(int x) {
    int buf[10];
    char *p = (char *)(buf + 2) + 3;
                       // ^-- buf + 2 is 2 * sizeof(int) bytes, which is 8.
    *p = 42;           // Not UB.
    int *q = (int *)p; // unaligned pointer!
    *q = 5;            // UB.
    char *r = (char *)(buf + x) + 3;
    *r = 66;           // Might be safe, if x has a value to make it so.
  }
RegionRawOffsetV2
-----------------
In the previous example, `p` would have the `&Element{buf,11 S64b,char}` `SymbolRegionVal` (`SVal` instance) associated with it.
And `q`: `&Element{Element{buf,11 S64b,char},0 S64b,int}`.
However, the `RegionRawOffsetV2` of both `p` and `q` would be the same: {BaseRegion: `buf`, ByteOffset: `11`}
Therefore, `RegionRawOffsetV2` gives us a unified //view// of the memory we are dealing with.
It is also useful when dealing with pointer aliasing, as in the example with `p` and `q`.
In other cases, where we index with a **symbol**, the `ByteOffset` might contain a complex //symbolic expression//, instead of a `ConcreteInt`.
Eg: the `RegionRawOffsetV2` of `buf[x+1]` will be {BaseRegion: `buf`, ByteOffset: `x+1`}.
It's important to note that the //symbolic expression// represented by the `ByteOffset` is now an expression in the //object-language//, where we have to deal with overflows, underflows, implicit conversions, promotions, and other nasty stuff according to the //abstract machine//.
//Note: I don't know what `RegionRawOffset` is, or what that does.//
ElementRegions, again
---------------------
ElementRegions might be nested, as in this example:
  void foo(int x) {
    int buf[10][3];
    int *p = &buf[1][2];
    clang_analyzer_dump(p);                  // &Element{Element{buf,1 S64b,int [3]},2 S64b,int}
    clang_analyzer_DumpRegionRawOffsetV2(p); // raw_offset_v2{buf,20 S64b}
    int *q = &buf[x+1][6];
    clang_analyzer_dump(q);                  // &Element{Element{buf,(reg_$0<int x>) + 1,int [3]},6 S64b,int}
    clang_analyzer_DumpRegionRawOffsetV2(q); // raw_offset_v2{buf,(((reg_$0<int x>) + 1) * 12) + 24}
  }
In the example, the subscript expressions are expressions in the //object-language//.
So, when we //touch// (evaluate, simplify, reorder, etc.) the `(reg_$0<int x>) + 1`, we should //touch// it just as the //abstract machine// would.
Calculation of the RegionRawOffsetV2
------------------------------------
Let's see how these subscript expressions are used during the calculation of the `RegionRawOffsetV2`:
For multi-dimension arrays we //fold// the index expressions of the nested `ElementRegion`s.
We are doing that by simply calculating the byte-offset of the nested region, and adding the current level's //index*sizeof(Type)//.
So, the `ByteOffset` that we build up will contain mostly multiplications by a constant OR additions of symbolic expressions (like the `x+1` in the example).
During this //folding// operation we use the semantics of the //object-language// in the code <https://github.com/llvm/llvm-project/blob/bda3dd0d986b33c3a327c0ee0eb8ba43aa140699/clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp#L335-L342>.
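For concreteness, here is the fold worked out for `q` above (assuming a 4-byte `int`; the `static_assert`s only document that assumption):

  // Folding &buf[x+1][6] where buf is int[10][3]:
  //   the int[3] level contributes  (x + 1) * sizeof(int[3])  ==  (x + 1) * 12 bytes
  //   the int    level contributes   6      * sizeof(int)     ==   24 bytes
  //   ByteOffset = ((x + 1) * 12) + 24
  static_assert(sizeof(int[3]) == 12, "example assumes a 4-byte int");
  static_assert(6 * sizeof(int) == 24, "example assumes a 4-byte int");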
The resulting `RegionRawOffsetV2` for `p` and `q` will be:

  p: {BaseRegion: `buf`, ByteOffset: `20 S64b`}
  q: {BaseRegion: `buf`, ByteOffset: `(((reg_$0<int x>) + 1) * 12) + 24`}

- `@a`: the `(reg_$0<int x>) + 1` part is an //object-language// expression.
- `@b`: the rest (the `* 12) + 24` part) should be a //meta-language// expression.
SPOILER: This distinction is really important here.
So, we built an expression that we should not evaluate in either of the two worlds alone.
We should not evaluate it using the semantics of the //object-language// since only the `@a` is of that world.
We should model overflow in `@a` if that expression is of //unsigned// type (or signed but with `-fwrapv`, etc.), according to the //abstract machine//.
BUT modeling the possibility of an overflow of the rest of the expression (`@b`) would be a flaw.
This reasoning is the same for the other way around.
---
In general, such //mixed// expressions make no sense at all.
Trying to evaluate one leads to esoteric bugs, just like the bug mentioned in the summary.
Simplify step, again
--------------------
Simplification of the inequality `(((reg_$0<int x>) + 1) * 12) + 24 < 0` (the trailing `* 12) + 24` being the `@b` part)...
Result: `reg_$0<int x> < -3 S64b`, where `reg_$0<int x>` is `RootSym` and `-3 S64b` is the folded constant `C'`.
`#1`: This step is supposed to fold **ONLY** the //meta-language// expression parts (`@b`) of the previously acquired `ByteOffset`.
`#2`: This step requires the expression being simplified to be a //meta-language// expression in order to reorder the constants (i.e. to fold them into the right-hand side's constant).
`#3`: This also requires the right-hand side to be of the //meta-language//.
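A concrete demonstration of why this folding is only sound in the meta-language (plain C/C++ code; the function name is made up):

  #include <climits>
  #include <cstdio>

  void fold_is_unsound(unsigned x) {
    // Object-language: "x + 10 < 5" is satisfiable -- the addition wraps for
    // x in [UINT_MAX-9, UINT_MAX-5].
    // Meta-language folding would rewrite it to "x < -5", which is never true
    // for an unsigned x.  Same expression, two different answers.
    if (x + 10u < 5u)
      std::printf("reachable: x=%u wraps x+10 to %u\n", x, x + 10u);
  }

  int main() { fold_is_unsound(UINT_MAX - 7u); }  // prints: reachable ...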
Cause of the bug
----------------
We don't respect the `#1` and the `#3` requirements of the //simplification// step.
We treat the complete expression under //simplification// as an expression of the //meta-language//.
I'm not changing that here, though I probably should.
Ideally, after //simplification// we should have this inequality: `x+1 < -2`
That way we would not fold any subexpression of the //object-language//, so the `#1` requirement would be preserved.
The right-hand side is an expression of the //meta-language//.
It makes sense to evaluate the `operator<` according to the semantics of the //meta-language//.
But the left-hand side is an expression of the //object-language//.
We need some sort of //logical// conversion here.
If `x+1` is //unsigned//, then there is no way that expression could result in a //negative// value, so it can never be less than `-2`.
Therefore, we proved that no out-of-bounds buffer access can happen //before// the first valid index (0).
Otherwise, we evaluate the inequality according to the //object-language//.
We call the `evalBinOpNN` to accomplish this query.
That function, however, should respect `-fwrapv`, overflow/underflow, and the other nasty things of C++, but I don't care how that is implemented.
The checker's job is done at this point.
Note that if the //inequality// is of the form `x - C < C'` after the //idealized simplification// step, we have to deal with underflow similarly.
I haven't done that in this patch though, since currently we just fold `C` into `C'`...
If the `LHS` is signed, ehm, I would have to think it through - but I'm too tired of writing this...
The proposed fix
----------------
I know that this doesn't fix the root of the problem, but we have to start somewhere :D
Check if the resulting //folded constant// (`C'`) is negative or not.
If it is //negative// and the //root expression// (`RootSym`) has an **unsigned** type, then I state that this inequality doesn't hold for any (unsigned) interpretation of `RootSym`.
So just move on and check the upper-bound...
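A toy version of that guard (the function and parameter names are made up, not the patch itself):

  // If the folded constant C' is negative and RootSym has an unsigned type,
  // "RootSym < C'" can never hold, so no underflow report should be emitted.
  bool lowerBoundPossiblyViolated(bool RootSymIsUnsigned, long long FoldedC) {
    if (RootSymIsUnsigned && FoldedC < 0)
      return false;  // infeasible: move on and check the upper bound
    return true;     // otherwise keep evaluating in the object-language
  }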
---
I'm not sure if this fixes all the false positives, but it certainly removes some of them.
I'm not an expert in //abstract interpretation// and in //formal verification//.
I hope you enjoyed reading this, now you should take a rest too!
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D86874/new/
https://reviews.llvm.org/D86874