[PATCH] D86874: [analyzer] Fix ArrayBoundCheckerV2 false positive regarding size_t indexer

Fri Sep 4 01:28:58 PDT 2020

Hi Balázs,

Since reviews.llvm.org is offline, I am sending my comments below, inline.
Thanks for your huge effort in explaining all this!

Overall, I have a feeling that this approach targets only one specific
case, which is fine. But I believe we should think about all the other
possible cases, so we could get rid of other false positives too:
1) In case of multidimensional arrays, there may be a symbolic value in any
dimension.
2) What if  there are more symbolic values in the dimensions.

Cheers,
Gábor

On Thu, Sep 3, 2020 at 4:52 PM Balázs Benics via Phabricator <
reviews at reviews.llvm.org> wrote:

> steakhal added a comment.
>
> In D86874#inline-803844 <https://reviews.llvm.org/D86874#inline-803844>,
> @martong wrote:
>
> > I really feel that this check would have a better place in the
> implementation of `eval`. This seems really counter-intuitive to do this
> stuff at the Checker's level. Is there any reason why we can't do this in
> `eval`?
> > `evalBinOpNN` could return with Unknown, and the state should remain
> unchanged. Am I missing something?
>
> Ah, sort of.
> To make everything clear, I have to provide examples, building-blocks,
> reasoning and stuff, so please don't be mad if it's too long.
> **I hope you have a warm cup of tee to read through this - seriously - you
> will need that!**
>
Man, this requires a warm lunch and then hot cups(!) of coffee. :D

>
> Diving in
> ---------
>
> It is really important to make a clear distinction between evaluating an
> expression according to the semantics of the //meta-language// or of the
> //object-language//, because we might get different answers for the same
> expression.
>
> In fact, `evalBinOpNN` evaluates expressions according to the semantics of
> the //object-language//.
>
> In our context, meta-language is //mathematics//, and the
> //object-language// is the semantics of the abstract machine of the c/c++
> language.
> In mathematics, there is no such thing as overflow or value ranges, unlike
> in C++.
>
> Let's see an example:
> Assuming that `x` is in range `[1,UINT_MAX]`.
> `x + 1` will be in range `[2,ULL_MAX+1]` in the //meta-language//.
> `x + 1` will be in range `[0,0]u[2,UINT_MAX]` or in `[2,UINT_MAX+1]`
> weather the type of `x` is capable representing the (+1) value and the
> overflow is well-defined or not.
>
> Another valuable comment is that non-contiguous ranges (range sets) are
> not really useful.
> Knowing that `x` is in range `[0,0]u[2,UINT_MAX]` doesn't help much if you
> want to prove that:
> `x < 5` holds for all possible interpretations of `x`.
> We can not disproof that either.
>
> So, overflow/underflow can really screw the value ranges, preventing us
> from evaluating expressions over relational operators.
> Which is technically accurate, but not satisfiable in all cases - like in
> the checker `ArrayBoundCheckerV2`.
>
> ---
>
> Before describing why is it so problematic in this checker, I should give
> an overview, how this checker works.
>
> Overview of the logic of the ArrayBoundCheckerV2
> ------------------------------------------------
>
> The checker checks `Location` accesses (aka. pointer dereferences).
> We construct the `RegionRawOffsetV2` object of the access (Which consists
> of the //base region//, and the symbolic //byte offset// expression of the
> access).
>
> Then we check, whether we access an element //preceding// the **first
> valid index** of the //base// memory region.
> In other words, we check if the //byte offset// symbolic expression is
> **less then** 0:
>
> - If YES:   Report that we access an out-of-bounds element.
> - If NO:    Infer the range constraints on the symbol and add the
> constraint to make this array access valid.
> - If MAYBE: Infer and constrain, just as you would do in the previous case.
>
> Then we check, whether we access an element //exceeding// the **last valid
> index** of the memory region.
> In other words, we check if the //byte offset// symbolic expression is
> greater then or equal to the //extent// of the region:
>
> - If YES:   Report the bug...
> - If NO:    Infer & constrain...
> - If MAYBE: Infer & constrain...
>
> However, in the checker, we don't check `byte offset < 0` directly.
> We //simplify// the left part of the `byte offset < 0` inequality by
> substracting/dividing both sides with the constant `C`, if the `byte
> offset` is a `SymintExpr` of `SymExpr OP C` over the plus or multiplication
> operator (OP).
> We do this //simplification// recursively.
> In the end, we will have a //symbolic root// (call it `RootSym`)
> expression and a `C'` constant over the right-hand side of the original
> relational operator.
> So, after the //simplification// process we get the `RootSym < C'`
> inequality, which we check.
>
Just for the record, this is in `getSimplifiedOffsets`.

>
> This //simplification// is the //infer// operation in the pseudo-code.
> And the //constrain// step is where we assume that the negation of
> `RootSym < C'` holds.
>

> **SPOILER**: This //simplification// process is only **valid** using the
> semantics of the //meta-language//.
>
> ElementRegions
> --------------
>
> Pointer values, which describe a particular element of a memory region,
> can get quite complex.
> Even more complex, when we reinterpret cast a pointer to a different type.
> In such cases, we might wrap the `SubRegion` symbol within an
> `ElementRegion` with the given target type and offset `0`.
> Eg:
>
>   void foo(int x) {
>     int buf[10];
>     char *p = (char*)(buf + 2) + 3;
>                             ^-- 2*sizeof(int) in bytes which is 8.
>     *p = 42; // Not UB.
>     int *q = (int*)p; // unaligned pointer!
>     *q = 5; // UB.
>
>     char *r = (char*)(buf + x) + 3;
>     *r = 66; // Might be safe, if x has a value to make it so.
>   }
>
>
>
> RegionRawOffsetV2
> -----------------
>
> In the previous example `p` would have the `&Element{buf,11 S64b,char}`
> `SymbolRegionVal` (`SVal` instance) associated with.
> And `q`: `&Element{Element{buf,11 S64b,char},0 S64b,int}`.
>
> However, the `RegionRawOffsetV2` of both `p` and `q` would be the same:
> {BaseRegion: `buf`, ByteOffset: `11`}
> Therefore,`RegionRawOffsetV2` gives us a unified //view// of the memory we
> are dealing with.
> It is also useful dealing with pointer aliasing, just in the example with
> `p` and `q`.
>
> In other cases, where we index with a **symbol**, the `ByteOffset` might
> contain a complex //symbolic expression//, instead of a `ConcreteInt`.
> Eg: the `RegionRawOffsetV2` of `buf[x+1]` will be {BaseRegion: `buf`,
> ByteOffset: `x+1`}.
> It's important to note that now the //symbolic expression// represented by
> the `ByteOffset`, is an expression in the //object-language//, where we
> have to deal with overflows, underflows, implicit conversions, promotions
> another nasty stuff according to the //abstract machine//.
>
> //Note: I don't know what `RegionRawOffset` is, or what that does.//
>
Seems like that is just a pair of a `MemRegion` plus a concrete int
`Offset`. And this is the return type for
`ElementRegion::getAsArrayOffset()` where we handle only
`nonloc::ConcreteInt`s. So, this is similar to RegionRawOffsetV2, but
without the possibility of the symbolic index value.

>
> ElementRegions, again
> ---------------------
>
> ElementRegions are might be nested as in this example:
>
>   void foo(int x) {
>     int buf[10][3];
>     int *p = &buf[1][2];
>     clang_analyzer_dump(p); // &Element{Element{buf,1 S64b,int [3]},2
> S64b,int}
>     clang_analyzer_DumpRegionRawOffsetV2(p); // raw_offset_v2{buf,20 S64b}
>
>     int *q = &buf[x+1][6];
>     clang_analyzer_dump(q); // &Element{Element{buf,(reg_$0<int x>) +
> 1,int [3]},6 S64b,int}
>     clang_analyzer_DumpRegionRawOffsetV2(q); //
> raw_offset_v2{buf,(((reg_$0<int x>) + 1) * 12) + 24}
>   }
>
> In the example, the subscript expressions are expressions in the
> //object-language//.
> So, when we //touch// (evaluate, simplify, reorder, etc.) the `(reg_$0<int
> x>) + 1` we should //thouch// them just as the //abstract machine// would.
>
> Calculation of the RegionRawOffsetV2
> ------------------------------------
>
> Let's see how does these subscript expressions used during the calculation
> of the `RegionRawOffsetV2`:
> For multi-dimension arrays we //fold// the index expressions of the nested
> `ElementRegion`s.
> We are doing that by simply calculating the byte-offset of the nested
> region, and adding the current level's //index*sizeof(Type)//.
> So, the `ByteOffset` that we build up will contain mostly multiplications
> by a constant OR additions of symbolic expressions (like the `x+1` in the
> example).
>
We have these lines in the case of handling the ElementRegion:
        if (!index.getAs<NonLoc>())
          return RegionRawOffsetV2();
So, if the index is symbolic we give up, don't we? So, how could this work
with `x+1`? What am I missing here?

> During this //folding// operation we use the semantics of the
> //object-language// in the code <
> https://github.com/llvm/llvm-project/blob/bda3dd0d986b33c3a327c0ee0eb8ba43aa140699/clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp#L335-L342
> >.
>
>   The resulting `RegionRawOffsetV2` for `p`, `q` will be:
>   p: {BaseRegion: `buf`, ByteOffset: `20 S64b`}
>   q: {BaseRegion: `buf`, ByteOffset: `(((reg_$0<int x>) + 1) * 12) + 24`}
>                                         ^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^
>                                                      |           |
>   `@a` This is an //object-language// expression. --/            |
>                                                                 /
>   `@b` The rest should be //meta-language// expression. -------/
>
> SPOILER: This distinction is really important here.
>
> So, we made an expression, we should not evaluate in either of the worlds.
> We should not evaluate it using the semantics of the //object-language//
> since only the `@a` is of that world.
> We should model overflow in `@a` if that expression is of //unsigned//
> type (OR signed but `-fwrapv`...) etc. according to the //abstract
> machine//.
> BUT modeling the possibility of an overflow of the rest of the expression
> (`@b`) would be a flaw.
>
Why? I'd expect that the value of the whole expression `@a at b` could
overflow.

>
> This reasoning is the same for the other way around.
>
> ---
>
> In general, such //mixed// expressions make no sense at all.
> Trying to evaluate such would lead to esoteric bugs just like the
> mentioned bug in the summary.
>
> Simplify step, again
> --------------------
>
>   Simplification of the `(((reg_$0<int x>) + 1) * 12) + 24` < `0`
> inequality...
>                                                 ^^^^^^^^^^
>                                                    `@b`
>   Result:  `reg_$0<int x> < -3 S64b`
>             ^^^^^^^^^^^^^   ^^^^^^^
>   `RootSym` --/                |
>                               /
>   `C'` ----------------------/
>
> `#1`: This step supposed to fold **ONLY** the //meta-language// expression
> parts (`@b`) of the previously aquired `ByteOffset`.
> `#2`: This step requires the expression under simplified to be of
> //meta-language// to be able to reorder the constants. (aka. to fold into
> the right hand side's constant).
> `#3`: This also requires the right-hand side to be of the
> //meta-language//.
>
Do I understand this correctly, that none of the binary operations can have
a symbolic RHS, because that would mean we have a VLA?

> Cause of the bug
> ----------------
>
> We don't respect the `#1` and the `#3` requirements of the
> //simplification// step.
>
> We treat the complete expression under //simplification// as an expression
> of the //meta-language//.
> I'm not changing this, but I probably should though.
>
Again, I don't understand why you are sure that the value of //whole//
expression cannot overflow.

>
> Ideally, after //simplification// we should have this inequality: `x+1 <
> -2`
> That way we would not fold any subexpression of the //object-language//,
> so the `#1` requirement would be preserved.
> The right-hand side is of an expression of the //meta-language//.
> It makes sense, to evaluate the `operator<` as the semantics of the
> //meta-language//.
> But the left-hand side is an expression of the //object-language//.
> We need some sort of //logical// conversion here.
>
What if the second index is also symbolic? E.g `buf[x+1][y+1]`?
This would result in `(((reg_$0<int x>) + 1) * 12) + (reg_$1<int y>) + 1)`
< `0` . And after simplification, the RHS cannot be interpreted as //meta//.

>
> If `x+1` is //unsigned// then there is no way that expression could result
> a //negative// value, so it will be always less then `-2`.
> Therefore, we proved that no out-of-bounds buffer access can happen
> //before// the first valid index (0).
>
> Otherwise, we evaluate the inequality according to the //object-language//.
> We call the `evalBinOpNN` to accomplish this query.
> That function, however, should respect the `-fwrapv`, overflow/underflow,
> and other nasty things of C++ but I don't care how that is implemented.
> The checker's job is done at this point.
>
> Note that, if the //inequality// is of the form `x - C < C'` after the
> //ideological-simplification// step, we have to deal with underflow
> similarly.
> In this patch, I haven't done it though. Since currently, we just fold `C`
> into `C'`...
>
> If the `LHS` is signed, ehm, I would have to think it through - but I'm
> too tired of writing this...
>
> The proposed fix
> ----------------
>
> I know that this doesn't fix the root of the problem, but we have to start
> somewhere :D
>
> Check if the resulting //folded constant// (`C'`) is negative or not.

What happens if the symbolic index is not the most inner index? E.g.
`buf[6][x+1]`. Then `C` is no longer constant, and no longer //meta// ...

> If it is //negative// and the type of the //root expression// (`RootSym`)
> has **unsigned** type then I state that this inequality doesn't hold for
> any (unsigned) interpretation of `RootSym`.
> So just move on and check the upper-bound...
>
> ---
>
> I'm not sure if this fixes all the false-positives, but certainly removes
> some of them.
> I'm not an expert in //abstract interpretation// and in //formal
> verification//.
> I hope you enjoyed reading this, now you should take a rest too!
>
>
> Repository:
>   rG LLVM Github Monorepo
>
> CHANGES SINCE LAST ACTION
>   https://reviews.llvm.org/D86874/new/
>
> https://reviews.llvm.org/D86874
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20200904/0a1537c1/attachment-0001.html>