[PATCH] D86874: [analyzer] Fix ArrayBoundCheckerV2 false positive regarding size_t indexer

Mon Sep 7 07:11:22 PDT 2020

steakhal added a comment.

In D86874#2256249 <https://reviews.llvm.org/D86874#2256249>, @martong wrote:

>> Calculation of the RegionRawOffsetV2
>> ------------------------------------
>>
>> Let's see how does these subscript expressions used during the calculation of the `RegionRawOffsetV2`:
>> For multi-dimension arrays we //fold// the index expressions of the nested `ElementRegion`s.
>> We are doing that by simply calculating the byte-offset of the nested region, and adding the current level's //index*sizeof(Type)//.
>> So, the `ByteOffset` that we build up will contain mostly multiplications by a constant OR additions of symbolic expressions (like the `x+1` in the example).
>
> We have these lines in the case of handling the ElementRegion:
>
>   if (!index.getAs<NonLoc>())
>     return RegionRawOffsetV2();
>
> So, if the index is symbolic we give up, don't we? So, how could this work with `x+1`? What am I missing here?

In my example, I'm speaking about the case when it's `nonloc::SymbolVal` wrapping the expression `x+1`. So not every `NonLoc` is `ConcreteInt`.

>>   The resulting `RegionRawOffsetV2` for `p`, `q` will be:
>>   p: {BaseRegion: `buf`, ByteOffset: `20 S64b`}
>>   q: {BaseRegion: `buf`, ByteOffset: `(((reg_$0<int x>) + 1) * 12) + 24`}
>>                                         ^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^
>>                                                      |           |
>>   `@a` This is an //object-language// expression. --/            |
>>                                                                 /
>>   `@b` The rest should be //meta-language// expression. -------/
>>
>> SPOILER: This distinction is really important here.
>>
>> So, we made an expression, we should not evaluate in either of the worlds.
>> We should not evaluate it using the semantics of the //object-language// since only the `@a` is of that world.
>> We should model overflow in `@a` if that expression is of //unsigned// type (OR signed but `-fwrapv`...) etc. according to the //abstract machine//.
>> BUT modeling the possibility of an overflow of the rest of the expression (`@b`) would be a flaw.
>
> Why? I'd expect that the value of the whole expression `@a at b` could overflow.

Hm, you might be right about that.
This //simplification// (over the plus operator) is only valid if no overflow can happen.
We could add the necessary constraints during this constant folding operation such as:
 Simplifying the expression `x+1 < 0` into `x<-1` //assuming that// `x <= numeric_limit_max(typeof(x)) - 1`
We could repeat this, while we have a //state// where all such assumptions are true.
I'm looking forward to implementing it.

>> Simplify step, again
>> --------------------
>>
>>   Simplification of the `(((reg_$0<int x>) + 1) * 12) + 24` < `0` inequality...
>>                                                 ^^^^^^^^^^
>>                                                    `@b`
>>   Result:  `reg_$0<int x> < -3 S64b`
>>             ^^^^^^^^^^^^^   ^^^^^^^
>>   `RootSym` --/                |
>>                               /
>>   `C'` ----------------------/
>>
>> `#1`: This step supposed to fold **ONLY** the //meta-language// expression parts (`@b`) of the previously aquired `ByteOffset`.
>> `#2`: This step requires the expression under simplified to be of //meta-language// to be able to reorder the constants. (aka. to fold into the right hand side's constant).
>> `#3`: This also requires the right-hand side to be of the //meta-language//.
>
> Do I understand this correctly, that none of the binary operations can have a symbolic RHS, because that would mean we have a VLA?

I think yes, you are right.
However, as I tried to build an example demonstrating this, it turned out that it does not handle VLAs quite well.

  void foo(int x, int y) {
    int buf[x][y]; // VLA
    clang_analyzer_dump(&buf[1][2]);
    // &Element{Element{buf,1 S64b,int [y]},2 S64b,int}

    clang_analyzer_DumpRegionRawOffsetV2AndSimplify(&buf[1][2]);
    // RawOffsetV2: {BaseRegion: buf, ByteOffset: 8 S64b}
    // simplification result: {RootSymbol: 8 S64b, FoldedConstant: 0 S64b)
  }

The RawOffsetV2 value is wrong. It supposed to have the ByteOffset: `(1 * sizeof(int[y])) + 8`.

>> We treat the complete expression under //simplification// as an expression of the //meta-language//.
>> I'm not changing this, but I probably should though.
>
> Again, I don't understand why you are sure that the value of //whole// expression cannot overflow.
>
>> Ideally, after //simplification// we should have this inequality: `x+1 < -2`
>> That way we would not fold any subexpression of the //object-language//, so the `#1` requirement would be preserved.
>> The right-hand side is of an expression of the //meta-language//.
>> It makes sense, to evaluate the `operator<` as the semantics of the //meta-language//.
>> But the left-hand side is an expression of the //object-language//.
>> We need some sort of //logical// conversion here.
>
> What if the second index is also symbolic? E.g `buf[x+1][y+1]`?
> This would result in `(((reg_$0<int x>) + 1) * 12) + (reg_$1<int y>) + 1)` < `0` . And after simplification, the RHS cannot be interpreted as //meta//.

As the analyzer does not symbolicate such indexing, the pointer value will be `Unknown`, so the //simplification// process does not apply.
At best we can hope to achieve something like this:

  BaseRegion: buf, ByteOffset: (reg_$0<int x>) + (reg_$1<int y>) + 16
  After simplification: (reg_$0<int x>) + (reg_$1<int y>)   <  -16

But to get this, we need some sort of a //canonical// form of expressions having multiple symbolic values.
For now, it seems reasonable to me to squash to the left all symbolic values.

>> Check if the resulting //folded constant// (`C'`) is negative or not.
>
> What happens if the symbolic index is not the most inner index? E.g. `buf[6][x+1]`. Then `C` is no longer constant, and no longer //meta// ...

In this particular case, I would expect the BytesOffset: `(((reg_$0<int x>) + 1) * 4) + 72`, where 72 would be the `C` constant.
This example is interesting from a different standpoint. The //canonical// form of expressions, again.
I think for our purposes `((reg_$0<int x>) * 4) + 76` form would be a better choice since we need a single multiplication and a single addition.
To get this, we need to transform `(Symbol + C) * C'` => `(Symbol * C') + (C * C')`.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D86874/new/

https://reviews.llvm.org/D86874