<div dir="ltr"><div>Hi Balázs,</div><div><br></div><div>Since <a href="http://reviews.llvm.org">reviews.llvm.org</a> is offline, I am sending my comments below, inline. Thanks for your huge effort in explaining all this!</div><div><br></div><div><div>Overall, I have a feeling that this approach targets only one specific case, which is fine. But I believe we should think about all the other possible cases, so we could get rid of other false positives too:</div><div>1) In case of multidimensional arrays, there may be a symbolic value in any dimension.</div><div>2) What if  there are more symbolic values in the dimensions.</div><div><br></div><div>Cheers,</div><div>Gábor</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Sep 3, 2020 at 4:52 PM Balázs Benics via Phabricator <<a href="mailto:reviews@reviews.llvm.org">reviews@reviews.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">steakhal added a comment.<br>

<br>

In D86874#inline-803844 <<a href="https://reviews.llvm.org/D86874#inline-803844" rel="noreferrer" target="_blank">https://reviews.llvm.org/D86874#inline-803844</a>>, @martong wrote:<br>

<br>

> I really feel that this check would have a better place in the implementation of `eval`. This seems really counter-intuitive to do this stuff at the Checker's level. Is there any reason why we can't do this in `eval`?<br>

> `evalBinOpNN` could return with Unknown, and the state should remain unchanged. Am I missing something?<br>

<br>

Ah, sort of.<br>

To make everything clear, I have to provide examples, building-blocks, reasoning and stuff, so please don't be mad if it's too long.<br>

**I hope you have a warm cup of tee to read through this - seriously - you will need that!**<br></blockquote><div>Man, this requires a warm lunch and then hot cups(!) of coffee. :D</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

Diving in<br>

---------<br>

<br>

It is really important to make a clear distinction between evaluating an expression according to the semantics of the //meta-language// or of the //object-language//, because we might get different answers for the same expression.<br>

<br>

In fact, `evalBinOpNN` evaluates expressions according to the semantics of the //object-language//.<br>

<br>

In our context, meta-language is //mathematics//, and the //object-language// is the semantics of the abstract machine of the c/c++ language.<br>

In mathematics, there is no such thing as overflow or value ranges, unlike in C++.<br>

<br>

Let's see an example:<br>

Assuming that `x` is in range `[1,UINT_MAX]`.<br>

`x + 1` will be in range `[2,ULL_MAX+1]` in the //meta-language//.<br>

`x + 1` will be in range `[0,0]u[2,UINT_MAX]` or in `[2,UINT_MAX+1]` weather the type of `x` is capable representing the (+1) value and the overflow is well-defined or not.<br>

<br>

Another valuable comment is that non-contiguous ranges (range sets) are not really useful.<br>

Knowing that `x` is in range `[0,0]u[2,UINT_MAX]` doesn't help much if you want to prove that:<br>

`x < 5` holds for all possible interpretations of `x`.<br>

We can not disproof that either.<br>

<br>

So, overflow/underflow can really screw the value ranges, preventing us from evaluating expressions over relational operators.<br>

Which is technically accurate, but not satisfiable in all cases - like in the checker `ArrayBoundCheckerV2`.<br>

<br>

---<br>

<br>

Before describing why is it so problematic in this checker, I should give an overview, how this checker works.<br>

<br>

Overview of the logic of the ArrayBoundCheckerV2<br>

------------------------------------------------<br>

<br>

The checker checks `Location` accesses (aka. pointer dereferences).<br>

We construct the `RegionRawOffsetV2` object of the access (Which consists of the //base region//, and the symbolic //byte offset// expression of the access).<br>

<br>

Then we check, whether we access an element //preceding// the **first valid index** of the //base// memory region.<br>

In other words, we check if the //byte offset// symbolic expression is **less then** 0:<br>

<br>

- If YES:   Report that we access an out-of-bounds element.<br>

- If NO:    Infer the range constraints on the symbol and add the constraint to make this array access valid.<br>

- If MAYBE: Infer and constrain, just as you would do in the previous case.<br>

<br>

Then we check, whether we access an element //exceeding// the **last valid index** of the memory region.<br>

In other words, we check if the //byte offset// symbolic expression is greater then or equal to the //extent// of the region:<br>

<br>

- If YES:   Report the bug...<br>

- If NO:    Infer & constrain...<br>

- If MAYBE: Infer & constrain...<br>

<br>

However, in the checker, we don't check `byte offset < 0` directly.<br>

We //simplify// the left part of the `byte offset < 0` inequality by substracting/dividing both sides with the constant `C`, if the `byte offset` is a `SymintExpr` of `SymExpr OP C` over the plus or multiplication operator (OP).<br>

We do this //simplification// recursively.<br>

In the end, we will have a //symbolic root// (call it `RootSym`) expression and a `C'` constant over the right-hand side of the original relational operator.<br>

So, after the //simplification// process we get the `RootSym < C'` inequality, which we check.<br></blockquote><div>Just for the record, this is in `getSimplifiedOffsets`.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>

This //simplification// is the //infer// operation in the pseudo-code.<br>

And the //constrain// step is where we assume that the negation of `RootSym < C'` holds.<br></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>

**SPOILER**: This //simplification// process is only **valid** using the semantics of the //meta-language//.<br>

<br>

ElementRegions<br>

--------------<br>

<br>

Pointer values, which describe a particular element of a memory region, can get quite complex.<br>

Even more complex, when we reinterpret cast a pointer to a different type.<br>

In such cases, we might wrap the `SubRegion` symbol within an `ElementRegion` with the given target type and offset `0`.<br>

Eg:<br>

<br>

  void foo(int x) {<br>

    int buf[10];<br>

    char *p = (char*)(buf + 2) + 3;<br>

                            ^-- 2*sizeof(int) in bytes which is 8.<br>

    *p = 42; // Not UB.<br>

    int *q = (int*)p; // unaligned pointer!<br>

    *q = 5; // UB.<br>

<br>

    char *r = (char*)(buf + x) + 3;<br>

    *r = 66; // Might be safe, if x has a value to make it so.<br>

  }<br>

<br>

<br>

<br>

RegionRawOffsetV2<br>

-----------------<br>

<br>

In the previous example `p` would have the `&Element{buf,11 S64b,char}` `SymbolRegionVal` (`SVal` instance) associated with.<br>

And `q`: `&Element{Element{buf,11 S64b,char},0 S64b,int}`.<br>

<br>

However, the `RegionRawOffsetV2` of both `p` and `q` would be the same: {BaseRegion: `buf`, ByteOffset: `11`}<br>

Therefore,`RegionRawOffsetV2` gives us a unified //view// of the memory we are dealing with.<br>

It is also useful dealing with pointer aliasing, just in the example with `p` and `q`.<br>

<br>

In other cases, where we index with a **symbol**, the `ByteOffset` might contain a complex //symbolic expression//, instead of a `ConcreteInt`.<br>

Eg: the `RegionRawOffsetV2` of `buf[x+1]` will be {BaseRegion: `buf`, ByteOffset: `x+1`}.<br>

It's important to note that now the //symbolic expression// represented by the `ByteOffset`, is an expression in the //object-language//, where we have to deal with overflows, underflows, implicit conversions, promotions another nasty stuff according to the //abstract machine//.<br>

<br>

//Note: I don't know what `RegionRawOffset` is, or what that does.//<br></blockquote><div>Seems like that is just a pair of a `MemRegion` plus a concrete int `Offset`. And this is the return type for `ElementRegion::getAsArrayOffset()` where we handle only `nonloc::ConcreteInt`s. So, this is similar to RegionRawOffsetV2, but without the possibility of the symbolic index value.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

ElementRegions, again<br>

---------------------<br>

<br>

ElementRegions are might be nested as in this example:<br>

<br>

  void foo(int x) {<br>

    int buf[10][3];<br>

    int *p = &buf[1][2];<br>

    clang_analyzer_dump(p); // &Element{Element{buf,1 S64b,int [3]},2 S64b,int}<br>

    clang_analyzer_DumpRegionRawOffsetV2(p); // raw_offset_v2{buf,20 S64b}<br>

<br>

    int *q = &buf[x+1][6];<br>

    clang_analyzer_dump(q); // &Element{Element{buf,(reg_$0<int x>) + 1,int [3]},6 S64b,int}<br>

    clang_analyzer_DumpRegionRawOffsetV2(q); // raw_offset_v2{buf,(((reg_$0<int x>) + 1) * 12) + 24}<br>

  }<br>

<br>

In the example, the subscript expressions are expressions in the //object-language//.<br>

So, when we //touch// (evaluate, simplify, reorder, etc.) the `(reg_$0<int x>) + 1` we should //thouch// them just as the //abstract machine// would.<br>

<br>

Calculation of the RegionRawOffsetV2<br>

------------------------------------<br>

<br>

Let's see how does these subscript expressions used during the calculation of the `RegionRawOffsetV2`:<br>

For multi-dimension arrays we //fold// the index expressions of the nested `ElementRegion`s.<br>

We are doing that by simply calculating the byte-offset of the nested region, and adding the current level's //index*sizeof(Type)//.<br>

So, the `ByteOffset` that we build up will contain mostly multiplications by a constant OR additions of symbolic expressions (like the `x+1` in the example).<br></blockquote><div>We have these lines in the case of handling the ElementRegion: </div><div>        if (!index.getAs<NonLoc>())<br>          return RegionRawOffsetV2();<br></div><div>So, if the index is symbolic we give up, don't we? So, how could this work with `x+1`? What am I missing here?</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

During this //folding// operation we use the semantics of the //object-language// in the code <<a href="https://github.com/llvm/llvm-project/blob/bda3dd0d986b33c3a327c0ee0eb8ba43aa140699/clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp#L335-L342" rel="noreferrer" target="_blank">https://github.com/llvm/llvm-project/blob/bda3dd0d986b33c3a327c0ee0eb8ba43aa140699/clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp#L335-L342</a>>.<br>

<br>

  The resulting `RegionRawOffsetV2` for `p`, `q` will be:<br>

  p: {BaseRegion: `buf`, ByteOffset: `20 S64b`}<br>

  q: {BaseRegion: `buf`, ByteOffset: `(((reg_$0<int x>) + 1) * 12) + 24`}<br>

                                        ^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^<br>

                                                     |           |<br>

  `@a` This is an //object-language// expression. --/            |<br>

                                                                /<br>

  `@b` The rest should be //meta-language// expression. -------/<br>

<br>

SPOILER: This distinction is really important here.<br>

<br>

So, we made an expression, we should not evaluate in either of the worlds.<br>

We should not evaluate it using the semantics of the //object-language// since only the `@a` is of that world.<br>

We should model overflow in `@a` if that expression is of //unsigned// type (OR signed but `-fwrapv`...) etc. according to the //abstract machine//.<br>

BUT modeling the possibility of an overflow of the rest of the expression (`@b`) would be a flaw.<br></blockquote><div>Why? I'd expect that the value of the whole expression `@a@b` could overflow.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

This reasoning is the same for the other way around.<br>

<br>

---<br>

<br>

In general, such //mixed// expressions make no sense at all.<br>

Trying to evaluate such would lead to esoteric bugs just like the mentioned bug in the summary.<br>

<br>

Simplify step, again<br>

--------------------<br>

<br>

  Simplification of the `(((reg_$0<int x>) + 1) * 12) + 24` < `0` inequality...<br>

                                                ^^^^^^^^^^<br>

                                                   `@b`<br>

  Result:  `reg_$0<int x> < -3 S64b`<br>

            ^^^^^^^^^^^^^   ^^^^^^^<br>

  `RootSym` --/                |<br>

                              /<br>

  `C'` ----------------------/<br>

<br>

`#1`: This step supposed to fold **ONLY** the //meta-language// expression parts (`@b`) of the previously aquired `ByteOffset`.<br>

`#2`: This step requires the expression under simplified to be of //meta-language// to be able to reorder the constants. (aka. to fold into the right hand side's constant).<br>

`#3`: This also requires the right-hand side to be of the //meta-language//.<br></blockquote><div>Do I understand this correctly, that none of the binary operations can have a symbolic RHS, because that would mean we have a VLA?</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

Cause of the bug<br>

----------------<br>

<br>

We don't respect the `#1` and the `#3` requirements of the //simplification// step.<br>

<br>

We treat the complete expression under //simplification// as an expression of the //meta-language//.<br>

I'm not changing this, but I probably should though.<br></blockquote><div>Again, I don't understand why you are sure that the value of //whole// expression cannot overflow.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

Ideally, after //simplification// we should have this inequality: `x+1 < -2`<br>

That way we would not fold any subexpression of the //object-language//, so the `#1` requirement would be preserved.<br>

The right-hand side is of an expression of the //meta-language//.<br>

It makes sense, to evaluate the `operator<` as the semantics of the //meta-language//.<br>

But the left-hand side is an expression of the //object-language//.<br>

We need some sort of //logical// conversion here.<br></blockquote><div><div>What if the second index is also symbolic? E.g `buf[x+1][y+1]`?</div><div>This would result in `(((reg_$0<int x>) + 1) * 12) + (reg_$1<int y>) + 1)` < `0` . And after simplification, the RHS cannot be interpreted as //meta//.</div></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

If `x+1` is //unsigned// then there is no way that expression could result a //negative// value, so it will be always less then `-2`.<br>

Therefore, we proved that no out-of-bounds buffer access can happen //before// the first valid index (0).<br>

<br>

Otherwise, we evaluate the inequality according to the //object-language//.<br>

We call the `evalBinOpNN` to accomplish this query.<br>

That function, however, should respect the `-fwrapv`, overflow/underflow, and other nasty things of C++ but I don't care how that is implemented.<br>

The checker's job is done at this point.<br>

<br>

Note that, if the //inequality// is of the form `x - C < C'` after the //ideological-simplification// step, we have to deal with underflow similarly.<br>

In this patch, I haven't done it though. Since currently, we just fold `C` into `C'`...<br>

<br>

If the `LHS` is signed, ehm, I would have to think it through - but I'm too tired of writing this...<br>

<br>

The proposed fix<br>

----------------<br>

<br>

I know that this doesn't fix the root of the problem, but we have to start somewhere :D<br>

<br>

Check if the resulting //folded constant// (`C'`) is negative or not.</blockquote><div>What happens if the symbolic index is not the most inner index? E.g. `buf[6][x+1]`. Then `C` is no longer constant, and no longer //meta// ... </div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

If it is //negative// and the type of the //root expression// (`RootSym`) has **unsigned** type then I state that this inequality doesn't hold for any (unsigned) interpretation of `RootSym`.<br>

So just move on and check the upper-bound...<br>

<br>

---<br>

<br>

I'm not sure if this fixes all the false-positives, but certainly removes some of them.<br>

I'm not an expert in //abstract interpretation// and in //formal verification//.<br>

I hope you enjoyed reading this, now you should take a rest too!<br>

<br>

<br>

Repository:<br>

  rG LLVM Github Monorepo<br>

<br>

CHANGES SINCE LAST ACTION<br>

  <a href="https://reviews.llvm.org/D86874/new/" rel="noreferrer" target="_blank">https://reviews.llvm.org/D86874/new/</a><br>

<br>

<a href="https://reviews.llvm.org/D86874" rel="noreferrer" target="_blank">https://reviews.llvm.org/D86874</a><br>

<br></blockquote><div><br></div></div></div>