[clang] [clang][analyzer] Improved PointerSubChecker (PR #93676)

Donát Nagy via cfe-commits cfe-commits at lists.llvm.org
Thu May 30 11:25:39 PDT 2024


================
@@ -44,24 +44,30 @@ void PointerSubChecker::checkPreStmt(const BinaryOperator *B,
 
   const MemRegion *LR = LV.getAsRegion();
   const MemRegion *RR = RV.getAsRegion();
-
-  if (!(LR && RR))
-    return;
-
-  const MemRegion *BaseLR = LR->getBaseRegion();
-  const MemRegion *BaseRR = RR->getBaseRegion();
-
-  if (BaseLR == BaseRR)
+  if (!LR || !RR)
     return;
 
-  // Allow arithmetic on different symbolic regions.
-  if (isa<SymbolicRegion>(BaseLR) || isa<SymbolicRegion>(BaseRR))
-    return;
+  const auto *ElemLR = dyn_cast<ElementRegion>(LR);
+  const auto *ElemRR = dyn_cast<ElementRegion>(RR);
+  // FIXME: We want to verify that these are elements of an array.
+  // Because behavior of ElementRegion it may be confused with a cast.
+  // There is not a simple way to distinguish it from array element (check the
+  // types?). Because this missing check a warning is missing in the rare case
+  // when two casted pointers to the same region (variable) are subtracted.
----------------
NagyDonat wrote:

I discussed your examples and questions with @whisperity and Zoltán Porkoláb, and I'm trying to summarize what we determined. Note that I'm mostly working from the most recent C++ draft standard (http://eel.is/c++draft/), so some the conclusions may be invalid in C or older versions of C++ (but I tried to highlight the differences that I know about).

(1) In your first example the step `int *b = a` is invalid because the `int[3][3]` array `a` decays to an `int (*)[3]` (pointer to array of 3 `int`s) and that type is not interconvertible with a plain `int *` (see [ [basic.compound] note 5](http://eel.is/c++draft/basic.compound#note-5) in the C++ draft standard). (And if you define `b` as `int (*b)[3] = a`, then the pointer subtraction will become invalid.)

(2) In the second example using a `long *` to initialize `char *` or `short *` variables is _usually_ an error, it's accepted under old C, but in C++ you need an explicit cast and e.g. modern GCC also produces an error (`-Wincompatible-pointer-types` a warning that's by default an error) when it compiles this.

I'd guess that the actual pointer arithmetic _is_ standard-compliant, but the relevant parts of the standard ([[expr.reinterpret.cast] 7](http://eel.is/c++draft/expr.post#expr.reinterpret.cast-7), [[expr.static.cast] 14](http://eel.is/c++draft/expr.post#expr.static.cast-14))are too vague to give a confident answer.

Note that under C++ _accessing_ the element `short b[1]` would be a violation of [[basic.lval] part 11](http://eel.is/c++draft/basic.lval#def:type-accessible), but based on [[defns.access]](http://eel.is/c++draft/defns.access) I'd say that the expression `&b[1]` is _not_ an access of `b[1]`.

(3) In your third example `&a[2][2] - &a[1][1]` is clearly working in practically all implementations, but we're fairly sure that it's not compliant with the C++ standard, because [ [expr.add] part 5.2](https://www.eel.is/c++draft/expr.add#5.2) speaks about "array elements _i_ and _j_ of the same array object _x_" (and later "_i_ - _j_" which shows that "_i_" and "_j_" are scalar numbers) -- while in your example `a[2][2]` and `a[1][1]` are not (numerically indexed) elements within the same array. (This is coherent with [[dcl.array]](http://eel.is/c++draft/dcl.array) where an "array" implicitly means a one-dimensional array structure (whose elements may be other arrays).)

Nevertheless, it may be reasonable to avoid emitting warnings on this kind of code, because that could be better for the users.

(4) The definition of "_what exactly an "array object" is_" and "what is an _element_ of an array" primarily appears at [[dcl.array] part 6](http://eel.is/c++draft/dcl.array#6):

> An object of type “array of N U” consists of a contiguously allocated non-empty set of N subobjects of type U, known as the _elements_ of the array, and numbered 0 to N-1.

(This clearly shows that the _element of an element of_ an array is not an _element of_ the array. The notion of "multi-dimensional arrays" only appears in the _Example_ [[dcl.array] Example 4](http://eel.is/c++draft/dcl.array#example-4) with word choices that suggests that this is just a mathematical/intuitive notion and not something that's well-defined in the standard.)

This general definition is augmented by [[basic.compound] part 3, sentence 11](http://eel.is/c++draft/basic.compound#3.sentence-11) which states that:

> For purposes of pointer arithmetic ([[expr.add]](http://eel.is/c++draft/expr.add)) and comparison ([[expr.rel]](http://eel.is/c++draft/expr.rel), [[expr.eq]](http://eel.is/c++draft/expr.eq)), a pointer past the end of the last element of an array x of n elements is considered to be equivalent to a pointer to a hypothetical array element n of x and an object of type T that is not an array element is considered to belong to an array with one element of type T.

(5) Yes, the sub-arrays of a multidimensional array are separate "array objects" that have their own elements -- but they are also single elements within the same bigger array. This means that if we have `int arr[3][3]`, then
- `int x = &(arr[2]) - &(arr[1])` is valid as the difference of two `int (*)[3]` pointers that point to elements of the same array `arr`
- `int y = &(arr[2][1]) - &(arr[2][2])` is valid as the difference of two `int *` pointers that point to elements of the same array `arr[2]`
- `int z = &(arr[1][1]) - &(arr[2][2])` is invalid, because one pointer points to an element of `arr[1]`, while the other points to element of `arr[2]`
- `int w = arr[2] - arr[1]` is also invalid, because these arrays decay to `int *` pointer and this difference is equivalent to `&(arr[2][0]) - &(arr[1][0])`

(6) "_If an address (of a variable) is converted to a char * is this the same array object as the original variable (for example l)?_" -- My intuition is that here we're dealing with an imagined `char[]` array object that "covers" the same memory region as the original variable, and any pointer trickery that start from the original object and produce `char *` pointers pointing into the original object essentially produces pointers to elements of this imagined `char[]` array. (We might say that this imagined `char[]` array is the [_object representation_](http://eel.is/c++draft/basic.types.general#4) of the array, although that's probably not exactly accurate.)

(7)
> Invalid indexing like `int a[3][3]; int x = a[0][4];` is undefined behavior  but why should then be allowed to use pointers to such an object and index it like an one-dimensional array, [...]

I'm not exactly sure what you're speaking about here. I completely agree that `int a[3][3]; int x = a[0][4];` is undefined behavior, but there may be multiple ways to convert the multidimensional array into a single-dimensional one and some of them might be legitimate.

(8)
> or convert an array pointer into an array with different type and index it

This seems to be forbidden by [[expr.add] part 6](http://eel.is/c++draft/expr.add#6). Note that the [definition of the subscript operator](eel.is/c++draft/expr.post#expr.sub) is based on pointer addition.

(9)
> We should allow anything that points into the same memory block (that is a variable or any array), or only allow pointers to the same array with same type and same size.
I don't understand this sentence; but note that [[expr.add] part 2.2](http://eel.is/c++draft/expr.add#2.2) demands that in pointer subtraction the two operands must be pointers to (cv-qualified or cv-unqualified versions of) the same completely-defined object type.

https://github.com/llvm/llvm-project/pull/93676


More information about the cfe-commits mailing list