[clang] [Wunsafe-buffer-usage] False positives for & expression indexing constant size array (arr[anything & 0]) (PR #112284)

Wed Oct 30 10:16:40 PDT 2024

================
@@ -420,6 +420,118 @@ AST_MATCHER(CXXConstructExpr, isSafeSpanTwoParamConstruct) {
   return false;
 }
 
+class MaxValueEval : public RecursiveASTVisitor<MaxValueEval> {
+
+  std::vector<llvm::APInt> val;
+  ASTContext &Context;
+  llvm::APInt Max;
+  unsigned bit_width;
+
+public:
+  typedef RecursiveASTVisitor<MaxValueEval> VisitorBase;
+
+  explicit MaxValueEval(ASTContext &Ctx, const Expr *exp) : Context(Ctx) {
+    bit_width = Ctx.getIntWidth(exp->getType());
+    Max = llvm::APInt::getSignedMaxValue(bit_width);
+    val.clear();
+  }
+
+  bool findMatch(Expr *exp) {
+    TraverseStmt(exp);
+    return true;
+  }
+
+  bool VisitDeclRefExpr(DeclRefExpr *dre) {
+    val.push_back(Max);
+    return false;
+  }
+
+  bool VisitArraySubscriptExpr(ArraySubscriptExpr *E) {
+    val.push_back(Max);
+    return false;
+  }
+
+  bool EvaluateExpression(Expr *exp) {
+    Expr::EvalResult EVResult;
+    if (exp->EvaluateAsInt(EVResult, Context)) {
+      llvm::APSInt Result = EVResult.Val.getInt();
+      val.push_back(Result);
+      return true;
+    }
+    return false;
+  }
+
+  bool VisitBinaryOperator(BinaryOperator *E) {
----------------
jkorous-apple wrote:

Let's first establish the scope of this work.

The current approach seems very ambitious (which is generally great!) but misses some of the tricky details.
For example the same arithmetic expression evaluated in APInt domain (Arbitrary Precision Integer) and e. g. `uint32_t` domain gives different results because of overflow/underflow, etc.

I don't think we need to be that ambitious here and now - if we leave some false positives for `-Wunsafe-buffer-usage` for `my_array[ /* some weird expression */ ]` that's fine and we can improve it later using the data from the adoption as a signal for prioritization.
I also assume correct implementation of the most ambitious version of this functionality would take way more effort and time than is reasonable to invest into this problem at this point.

The conclusion for me is - let's support only certain types of expressions for now and keep the implementation simpler.

Based on what we've seen in the adoption and my intuition I feel that this approximation might be sufficient for now (let's discuss what I've missed though!):
`arbitrary_unsigned_E & constant` - max is `constant`
`constant & arbitrary_unsigned_E` - max is `constant`
`arbitrary_unsigned_E % constant` - max is `constant`
`arbitrary_unsigned_E >> constant` - max is something like (this is just buggy pseudo-code) `pow( max(bitwidth(arbitrary_E) - constant, 0), 2 )`

For this we might also avoid using the recursive visitor and simplify the implementation to just a single function where we traverse the expression manually using `getLHS`, `getRHS`, etc.

https://github.com/llvm/llvm-project/pull/112284