[clang] [clang] add array out-of-bounds access constraints using llvm.assume (PR #159046)

Sjoerd Meijer via cfe-commits cfe-commits at lists.llvm.org
Wed Oct 1 07:53:15 PDT 2025


sjoerdmeijer wrote:

I had another play with this patch, the updated one. The short summary:
- I am a little concerned about how intrusive this is, i.e. about the impact on compile time and performance. For my little example, the number of IR instructions in the vector body roughly doubles, but the final codegen for the vector body is the same, which is a good thing and an improvement. There are some codegen changes in the scalar loop, though. So my prediction is that this is not going to be compile-time friendly, and second, that we might see all sorts of performance corner cases, but only numbers will tell, I guess...
- Maybe this is getting ahead of things (i.e. of the numbers), but perhaps we can have a little think about whether we can be more selective about emitting these intrinsics; a strawman of what I mean is sketched right below.
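
As a purely hypothetical illustration of what "more selective" could mean (this is not what the patch does, and the heuristic is made up), the front end could skip the assume whenever the index is already statically known to be in range:

```
int arr[10];

// Hypothetical heuristic, for illustration only: only emit the assume when
// the index is not trivially in bounds.
int read_const(void) {
  return arr[3];   // constant index in [0, 10): an assume adds no information
}

int read_var(int i) {
  return arr[i];   // variable index: this is where an assume could still help
}
```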

Here's the longer story, the code examples I played with.

Small extension of the example in the description:
```
int arr[10];
int test_simple_array(int i, int n, int * __restrict A, int * __restrict B) {
  for (int i = 0; i < n; ++i)
    arr[i] += A[i] * B[i];
  return arr[i];
}
```
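
For context, my reading of what the new intrinsics express at the source level is roughly the following (illustration only: `__builtin_assume` stands in here for the `llvm.assume` calls that Clang emits at the IR level, and the bound 10 comes from the declared size of `arr`):

```
for (int i = 0; i < n; ++i) {
  // The access arr[i] must stay inside int arr[10], so the index is assumed
  // to be in [0, 10). The patch expresses this with llvm.assume in the IR.
  __builtin_assume(i >= 0 && i < 10);
  arr[i] += A[i] * B[i];
}
```

After vectorization this turns into the per-lane extractelement + llvm.assume sequence visible in the second IR dump below.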

The vector body before this patch is:

```
11:                                               ; preds = %11, %9
  %12 = phi i64 [ 0, %9 ], [ %29, %11 ]
  %13 = getelementptr inbounds nuw i32, ptr %2, i64 %12
  %14 = getelementptr inbounds nuw i8, ptr %13, i64 16
  %15 = load <4 x i32>, ptr %13, align 4, !tbaa !6
  %16 = load <4 x i32>, ptr %14, align 4, !tbaa !6
  %17 = getelementptr inbounds nuw i32, ptr %3, i64 %12
  %18 = getelementptr inbounds nuw i8, ptr %17, i64 16
  %19 = load <4 x i32>, ptr %17, align 4, !tbaa !6
  %20 = load <4 x i32>, ptr %18, align 4, !tbaa !6
  %21 = mul nsw <4 x i32> %19, %15
  %22 = mul nsw <4 x i32> %20, %16
  %23 = getelementptr inbounds nuw i32, ptr @arr, i64 %12
  %24 = getelementptr inbounds nuw i8, ptr %23, i64 16
  %25 = load <4 x i32>, ptr %23, align 4, !tbaa !6
  %26 = load <4 x i32>, ptr %24, align 4, !tbaa !6
  %27 = add nsw <4 x i32> %25, %21
  %28 = add nsw <4 x i32> %26, %22
  store <4 x i32> %27, ptr %23, align 4, !tbaa !6
  store <4 x i32> %28, ptr %24, align 4, !tbaa !6
  %29 = add nuw i64 %12, 8
  %30 = icmp eq i64 %29, %10
  br i1 %30, label %31, label %11, !llvm.loop !10
```

And with this patch:

```
11:                                               ; preds = %11, %9
  %12 = phi i64 [ 0, %9 ], [ %41, %11 ]
  %13 = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %9 ], [ %42, %11 ]
  %14 = add <4 x i64> %13, splat (i64 4)
  %15 = getelementptr inbounds nuw i32, ptr %2, i64 %12
  %16 = getelementptr inbounds nuw i8, ptr %15, i64 16
  %17 = load <4 x i32>, ptr %15, align 4, !tbaa !6
  %18 = load <4 x i32>, ptr %16, align 4, !tbaa !6
  %19 = getelementptr inbounds nuw i32, ptr %3, i64 %12
  %20 = getelementptr inbounds nuw i8, ptr %19, i64 16
  %21 = load <4 x i32>, ptr %19, align 4, !tbaa !6
  %22 = load <4 x i32>, ptr %20, align 4, !tbaa !6
  %23 = mul nsw <4 x i32> %21, %17
  %24 = mul nsw <4 x i32> %22, %18
  %25 = icmp ult <4 x i64> %13, splat (i64 10)
  %26 = icmp ult <4 x i64> %14, splat (i64 10)
  %27 = extractelement <4 x i1> %25, i64 0
  tail call void @llvm.assume(i1 %27)
  %28 = extractelement <4 x i1> %25, i64 1
  tail call void @llvm.assume(i1 %28)
  %29 = extractelement <4 x i1> %25, i64 2
  tail call void @llvm.assume(i1 %29)
  %30 = extractelement <4 x i1> %25, i64 3
  tail call void @llvm.assume(i1 %30)
  %31 = extractelement <4 x i1> %26, i64 0
  tail call void @llvm.assume(i1 %31)
  %32 = extractelement <4 x i1> %26, i64 1
  tail call void @llvm.assume(i1 %32)
  %33 = extractelement <4 x i1> %26, i64 2
  tail call void @llvm.assume(i1 %33)
  %34 = extractelement <4 x i1> %26, i64 3
  tail call void @llvm.assume(i1 %34)
  %35 = getelementptr inbounds nuw i32, ptr @arr, i64 %12
  %36 = getelementptr inbounds nuw i8, ptr %35, i64 16
  %37 = load <4 x i32>, ptr %35, align 4, !tbaa !6
  %38 = load <4 x i32>, ptr %36, align 4, !tbaa !6
  %39 = add nsw <4 x i32> %37, %23
  %40 = add nsw <4 x i32> %38, %24
  store <4 x i32> %39, ptr %35, align 4, !tbaa !6
  store <4 x i32> %40, ptr %36, align 4, !tbaa !6
  %41 = add nuw i64 %12, 8
  %42 = add <4 x i64> %13, splat (i64 8)
  %43 = icmp eq i64 %41, %10
  br i1 %43, label %44, label %11, !llvm.loop !10
```

As I mentioned, the good thing is that all of this gets optimised away and the final codegen for the vector body is the same, but it is quite an expansion at the IR level.
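
To be fair about why the expansion can still pay off (this is a generic property of `llvm.assume`, not something this particular example exercises; the function below is made up for illustration): later passes can use the assumed range to fold redundant checks, e.g.:

```
int arr[10];

int read_checked(int i) {
  __builtin_assume(i >= 0 && i < 10);
  if (i >= 10)   // provably false under the assumption, so it folds away
    return 0;
  return arr[i];
}
```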

The scalar loop before is:

```
.LBB0_7:                                // =>This Inner Loop Header: Depth=1
        ldr     w10, [x12], #4
        ldr     w15, [x13]
        ldr     w14, [x11], #4
        subs    x9, x9, #1
        madd    w10, w14, w10, w15
        str     w10, [x13], #4
        b.ne    .LBB0_7
```

And with this patch it is:
```
.LBB0_6:                                // =>This Inner Loop Header: Depth=1
        ldr     w11, [x2, x10, lsl #2]
        ldr     w12, [x3, x10, lsl #2]
        ldr     w13, [x8, x10, lsl #2]
        madd    w11, w12, w11, w13
        str     w11, [x8, x10, lsl #2]
        add     x10, x10, #1
        cmp     x9, x10
        b.ne    .LBB0_6
```
It might perform the same; all I'm saying is that it is different, and the new version is one instruction longer because the loop now counts up instead of down: the count-down version folds the exit check into the decrement (`subs` sets the flags for the branch), whereas the count-up version needs a separate `cmp`.


https://github.com/llvm/llvm-project/pull/159046

