[llvm] [LV] Compute register usage for interleaving on VPlan. (PR #126437)

Fri May 9 03:46:19 PDT 2025

lukel97 wrote:

> Sure, here's the IR reproducer to be executed via `opt -passes=loop-vectorize reproducer.ll -S -o -` [reproducer.txt](https://github.com/user-attachments/files/20118018/reproducer.txt) (changed the file extension to .txt to allow uploading).
> 
> The C++ function and the compiler flags used to generate the IR are here https://godbolt.org/z/ebv3nc4hY

I took a quick look at this, and my guess is that it's due to the interleave group recipes hoisting the loads and extending their live ranges. With the legacy cost model, each pair of loads sits right beside its fmul/fadd uses:

```llvm
  %indvars.iv = phi i64 [ 0, %.preheader ], [ %indvars.iv.next, %5 ]
  %.118 = phi float [ 0.000000e+00, %.preheader ], [ %39, %5 ]
  %6 = getelementptr inbounds nuw [32000 x float], ptr @a, i64 0, i64 %indvars.iv
  %7 = load float, ptr %6, align 4, !tbaa !8
  %8 = getelementptr inbounds nuw [32000 x float], ptr @b, i64 0, i64 %indvars.iv
  %9 = load float, ptr %8, align 4, !tbaa !8
  %10 = fmul fast float %9, %7
  %11 = fadd fast float %10, %.118
  %12 = add nuw nsw i64 %indvars.iv, 1
  %13 = getelementptr inbounds nuw [32000 x float], ptr @a, i64 0, i64 %12
  %14 = load float, ptr %13, align 4, !tbaa !8
  %15 = getelementptr inbounds nuw [32000 x float], ptr @b, i64 0, i64 %12
  %16 = load float, ptr %15, align 4, !tbaa !8
  %17 = fmul fast float %16, %14
  %18 = fadd fast float %11, %17
  %19 = add nuw nsw i64 %indvars.iv, 2
  %20 = getelementptr inbounds nuw [32000 x float], ptr @a, i64 0, i64 %19
  %21 = load float, ptr %20, align 4, !tbaa !8
  %22 = getelementptr inbounds nuw [32000 x float], ptr @b, i64 0, i64 %19
  %23 = load float, ptr %22, align 4, !tbaa !8
  %24 = fmul fast float %23, %21
  %25 = fadd fast float %18, %24
  %26 = add nuw nsw i64 %indvars.iv, 3
  %27 = getelementptr inbounds nuw [32000 x float], ptr @a, i64 0, i64 %26
  %28 = load float, ptr %27, align 4, !tbaa !8
  %29 = getelementptr inbounds nuw [32000 x float], ptr @b, i64 0, i64 %26
  %30 = load float, ptr %29, align 4, !tbaa !8
  %31 = fmul fast float %30, %28
  %32 = fadd fast float %25, %31
  %33 = add nuw nsw i64 %indvars.iv, 4
  %34 = getelementptr inbounds nuw [32000 x float], ptr @a, i64 0, i64 %33
  %35 = load float, ptr %34, align 4, !tbaa !8
  %36 = getelementptr inbounds nuw [32000 x float], ptr @b, i64 0, i64 %33
  %37 = load float, ptr %36, align 4, !tbaa !8
  %38 = fmul fast float %37, %35
  %39 = fadd fast float %32, %38
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 5
  %40 = icmp samesign ult i64 %indvars.iv, 31995
  br i1 %40, label %5, label %3, !llvm.loop !12
```

In VPlan, the loads are now all grouped into the two interleave groups ahead of the first fmul, so there are 10 loads live at the same time:

```
  vector.body:
    EMIT vp<%4> = CANONICAL-INDUCTION ir<0>, vp<%index.next>
    ir<%indvars.iv> = WIDEN-INDUCTION  ir<0>, ir<5>, vp<%0>
    WIDEN-REDUCTION-PHI ir<%.118> = phi ir<0.000000e+00>, ir<%39>
    CLONE ir<%6> = getelementptr inbounds nuw ir<@a>, ir<0>, ir<%indvars.iv>
    INTERLEAVE-GROUP with factor 5 at %7, ir<%6>
      ir<%7> = load from index 0
      ir<%14> = load from index 1
      ir<%21> = load from index 2
      ir<%28> = load from index 3
      ir<%35> = load from index 4
    CLONE ir<%8> = getelementptr inbounds nuw ir<@b>, ir<0>, ir<%indvars.iv>
    INTERLEAVE-GROUP with factor 5 at %9, ir<%8>
      ir<%9> = load from index 0
      ir<%16> = load from index 1
      ir<%23> = load from index 2
      ir<%30> = load from index 3
      ir<%37> = load from index 4
    WIDEN ir<%10> = fmul fast ir<%9>, ir<%7>
    WIDEN ir<%11> = fadd fast ir<%10>, ir<%.118>
    CLONE ir<%12> = add nuw nsw ir<%indvars.iv>, ir<1>
    CLONE ir<%13> = getelementptr inbounds nuw ir<@a>, ir<0>, ir<%12>
    CLONE ir<%15> = getelementptr inbounds nuw ir<@b>, ir<0>, ir<%12>
    WIDEN ir<%17> = fmul fast ir<%16>, ir<%14>
    WIDEN ir<%18> = fadd fast ir<%11>, ir<%17>
    CLONE ir<%19> = add nuw nsw ir<%indvars.iv>, ir<2>
    CLONE ir<%20> = getelementptr inbounds nuw ir<@a>, ir<0>, ir<%19>
    CLONE ir<%22> = getelementptr inbounds nuw ir<@b>, ir<0>, ir<%19>
    WIDEN ir<%24> = fmul fast ir<%23>, ir<%21>
    WIDEN ir<%25> = fadd fast ir<%18>, ir<%24>
    CLONE ir<%26> = add nuw nsw ir<%indvars.iv>, ir<3>
    CLONE ir<%27> = getelementptr inbounds nuw ir<@a>, ir<0>, ir<%26>
    CLONE ir<%29> = getelementptr inbounds nuw ir<@b>, ir<0>, ir<%26>
    WIDEN ir<%31> = fmul fast ir<%30>, ir<%28>
    WIDEN ir<%32> = fadd fast ir<%25>, ir<%31>
    CLONE ir<%33> = add nuw nsw ir<%indvars.iv>, ir<4>
    CLONE ir<%34> = getelementptr inbounds nuw ir<@a>, ir<0>, ir<%33>
    CLONE ir<%36> = getelementptr inbounds nuw ir<@b>, ir<0>, ir<%33>
    WIDEN ir<%38> = fmul fast ir<%37>, ir<%35>
    WIDEN ir<%39> = fadd fast ir<%32>, ir<%38>
    CLONE ir<%indvars.iv.next> = add nuw nsw ir<%indvars.iv>, ir<5>
    CLONE ir<%40> = icmp ult ir<%indvars.iv>, ir<31995>
```

So whilst the legacy model computed a maximum of 3 vector registers in use, the VPlan-based one now computes 12:

```
# legacy
LV(REG): Found max usage: 2 item
LV(REG): RegisterClass: Generic::ScalarRC, 2 registers
LV(REG): RegisterClass: Generic::VectorRC, 3 registers
# vplan
LV(REG): Found max usage: 2 item
LV(REG): RegisterClass: Generic::ScalarRC, 3 registers
LV(REG): RegisterClass: Generic::VectorRC, 12 registers
```
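To illustrate why the position of the loads matters so much, here is a rough toy model in Python. It is not LLVM's actual register usage code: the helpers `max_live`, `interleaved` and `hoisted` are made up for this sketch, and it only tracks the loads, fmuls, fadds and the reduction accumulator, so it isn't expected to reproduce the exact 12 reported above. It counts how many values are live on entry to each instruction, for the same five multiply-accumulate steps in the two orders.

```python
def max_live(schedule, live_in=()):
    """Peak number of values live on entry to any instruction.

    schedule is a list of (defined_value, [used_values]) in program order.
    This is a simplified model, not LLVM's implementation.
    """
    # Record the index of the last instruction that uses each value.
    last_use = {}
    for idx, (_, uses) in enumerate(schedule):
        for u in uses:
            last_use[u] = idx

    live = set(live_in)
    peak = 0
    for idx, (dst, uses) in enumerate(schedule):
        # Operands of this instruction are still live at this point.
        peak = max(peak, len(live))
        # Operands whose last use is here die; the result becomes live.
        live -= {u for u in uses if last_use[u] == idx}
        live.add(dst)
    return peak


def interleaved(n):
    """Each pair of loads sits right next to its fmul/fadd."""
    sched = []
    for i in range(n):
        sched.append((f"a{i}", []))                          # load from a
        sched.append((f"b{i}", []))                          # load from b
        sched.append((f"mul{i}", [f"a{i}", f"b{i}"]))        # fmul
        sched.append((f"acc{i + 1}", [f"acc{i}", f"mul{i}"]))  # fadd
    return sched


def hoisted(n):
    """All loads are grouped first, then the fmul/fadd chain."""
    sched = [(f"a{i}", []) for i in range(n)]
    sched += [(f"b{i}", []) for i in range(n)]
    for i in range(n):
        sched.append((f"mul{i}", [f"a{i}", f"b{i}"]))
        sched.append((f"acc{i + 1}", [f"acc{i}", f"mul{i}"]))
    return sched


# "acc0" stands in for the reduction phi, which is live on entry to the loop body.
legacy_like = max_live(interleaved(5), live_in=("acc0",))
vplan_like = max_live(hoisted(5), live_in=("acc0",))
print(legacy_like, vplan_like)
```

With five load pairs the toy model gives 3 for the interleaved order and 11 for the hoisted one. In this model the hoisted peak grows as 2n + 1 with the number of load pairs n, whereas the interleaved peak stays at 3.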

I'm not sure what a good fix for this is. 

https://github.com/llvm/llvm-project/pull/126437