[llvm-branch-commits] [llvm] [LoopInterchange] Disable LoopCacheAnalysis-based heuristic by default (PR #193478)

Tue Apr 28 07:27:50 PDT 2026

kasuga-fj wrote:

I'm not sure if this is sufficient, but I ran llvm-test-suite with several externals (SPEC2017, FFmpeg, and dav1d) to collect data. Here is a brief summary:

- There are few cases where loop interchange is applied in the first place.
- By enabling the LoopCacheAnalysis-based heuristic, the set of cases where loop interchange is applied becomes strictly larger than when it is disabled.
- Although no performance degradation was observed in the full benchmarks, the additional cases look like they could degrade performance.

## Details

Here is the number of cases where loop interchange was applied under each setting:

| | `instorder,vectorize` | `cache,instorder,vectorize` | `cache,instorder,vectorize` and `cache-line-size=64` |
|-|-|-|-|
| # of applied | 2 | 10 (+8) | 12 (+10) |

Note that the additional +8 and +10 cases are different from each other. I grouped these additional cases into roughly three categories, all of which I believe have the potential to degrade performance.

### Category 1

```c
int **index;
...
for (i = 0; i < N; i++)
  for (j = 0; j < M; j++)
    use(A[index[j][i]]);
```

Although interchanging these loops may improve locality for `index`, it can potentially worsen locality for `A`. In such cases, I think we should be conservative and avoid applying loop interchange.
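
To make the concern concrete, here is a minimal C sketch of both loop orders. It assumes `use` can be replaced by a simple accumulation and that `index` holds `M` separately allocated rows of `N` entries; the function names are hypothetical and not taken from the benchmark.

```c
#include <assert.h>
#include <stddef.h>

/* Original order from the snippet: i outer, j inner.
   The inner loop reads index[j][i] for consecutive j, i.e. it jumps
   from one row of `index` to the next on every iteration. */
long sum_original(const int *A, int *const *index, size_t N, size_t M) {
    assert(A != NULL && index != NULL);
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < M; j++)
            sum += A[index[j][i]]; /* stands in for use(...) (assumption) */
    return sum;
}

/* Interchanged order: j outer, i inner.
   Now index[j][i] is read contiguously within a row, but the order in
   which A is visited changes and depends entirely on the index values. */
long sum_interchanged(const int *A, int *const *index, size_t N, size_t M) {
    assert(A != NULL && index != NULL);
    long sum = 0;
    for (size_t j = 0; j < M; j++)
        for (size_t i = 0; i < N; i++)
            sum += A[index[j][i]];
    return sum;
}
```

If `index[j][i]` happens to be `i * M + j`, the original order walks `A` sequentially while the interchanged order strides through it by `M` elements, which is the kind of case where the interchange could hurt.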

### Category 2

```fortran
do i = 2,N
  A(:,:,i) = A(:,:,i) + A(:,:,1)
end do
```

In this case, the innermost loop (over the leftmost dimension) was fully unrolled, and the outer loop and the middle loop (over the middle dimension) were interchanged. This did not have a significant impact on performance in the full benchmark, likely because this code accounts for only a small portion of the program. However, when I extracted and ran this part separately, the interchange degraded performance by about 15%. The impact may depend on the array size.
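
As a rough illustration of the two loop orders, here is a C sketch of my reading of the description above. The array is stored as a flat row-major buffer with `l` as the contiguous dimension (mirroring Fortran's leftmost index); the extents `n`, `d2`, `d1` and all function names are hypothetical, and the innermost loop is left rolled for readability.

```c
#include <assert.h>
#include <stddef.h>

/* Flat offset of element (i, k, l); l is the contiguous dimension. */
static size_t at(size_t i, size_t k, size_t l, size_t d2, size_t d1) {
    return (i * d2 + k) * d1 + l;
}

/* Original order: slice loop i outermost, middle dimension k next. */
void add_slice_original(double *A, size_t n, size_t d2, size_t d1) {
    assert(A != NULL);
    for (size_t i = 1; i < n; i++)
        for (size_t k = 0; k < d2; k++)
            for (size_t l = 0; l < d1; l++)
                A[at(i, k, l, d2, d1)] += A[at(0, k, l, d2, d1)];
}

/* Interchanged order: k outermost, i in the middle.
   Consecutive i iterations are now d2 * d1 elements apart in A. */
void add_slice_interchanged(double *A, size_t n, size_t d2, size_t d1) {
    assert(A != NULL);
    for (size_t k = 0; k < d2; k++)
        for (size_t i = 1; i < n; i++)
            for (size_t l = 0; l < d1; l++)
                A[at(i, k, l, d2, d1)] += A[at(0, k, l, d2, d1)];
}
```

Both versions compute the same result; only the memory traversal order differs, so any performance gap would come from how the chosen extents interact with the cache.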

### Category 3

```fortran
do n = 1,N0
  do j = 1,N2
    do i = 1,N1
      A(i,j) = A(i,j) + B(i,n)*C(j)
    end do
  end do
end do
```

Here, the j-loop and i-loop were interchanged. As with Category 2, this did not significantly affect the performance of the full program, but in the extracted part, performance degraded by about 20%.
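
For reference, here is a C sketch of the same loop nest before and after the interchange, under the assumption that the Fortran arrays are mapped to flat row-major buffers with `i` as the contiguous dimension (so Fortran `A(i,j)` becomes `A[j * n1 + i]`). The function names and the 0-based bounds are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Original order: n, j, i. The inner i-loop walks A and B contiguously. */
void update_original(double *A, const double *B, const double *C,
                     size_t n0, size_t n2, size_t n1) {
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t n = 0; n < n0; n++)
        for (size_t j = 0; j < n2; j++)
            for (size_t i = 0; i < n1; i++)
                A[j * n1 + i] += B[n * n1 + i] * C[j];
}

/* Interchanged order: n, i, j. The inner j-loop reads C contiguously and
   keeps B[n * n1 + i] invariant, but steps through A with stride n1. */
void update_interchanged(double *A, const double *B, const double *C,
                         size_t n0, size_t n2, size_t n1) {
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t n = 0; n < n0; n++)
        for (size_t i = 0; i < n1; i++)
            for (size_t j = 0; j < n2; j++)
                A[j * n1 + i] += B[n * n1 + i] * C[j];
}
```

Since `A` is both read and written in the innermost loop, turning its access into a strided one is a plausible source of the slowdown, although I have not verified that this is the only cause.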

---

Given these observations, I still think we should disable the LoopCacheAnalysis-based heuristic by default for now. Especially when enabling loop interchange by default, avoiding performance regressions may be more important than increasing the number of applied cases.

That said, I am not strongly committed to pushing in this direction. If others feel that keeping the current behavior is preferable, I am fine with that as well. However, at a glance, the maintenance cost of LoopCacheAnalysis seems relatively high.

What do you think?

> One problem could be that we have stripped out a few thing here and there and thus made things more pessimistic, so it won't be triggering as much as it used to do and may not be showing a lot of uplifts

I don't think this is an issue with the cost model. In my opinion, there are two main reasons why loop interchange is applied less frequently than before:

- The legality checks in loop-interchange. I believe some of these will be resolved by #193480 and #193481.
- The capability of DA (DependenceAnalysis). From what I have seen, there are inherent limits to relying solely on the analysis capabilities of SCEV for overflow-related checks. This is probably one of the biggest challenges after enabling loop-interchange by default.

https://github.com/llvm/llvm-project/pull/193478