[llvm-branch-commits] [llvm] [LoopInterchange] Disable LoopCacheAnalysis-based heuristic by default (PR #193478)
Ryotaro Kasuga via llvm-branch-commits
llvm-branch-commits at lists.llvm.org
Tue Apr 28 07:27:50 PDT 2026
kasuga-fj wrote:
I'm not sure if this is sufficient, but I ran llvm-test-suite with several externals (SPEC2017, FFmpeg, and dav1d) to collect data. Here is a brief summary:
- There are few cases where loop interchange is applied in the first place.
- By enabling the LoopCacheAnalysis-based heuristic, the set of cases where loop interchange is applied becomes strictly larger than when it is disabled.
- Although no performance degradation was observed in the full benchmarks, the additional cases look like ones where interchange could degrade performance.
## Details
Here is the number of cases where loop interchange is applied under each setting:
| | `instorder,vectorize` | `cache,instorder,vectorize` | `cache,instorder,vectorize` and `cache-line-size=64` |
|-|-|-|-|
| # of applied | 2 | 10 (+8) | 12 (+10) |
Note that the additional +8 and +10 cases are different from each other. I categorized these increased cases into roughly three groups, all of which I believe have the potential to degrade performance.
### Category 1
```c
int **index;
...
for (i = 0; i < N; i++)
  for (j = 0; j < M; j++)
    use(A[index[j][i]]);
```
Although this may improve locality for `index`, it can potentially worsen locality for `A`. In such cases, I think we should be conservative and avoid applying loop interchange.
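To make the concern concrete, here is a minimal sketch under one assumed set of index contents. The table `index_tab[j][i] = i * M + j`, the small extents, and the helper names `fill_index` and `a_travel` are all hypothetical and are not taken from the benchmark. With these contents, the original order walks `A` sequentially, while the interchanged order walks `index_tab` sequentially but jumps around in `A`.

```c
#include <stdlib.h>

enum { N = 4, M = 3 }; /* hypothetical trip counts */

/* Hypothetical contents: index_tab[j][i] = i * M + j, chosen so that the
   original (i outer, j inner) order visits A[0], A[1], A[2], ... */
void fill_index(int index_tab[M][N]) {
    for (int j = 0; j < M; j++)
        for (int i = 0; i < N; i++)
            index_tab[j][i] = i * M + j;
}

/* Sum of the distances between consecutive subscripts of A.
   interchanged == 0: i outer, j inner (original order).
   interchanged != 0: j outer, i inner (after loop interchange). */
long a_travel(int interchanged) {
    int index_tab[M][N];
    fill_index(index_tab);

    long total = 0;
    int prev = index_tab[0][0];
    if (!interchanged) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++) {
                total += abs(index_tab[j][i] - prev);
                prev = index_tab[j][i];
            }
    } else {
        for (int j = 0; j < M; j++)
            for (int i = 0; i < N; i++) {
                total += abs(index_tab[j][i] - prev);
                prev = index_tab[j][i];
            }
    }
    return total;
}
```

With these assumed contents, `a_travel(0)` is 11 (unit steps through `A`) and `a_travel(1)` is 43. The compiler cannot see the contents of `index` at compile time, so it cannot tell which order is better for `A`.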
### Category 2
```fortran
do i = 2,N
  A(:,:,i) = A(:,:,i) + A(:,:,1);
end do
```
In this case, the innermost loop (over the leftmost dimension) was fully unrolled, and the outer loop and the middle loop (over the middle dimension) were interchanged. This did not have a significant impact on performance in the full benchmark, likely because this code accounts for only a small portion of the program. However, when I extracted and ran this part separately, the interchange degraded performance by about 15%. The impact may depend on the array size.
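For readers less familiar with Fortran array syntax, the following is a rough C sketch of the loop shapes before and after the transformation described above. The extents `L`, `K`, and `N` and the function names are hypothetical, and the row-major array `a[i][k][l]` stands in for the column-major `A(l,k,i)`, so `l` is the contiguous dimension in both.

```c
#include <string.h>

enum { L = 2, K = 3, N = 4 }; /* hypothetical extents; L is the small leftmost dimension */

/* Original shape: slab i outer, middle dimension k next, leftmost l inner. */
void update_original(double a[N][K][L]) {
    for (int i = 1; i < N; i++)
        for (int k = 0; k < K; k++)
            for (int l = 0; l < L; l++)
                a[i][k][l] += a[0][k][l];
}

/* Shape after the transformation: l fully unrolled (L == 2 assumed),
   and the i and k loops interchanged, so each inner step jumps a whole slab. */
void update_interchanged(double a[N][K][L]) {
    for (int k = 0; k < K; k++)
        for (int i = 1; i < N; i++) {
            a[i][k][0] += a[0][k][0];
            a[i][k][1] += a[0][k][1];
        }
}

/* Returns 1 if both shapes produce identical arrays. */
int updates_agree(void) {
    double x[N][K][L], y[N][K][L];
    for (int i = 0; i < N; i++)
        for (int k = 0; k < K; k++)
            for (int l = 0; l < L; l++)
                x[i][k][l] = y[i][k][l] = i * 100 + k * 10 + l;
    update_original(x);
    update_interchanged(y);
    return memcmp(x, y, sizeof x) == 0;
}
```

Both shapes compute the same values because slab 0 is only read. The difference is purely in the order of memory accesses.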
### Category 3
```fortran
do n = 1,N0
  do j = 1,N2
    do i = 1,N1
      A(i,j) = A(i,j) + B(i,n)*C(j)
    end do
  end do
end do
```
Here, the j-loop and i-loop were interchanged. As with Category 2, this did not significantly affect the performance of the full program, but in the extracted part, performance degraded by about 20%.
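A C rendering of the two loop orders may help show why this interchange looks unprofitable. This is only a sketch: the extents, the input values, and the function names are hypothetical, and the row-major `a[j][i]` and `b[n][i]` stand in for the column-major `A(i,j)` and `B(i,n)`, so `i` is the contiguous index.

```c
#include <string.h>

enum { N0 = 3, N1 = 4, N2 = 5 }; /* hypothetical extents */

/* Original order (n, j, i): the inner loop walks a[j][.] and b[n][.]
   contiguously, and c[j] is invariant in the inner loop. */
void kernel_original(double a[N2][N1], double b[N0][N1], double c[N2]) {
    for (int n = 0; n < N0; n++)
        for (int j = 0; j < N2; j++)
            for (int i = 0; i < N1; i++)
                a[j][i] += b[n][i] * c[j];
}

/* Interchanged order (n, i, j): the inner loop now strides through a
   by N1 elements; only c[j] is accessed contiguously. */
void kernel_interchanged(double a[N2][N1], double b[N0][N1], double c[N2]) {
    for (int n = 0; n < N0; n++)
        for (int i = 0; i < N1; i++)
            for (int j = 0; j < N2; j++)
                a[j][i] += b[n][i] * c[j];
}

/* Returns 1 if both orders produce identical results on sample inputs. */
int kernels_agree(void) {
    double a1[N2][N1] = {{0}}, a2[N2][N1] = {{0}};
    double b[N0][N1], c[N2];
    for (int n = 0; n < N0; n++)
        for (int i = 0; i < N1; i++)
            b[n][i] = n + i + 1;
    for (int j = 0; j < N2; j++)
        c[j] = j + 1;
    kernel_original(a1, b, c);
    kernel_interchanged(a2, b, c);
    return memcmp(a1, a2, sizeof a1) == 0;
}
```

The results are identical because each element of `a` still accumulates over `n` in the same order. Only the memory access pattern changes.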
---
Given these observations, I still think we should disable the LoopCacheAnalysis-based heuristic by default for now. Especially when enabling loop interchange by default, avoiding performance regressions may be more important than increasing the number of applied cases.
That said, I am not strongly committed to pushing in this direction. If others feel that keeping the current behavior is preferable, I am fine with that as well. However, at a glance, the maintenance cost of LoopCacheAnalysis seems relatively high.
What do you think?
> One problem could be that we have stripped out a few thing here and there and thus made things more pessimistic, so it won't be triggering as much as it used to do and may not be showing a lot of uplifts
I don't think this is an issue with the cost model. In my opinion, there are two main reasons why loop interchange is applied less frequently than before:
- The legality check in loop-interchange. I believe some of these limitations will be resolved by #193480 and #193481.
- The capability of DA. From what I have seen, there are inherent limits to relying solely on the analysis capabilities of SCEV for overflow-related checks. This is probably one of the biggest challenges after enabling loop-interchange by default.
https://github.com/llvm/llvm-project/pull/193478