[flang-commits] [flang] [flang] still apply vectorization cost model with IVDEP (PR #180760)

Wed Feb 11 07:48:42 PST 2026

jeanPerier wrote:

> In my testing classic flang does set `!{!"llvm.loop.vectorize.enable", i1 true}` for

@tblah, you are right, sorry, I was looking at code generated by NV/PGI rather than classic flang itself. I see that @kiranchandramohan implemented IVDEP in classic flang here https://github.com/flang-compiler/flang/pull/660.

I still think that this may be an oversight and is not the best behavior for the user, because it leaves no way to tell the compiler "assume the loop is safe but still use the cost model". From my testing, neither nvfortran nor ifx uses IVDEP to override the cost model (for instance, when it is not physically possible to vectorize a loop, ifx warns that it could not do so when the loop has "vector always", but it raises no such warning when only "IVDEP" is present).

Here are some extracts from the existing documentation about IVDEP:

[IFX](https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/ivdep.html): "The optimizer **can** use this information"
[gfortran](https://gcc.gnu.org/onlinedocs/gfortran/IVDEP-directive.html): The purpose of the directive is to tell the compiler that vectorization is safe.
[cray/hpe](https://cpe.ext.hpe.com/docs/latest/cce/man7/ivdep.7.html): "Whether or not IVDEP is used, conditions other than vector dependencies can inhibit vectorization."

None of these documents says that IVDEP forces the compiler to vectorize. In fact, if you read the [IFX documentation for VECTOR ALWAYS](https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/vector-and-novector.html), it is pretty clear that IVDEP and VECTOR ALWAYS should be combined if the user really wants to force the compiler's hand and vectorize a loop by bypassing both the safety and the cost analyses.

In the user application that is motivating my patch, the code uses IVDEP on a loop because it has accesses with offsets, such as `a(i) = a(i+k)`, so some help from the user is needed to guarantee that the accesses at `i` and `i+k` will not conflict. However, vectorizing the loop is only beneficial for REAL(4), and the application can be compiled for either REAL(4) or REAL(8) by tweaking a parameter. Hence, with flang, optimizations actually slow the application down when it is compiled with REAL(8).
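To make the pattern concrete, here is a minimal sketch of such a loop. It is not taken from the user application: the module name `ivdep_sketch`, the subroutine `shift_copy`, and the kind parameter `wp` are hypothetical names chosen for illustration, and the assumption that the caller guarantees `k >= 0` is mine.

```fortran
! Hypothetical sketch, not the user's code.
module ivdep_sketch
  implicit none
  ! Assumed switch: change to kind(1.0d0) to build the REAL(8) variant.
  integer, parameter :: wp = kind(1.0)
contains
  subroutine shift_copy(a, n, k)
    integer, intent(in) :: n, k
    real(wp), intent(inout) :: a(n + k)
    integer :: i
    ! The compiler cannot see the sign of k, so it cannot rule out a
    ! loop-carried dependence. The caller is assumed to pass k >= 0,
    ! and IVDEP passes that safety guarantee on to the compiler.
    !DIR$ IVDEP
    do i = 1, n
      a(i) = a(i + k)
    end do
  end subroutine shift_copy
end module ivdep_sketch
```

With `k >= 0` the loop gives the same result whether or not it is vectorized, so the directive only needs to settle the safety question; whether vectorizing pays off for a given `wp` is a separate decision.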

While I cannot share that code, here is a mini test showing that not being able to say "assume the loop is safe but still use the cost model" leads to worse performance. Attached is a small benchmark, [ivdep_benchmark.tar.gz](https://github.com/user-attachments/files/25237991/ivdep_benchmark.tar.gz), that contains a kernel that cannot be vectorized without IVDEP because of potentially conflicting accesses (`a(i+offset)`). The kernel by itself is also not profitable to vectorize because it contains a branch with an expensive path. However, inlining may allow the branch to be removed. With our current implementation of IVDEP, a user who knows the accesses are safe has no way to get the inlined kernels vectorized while leaving the ones that could not be inlined unvectorized.

The kernel in Kernel.F90 looks like:
```
  do i = 1, n
     if (a(i+offset) > 0.99 .and.cdt) then
        a(i) = sin(a(i)) * cos(c(i)) + exp(b(i))  ! Expensive branch (rare case)
     else
        a(i) = b(i)*a(i+offset)  ! Cheap branch (common case)
     end if
  end do
```
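For readers who do not open the archive, the following sketch shows one way the macros used in the runs below (`USE_IVDEP`, `USE_VECTOR_ALWAYS`) could guard the directives around this loop. The module name, the subroutine signature, and the array declarations are assumptions for illustration; the real Kernel.F90 may be organized differently. The file must be preprocessed (for example by using the `.F90` extension).

```fortran
! Hypothetical wrapper around the loop shown above.
module kernel_sketch
  implicit none
contains
  subroutine kernel(a, b, c, n, offset, cdt)
    integer, intent(in) :: n, offset
    real, intent(inout) :: a(n + offset)
    real, intent(in) :: b(n), c(n)
    logical, intent(in) :: cdt
    integer :: i
#ifdef USE_IVDEP
    ! Safety assertion only: ignore assumed vector dependencies.
    !DIR$ IVDEP
#endif
#ifdef USE_VECTOR_ALWAYS
    ! Profitability override: vectorize whatever the cost model says.
    !DIR$ VECTOR ALWAYS
#endif
    do i = 1, n
      if (a(i + offset) > 0.99 .and. cdt) then
        a(i) = sin(a(i)) * cos(c(i)) + exp(b(i))  ! Expensive branch
      else
        a(i) = b(i) * a(i + offset)               ! Cheap branch
      end if
    end do
  end subroutine kernel
end module kernel_sketch
```

When `cdt` is a compile-time `.false.` after inlining, the expensive branch can be dropped and only the cheap multiply remains, which is the case where vectorization helps.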

Here are the results measured on an x86-64 Zen 2 machine, running the benchmark with `./run.sh flang "-march=native"` without this patch:

```
---------------------------------------------------
Running configuration: O3 (-O3 )
non_beneficial_vectorization:  1.8908422  seconds
get_rid_of_branch_after_inlining:  .68996453  seconds
---------------------------------------------------
Running configuration: O3_IVDEP (-O3 -DUSE_IVDEP)
non_beneficial_vectorization: 11.062648  seconds
get_rid_of_branch_after_inlining: .37085533
---------------------------------------------------
Running configuration: O3_VEC_ALWAYS (-O3 -DUSE_VECTOR_ALWAYS)
non_beneficial_vectorization: 1.9178097  seconds
get_rid_of_branch_after_inlining:  .8120601  seconds
---------------------------------------------------
Running configuration: O3_IVDEP_VEC_ALWAYS (-O3 -DUSE_IVDEP -DUSE_VECTOR_ALWAYS)
non_beneficial_vectorization: 11.073077  seconds
get_rid_of_branch_after_inlining: .37607098  seconds
```

As you can see, vectorizing the kernel that was not inlined, and where the branch was not pruned, causes a huge penalty (from 2s to 11s). However, vectorizing the case where the branch is pruned is beneficial (0.7s to 0.4s).

However, with the current implementation there is no way to express that to the compiler with IVDEP, and using it results in a total time of 11+0.4=11.4s, a big slowdown compared to the O3 base case at 1.9+0.7=2.6s.

With the current patch, the IVDEP run takes 1.9+0.4=2.3s, a 13% speed-up over the O3 base case.
```
Running configuration: O3_IVDEP (-O3 -DUSE_IVDEP)
non_beneficial_vectorization: 1.8979993  seconds
get_rid_of_branch_after_inlining: .36635113
```
The other cases are unchanged.

So to me, the current implementation is not in line with at least ifx/NV and with the IVDEP documentation in general, and it is also not the one that gives the user the most flexibility.

https://github.com/llvm/llvm-project/pull/180760