[llvm] [RFC][llvm] Added llvm.loop.vectorize.reassociate_fpreductions.enable metadata. (PR #141685)
    Slava Zakharin via llvm-commits 
    llvm-commits at lists.llvm.org
       
    Thu Sep  4 18:58:37 PDT 2025
    
    
  
================
@@ -7593,6 +7593,36 @@ Note that setting ``llvm.loop.interleave.count`` to 1 disables interleaving
 multiple iterations of the loop. If ``llvm.loop.interleave.count`` is set to 0
 then the interleave count will be determined automatically.
 
+'``llvm.loop.vectorize.reassociate_fpreductions.enable``' Metadata
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This metadata selectively allows or disallows reassociating floating-point
+reductions, which otherwise may be unsafe to reassociate, during loop
+vectorization. For example, a floating-point ``ADD`` reduction without
+``reassoc`` fast-math flags may be vectorized provided that this metadata
+allows it. The first operand is the string
+``llvm.loop.vectorize.reassociate_fpreductions.enable``
+and the second operand is a bit. If the bit operand value is 1 unsafe
+reduction reassociations are enabled. A value of 0 disables unsafe
+reduction reassociations.
+
+Note that the reassociation of floating-point reductions that is allowed
+by other means is considered safe, so this metadata is a no-op
+in such cases.
+
+For example, reassociation of a floating-point reduction
+in a loop with ``!{!"llvm.loop.vectorize.enable", i1 1}`` metadata is allowed
+regardless of the value of
+``llvm.loop.vectorize.reassociate_fpreductions.enable``.
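As a sketch (not part of the patch text itself), a loop carrying this metadata could look like the following, mirroring how other `llvm.loop.*` hints are attached per the LangRef conventions:
```llvm
loop:
  ; ... loop body containing a floating-point reduction ...
  br i1 %exitcond, label %exit, label %loop, !llvm.loop !0

!0 = distinct !{!0, !1}
; allow reassociating the reduction for vectorization purposes
!1 = !{!"llvm.loop.vectorize.reassociate_fpreductions.enable", i1 1}
```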
----------------
vzakhari wrote:
Sorry for the long delay.  I finally found time to get back to this.  I promised to show how the NVHPC compiler behaves, and I have some details now.
Nvfortran has an option `-Mvect=assoc/noassoc` that allows/disallows vectorizing FP reductions.  Nvfortran may not be the best example of how a mix of different options behaves under cross-module inlining, because it appears to simply rely on whatever options are in effect during the compilation that happens after the cross-module function inlining.
I tried the following example:
callee.f90:
```
subroutine inner(y,s)
  real :: y(*), s
  do j=1,100
     s=s+y(j)
  end do
end subroutine inner
```
caller.f90:
```
subroutine test(x,y,s)
  interface
     subroutine inner(y,s)
       real :: y(*), s
     end subroutine inner
  end interface
  real :: x(*), y(*), s
  do i=1,100
     call inner(y,s)
     s=s+x(i)
  end do
end subroutine test
```
The first step is to create an inlining "library" for callee.f90: `nvfortran -cpp -O3 callee.f90 -Minfo=all -Mvect=assoc/noassoc -c -Mextract=lib:reductions`
The second step is to use the inlining "library" during the compilation of caller.f90: `nvfortran -cpp -O3 caller.f90 -Minfo=all -Mvect=assoc/noassoc -Minline=lib:reductions -c`
Regardless of which `-Mvect=assoc/noassoc` value is used during the first step, the vectorization decision is based on the option value used during the second step, i.e. `-Mvect=assoc` results in the inner loop being vectorized, and `-Mvect=noassoc` disables vectorization.
Besides the reordering of the reduction computations, nvfortran does not apply any other FP math reassociations.
The most common use case I anticipate for NVHPC users is that most of the code is compiled with FP reduction reassociation allowed, while some accuracy-critical loops with reductions need to be compiled without it. One way to do this is to extract such loops into separate functions/modules and compile them without reduction reassociation. Then, after cross-module inlining, the reduction computations within these loops must not be reassociated (even if they are loops with constant trip counts that may be completely unrolled and end up inside outer loops of a caller compiled with the more relaxed reduction behavior).
In this usage model, it is expected that the metadata is set to either 1 or 0 for all the loops, but how can we define the metadata merging rules?
For correctness, it sounds like the inner loops should keep their `0` value even when completely unrolled, so `0` (or the absence of metadata) should propagate outwards and override any `1` on the outer loops, while `1` cannot propagate outward and override an outer `0` (or the absence of metadata).
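A minimal sketch of that merging rule in Python (the function name and the use of `None` for absent metadata are my own, purely for illustration):
```python
def merge_reassoc_enable(outer, inner):
    """Merge 'reassociate_fpreductions.enable' values when an inner loop
    is folded into an outer one (e.g. by full unrolling).

    Values: 1 = reassociation allowed; 0 or None (metadata absent) = not
    allowed. The conservative rule: 0/absent on either loop wins, so an
    inner loop's stricter setting is never relaxed by an outer loop's 1.
    """
    return 1 if outer == 1 and inner == 1 else 0

# An inner 0 (or absent metadata) overrides an outer 1, and vice versa:
assert merge_reassoc_enable(1, 0) == 0
assert merge_reassoc_enable(1, None) == 0
assert merge_reassoc_enable(None, 1) == 0
assert merge_reassoc_enable(0, 1) == 0
assert merge_reassoc_enable(1, 1) == 1
```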
I am not sure where such metadata propagation can be done reliably, given that different passes may perform function inlining.  It does not seem feasible to require that the metadata propagation be rerun after every pass that may change the loop nesting.  Could this be done in the vectorizer itself, by querying the whole loop nest in which the loop being vectorized is located?
You brought up a great point, and I do not know how to address it properly.
I am now wondering if the approach suggested during the vectorizer meeting is more viable: someone (sorry, I do not remember the name of the person) suggested a FastMathFlag attached to FP operations that would allow their reassociation only when it is required for vectorizing reductions.  It sounds more consistent, but maybe someone can find drawbacks in it as well.
I think I need to collect more performance and correctness data before pushing this forward, and the LTO aspect is not something I am concerned about right now.  Would it be acceptable to add an engineering option that allows reduction reassociation, so that I can experiment with multiple benchmarks and bring back some factual data? (This was one of the suggestions during the vectorizer meeting as well.)
https://github.com/llvm/llvm-project/pull/141685
    
    
More information about the llvm-commits mailing list