[PATCH] D95800: [RISCV] Make scalable vector FMA commutable for register allocation.

Tue Feb 2 10:03:27 PST 2021

craig.topper added inline comments.

================
Comment at: llvm/lib/Target/RISCV/RISCVInstrInfoVPseudos.td:1778
+    // Add a commutable version for use by IR fma.
+    // FIXME: If we didn't have the tail undisturbed policy on the intrinsic
+    // version, we wouldn't need this.
----------------
frasercrmck wrote:
> Is this a FIXME, as in, are we likely to change the policy on the intrinsics?
> 
> Also I'm still a bit unclear on what the tail policy has to do with commuting operands. What am I missing?
We're currently using tail undisturbed policy on any instruction with a sourced tied to a destination.

I believe the example I was shown where someone expected this to work was something like this.

```
float foo(float *src1, float *src2, size_t n) {       
  size_t len;                    
  len = vsetvlmax_e32m8();
  vfloat32m8_t v16 = vfmv_v_f_f32m8(0.0, len);
  len = vsetvl_e32m1();
  vfloat32m1_t v24 = vfmv_s_f_f32m1(vundefined_f32m1(), 0.0, len);
  for (; (len = vl_extract(vsetvl_e32m8(n))) > 0; n -= len) {
    vfloat32m8_t v0 = vle32_v_f32m8(src1, len);
    vfloat32m8_t v0 = vle32_v_f32m8(src2, len);
    v16 = vfmacc_vv_f32m1(v16, v0, v8, len);
    src1 += len;
    src2 += len;                                                                                                                                                                         
  }
  len = vsetvlmax_e32m8();                                                                                                                                                                                                                                                                                                       
  vfloat32m1_t result = vfredosum_vs_f32m8_f32m1(v16, v24, len);                                                                                                                                                                                                                                                                
  return vfmv_f_s_f32m1_f32(result);                                                                                                                                                                                                                                                                                       
}
```

On the last loop iteration, len might be less than vlmax and the code depends on the tail elements of v16 being preserved from the previous iterations. After the loop a reduction is done using vlmax that will access those elements.

If we commute fmacc to fmadd, then the register used for the v16 input will not be tied to the output register used for v16 for the fmacc.  This would prevent the tail elements from being preserved.

I'm not sure we should be allowing this code to work, but tail agnostic is a valid implementation of tail undisturbed. So even if we picked tail agnostic this code might work on in order CPUs and then break in the future on an out of order CPU.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D95800/new/

https://reviews.llvm.org/D95800