[llvm] [RISCV][TTI] Reduce cost of a build_vector pattern (PR #108419)

Wed Sep 18 09:54:06 PDT 2024

lukel97 wrote:

> Just to confirm, you're looking at cycle count right? What routine are you seeing this in? I'm looking at an LTO build of povray, and not seeing any heavy use of the @exp routine - except indirectly through a function pointer table. Is your build -Ofast -flto=auto? Or something else?

This build was with -O3 -mcpu=spacemit-x60. I'll queue up another run with -Ofast and LTO.

One example I found was in `pov::compute_backtrace_texture(float*, pov::Texture_Struct*, double*, double*, pov::Ray_Struct*, double, pov::istk_entry*) (_ZN3povL25compute_backtrace_textureEPfPNS_14Texture_StructEPdS3_PNS_10Ray_StructEdPNS_10istk_entryE)`:

```diff
-       flw     fa5, 36(s1)
-       fld     fs3, %pcrel_lo(.Lpcrel_hi250)(a1)
-       fld     fs4, 0(s10)
-       fcvt.d.s        fa5, fa5
-       fsub.d  fa5, fs3, fa5
-       fneg.d  fa5, fa5
-       fmul.d  fa5, fs4, fa5
-       fdiv.d  fa0, fa5, fs2
-       call    exp
-       flw     fa5, 40(s1)
-       fmv.d   fs1, fa0
-       fcvt.d.s        fa5, fa5
-       fsub.d  fa5, fs3, fa5
-       fneg.d  fa5, fa5
-       fmul.d  fa5, fs4, fa5
-       fdiv.d  fa0, fa5, fs2
-       call    exp
-       vsetivli        zero, 2, e64, m1, ta, ma
-       vfmv.v.f        v8, fs1
-       flw     fa5, 44(s1)
-       vfslide1down.vf v8, v8, fa0
-       vfmul.vf        v8, v8, fs0
-       vsetvli zero, zero, e32, mf2, ta, ma
-       vfncvt.f.f.w    v9, v8
-       csrr    a0, vlenb
-       add     a0, a0, sp
-       addi    a0, a0, 2047
-       addi    a0, a0, 65
-       vs1r.v  v9, (a0)                        # Unknown-size Folded Spill
-       fcvt.d.s        fa5, fa5
-       fsub.d  fa5, fs3, fa5
-       fneg.d  fa5, fa5
-       fmul.d  fa5, fs4, fa5
-       fdiv.d  fa0, fa5, fs2
-       call    exp
-       csrr    a0, vlenb
-       add     a0, a0, sp
-       addi    a0, a0, 2047
-       addi    a0, a0, 65
-       vl1r.v  v9, (a0)                        # Unknown-size Folded Reload
-       fmul.d  fa5, fa0, fs0
-       fcvt.s.d        fs0, fa5
-       vsetivli        zero, 2, e32, mf2, ta, ma
+       flw     fa5, 36(s1)
+       fld     fs4, %pcrel_lo(.Lpcrel_hi250)(s0)
+       fld     fs5, 0(s10)
+       fcvt.d.s        fa5, fa5
+       fsub.d  fa5, fs4, fa5
+       fneg.d  fa5, fa5
+       fmul.d  fa5, fs5, fa5
+       fdiv.d  fa0, fa5, fs2
+       call    exp
+       flw     fa5, 40(s1)
+       fmul.d  fa4, fa0, fs1
+       fcvt.s.d        fs0, fa4
+       fcvt.d.s        fa5, fa5
+       fsub.d  fa5, fs4, fa5
+       fneg.d  fa5, fa5
+       fmul.d  fa5, fs5, fa5
+       fdiv.d  fa0, fa5, fs2
+       call    exp
+       flw     fa5, 44(s1)
+       fmul.d  fa4, fa0, fs1
+       fcvt.s.d        fs3, fa4
+       fcvt.d.s        fa5, fa5
+       fsub.d  fa5, fs4, fa5
+       fneg.d  fa5, fa5
+       fmul.d  fa5, fs5, fa5
+       fdiv.d  fa0, fa5, fs2
+       call    exp
+       fmul.d  fa5, fa0, fs1
```
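For readers who don't want to trace the assembly, here is a rough C++ sketch of the shape of the computation as I read it from the instructions above. It is not taken from the povray sources: the function name `attenuate3` and the parameters `c`, `k`, `d` and `scale` are hypothetical stand-ins for whatever lives in `fs3`/`fs4`, `fs4`/`fs5`, `fs2` and `fs0`/`fs1`.

```cpp
#include <cmath>

// Hypothetical reconstruction from the assembly; names and signature are
// assumptions, not the actual povray code.
void attenuate3(float out[3], const float in[3],
                double c, double k, double d, double scale) {
  for (int i = 0; i < 3; ++i) {
    double x = static_cast<double>(in[i]);   // flw + fcvt.d.s
    double e = std::exp(k * -(c - x) / d);   // fsub.d, fneg.d, fmul.d, fdiv.d, call exp
    out[i] = static_cast<float>(e * scale);  // fmul.d + fcvt.s.d
  }
}
```

In the old output, the first two `exp` results are packed into a 2-element vector for the multiply and narrowing convert, and that vector has to survive the third `call exp`. As far as I know, the standard RISC-V vector calling convention treats all vector registers as caller-saved, which would explain the `vs1r.v`/`vl1r.v` spill and reload. In the new output all three lanes stay scalar and the spill disappears.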

But at the same time, in `pov::do_light(pov::Light_Source_Struct*, double*, pov::Ray_Struct*, pov::Ray_Struct*, double*, float*) (_ZN3povL8do_lightEPNS_19Light_Source_StructEPdPNS_10Ray_StructES4_S2_Pf)`, we actually go in the other direction, from scalar stores to a vectorized sequence:

```diff
-       fneg.d  fa5, fa5
-       fsd     fa5, 24(s0)
-       fneg.d  fa5, fa4
-       fsd     fa5, 32(s0)
+       vsetivli        zero, 2, e64, m1, ta, ma
+       vfmv.v.f        v8, fa5
+       vfslide1down.vf v8, v8, fa4
+       vfneg.v v8, v8
+       vse64.v v8, (s5)
```
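The source pattern here is presumably something like the following sketch, where two negated doubles are stored to adjacent slots (offsets 24 and 32 in the scalar version). The helper name `store_negated_pair` and its signature are hypothetical and only illustrate the shape.

```cpp
// Hypothetical illustration of the pattern; not the actual povray code.
void store_negated_pair(double *dst, double a, double b) {
  dst[0] = -a;  // fneg.d + fsd in the scalar version
  dst[1] = -b;  // fneg.d + fsd in the scalar version
}
```

The scalar version needs four instructions. The vectorized version needs five (`vsetivli`, `vfmv.v.f`, `vfslide1down.vf`, `vfneg.v`, `vse64.v`), since the two scalars first have to be assembled into a vector before the negate and single store.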

Unfortunately, none of the hot *_Intersection methods seem to be affected; instead, it's a large number of cold functions that are slightly perturbed.

I'm really not sure how to interpret these changes. If the rest of the SPEC benchmarks are OK, I would be fine just chalking this up to SLP "noise". 

https://github.com/llvm/llvm-project/pull/108419