[llvm] [RISCV][TTI] Reduce cost of a build_vector pattern (PR #108419)

Thu Sep 19 23:28:59 PDT 2024

lukel97 wrote:

Here's the benchmark diffs with LTO: https://lnt.lukelau.me/db_default/v4/nts/8?show_delta=yes&show_previous=yes&show_stddev=yes&show_mad=yes&show_all=yes&show_all_samples=yes&show_sample_counts=yes&show_small_diff=yes&num_comparison_runs=0&test_filter=&test_min_value_filter=&aggregation_fn=min&MW_confidence_lv=0.05&compare_to=7

The povray regression is gone; the only noticeable change left is a 2.08% regression in 541.leela_r. In the hottest function, `FastState::play_random_move()`, there are a few more places where we now vectorize to a vredor.vs:

```asm
	vsetivli	zero, 0x1, e64, m1, ta, ma
	vmv.s.x	v8, a2
	vsetivli	zero, 0x4, e16, mf4, ta, ma
	vmseq.vi	v0, v8, 0x1
	slliw	a2, t0, 0x8
	vsetvli	zero, zero, e32, mf2, ta, mu
	vmv.v.i	v8, 0x0
	ld	a4, 0x38(sp)
	vle32.v	v8, (a4), v0.t
	slliw	a4, a6, 0x6
	slliw	a5, t1, 0x4
	slliw	a3, a3, 0x2
	vredor.vs	v8, v8, v8
	vmv.x.s	s1, v8
```

We're doing a packed e16 build vector which is then converted to a mask vector.
I think we can avoid the vmv.v.i if we move the masking from the vle32.v to the vredor.vs. But that's orthogonal to this PR, and I don't think the vectorized code is bad per se, so I think this is fine.
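
For reference, here is a rough scalar model of what the vector instructions in the sequence above appear to compute. It is only an illustrative sketch based on reading the asm: the function name `maskedOrReduce` and its parameters are hypothetical and do not come from leela or the PR, and the interleaved scalar `slliw` instructions are left out.

```cpp
#include <array>
#include <cstdint>

// Hypothetical scalar model of the quoted vector sequence (names are
// illustrative only, not taken from the benchmark source).
//   packed: four 16-bit lanes packed into one 64-bit scalar (the vmv.s.x input)
//   src:    the four 32-bit values at the address used by vle32.v
uint32_t maskedOrReduce(uint64_t packed, const std::array<uint32_t, 4> &src) {
  uint32_t acc = 0; // vmv.v.i v8, 0: masked-off lanes stay zero
  for (int i = 0; i < 4; ++i) {
    // Extract e16 element i of the packed build vector.
    uint16_t lane = static_cast<uint16_t>(packed >> (16 * i));
    // vmseq.vi v0, v8, 1: the lane is active only if it equals 1.
    bool active = lane == 1;
    // vle32.v ..., v0.t loads only active lanes; vredor.vs ORs all lanes.
    if (active)
      acc |= src[i];
  }
  return acc; // vmv.x.s s1, v8
}
```

In this model the zero splat only exists so that inactive lanes contribute nothing to the OR, which is why masking the reduction itself would make it unnecessary.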

https://github.com/llvm/llvm-project/pull/108419