[llvm] [VPlan] Add support for in-loop AnyOf reductions (PR #131830)
Luke Lau via llvm-commits
llvm-commits at lists.llvm.org
Wed Mar 19 05:01:29 PDT 2025
lukel97 wrote:
> @lukel97 We previously considered doing this to avoid generating i1 vp.merge, but we ultimately decided against it. In our hardware, vpop takes more cycles in the pipeline, causing snez to idle for too long. Perhaps the results vary depending on the hardware. As I mentioned in [#120405 (comment)](https://github.com/llvm/llvm-project/issues/120405#issuecomment-2569024111), another possible approach is to widen the type of vp.merge, or to retain the original vectorization method, i.e., still using select in the vector loop instead of the or operation, and choose between the two based on TTI.
Thanks for the clarification, it looks like we have a difference in microarchitectures then. On the BPI-F3, the loop in the PR description is about 10% faster with an in-loop reduction vs out-of-loop reduction:
<details><summary>Details</summary>
<p>
```
luke@bananapif3-16gb:~$ perf stat -r10 ./anyof_rdx_test.outofloop 102400000

 Performance counter stats for './anyof_rdx_test.outofloop 102400000' (10 runs):

            405.74 msec task-clock:u          #    0.892 CPUs utilized      ( +- 7.88% )
                 0      context-switches:u    #    0.000 /sec
                 0      cpu-migrations:u      #    0.000 /sec
           100,040      page-faults:u         #  220.458 K/sec
       649,170,965      cycles:u              #    1.431 GHz                ( +- 7.88% )
       285,747,219      instructions:u        #    0.39  insn per cycle     ( +- 0.02% )
         6,417,578      branches:u            #   14.142 M/sec              ( +- 0.00% )
             2,765      branch-misses:u       #    0.04% of all branches    ( +- 2.09% )

            0.4550 +- 0.0320 seconds time elapsed  ( +- 7.03% )

luke@bananapif3-16gb:~$ perf stat -r10 ./anyof_rdx_test.inloop 102400000

 Performance counter stats for './anyof_rdx_test.inloop 102400000' (10 runs):

            361.55 msec task-clock:u          #    0.995 CPUs utilized      ( +- 0.30% )
                 0      context-switches:u    #    0.000 /sec
                 0      cpu-migrations:u      #    0.000 /sec
           100,040      page-faults:u         #  276.211 K/sec
       578,469,553      cycles:u              #    1.597 GHz                ( +- 0.30% )
       234,494,827      instructions:u        #    0.40  insn per cycle     ( +- 0.00% )
         6,417,578      branches:u            #   17.719 M/sec              ( +- 0.00% )
             2,529      branch-misses:u       #    0.04% of all branches    ( +- 1.66% )

           0.36327 +- 0.00108 seconds time elapsed  ( +- 0.30% )
```
</p>
</details>
But since this PR only adds support, should we move the discussion on enabling it/adding a tuning flag to another PR?
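
For context, here is a rough scalar C++ sketch of the two reduction shapes being compared. It is not the loop from the PR description and not what the vectorizer emits: `VF`, `pred`, and both function names are hypothetical, and the fixed-size array only stands in for a vector register. The out-of-loop form keeps a per-lane accumulator and reduces it once after the loop; the in-loop form reduces each compare result to a scalar on every iteration, which is where the per-iteration reduce-then-test cost mentioned in the quoted comment would come from.

```cpp
#include <array>
#include <cassert> // only needed for the checks in the usage note
#include <cstddef>
#include <vector>

// Hypothetical vectorization factor; the real one is chosen by the cost model.
constexpr std::size_t VF = 4;

// Hypothetical predicate standing in for the loop's compare.
static bool pred(int x) { return x < 0; }

// Out-of-loop shape: lane-wise accumulator, one horizontal reduce at the end.
bool anyOfOutOfLoop(const std::vector<int> &a) {
  std::array<bool, VF> acc{}; // stands in for the vector accumulator
  std::size_t i = 0;
  for (; i + VF <= a.size(); i += VF)
    for (std::size_t l = 0; l < VF; ++l)
      acc[l] = acc[l] || pred(a[i + l]); // lane-wise or
  bool r = false;
  for (bool b : acc)
    r = r || b; // single horizontal or-reduce after the loop
  for (; i < a.size(); ++i)
    r = r || pred(a[i]); // scalar tail
  return r;
}

// In-loop shape: each vector compare is reduced to a scalar immediately.
bool anyOfInLoop(const std::vector<int> &a) {
  bool r = false; // scalar accumulator
  std::size_t i = 0;
  for (; i + VF <= a.size(); i += VF) {
    bool any = false;
    for (std::size_t l = 0; l < VF; ++l)
      any = any || pred(a[i + l]); // horizontal or-reduce every iteration
    r = r || any;
  }
  for (; i < a.size(); ++i)
    r = r || pred(a[i]); // scalar tail
  return r;
}
```

Both functions should agree on every input, e.g. `assert(anyOfInLoop({1, 2, 3, 4, 5, -6}) == anyOfOutOfLoop({1, 2, 3, 4, 5, -6}));`. Which shape is faster depends on how expensive the per-iteration reduce is on the target, which is the point under discussion.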
https://github.com/llvm/llvm-project/pull/131830