[llvm] [VPlan] Add support for in-loop AnyOf reductions (PR #131830)

Wed Mar 19 05:01:29 PDT 2025

lukel97 wrote:

> @lukel97 We previously considered doing this to avoid generating i1 vp.merge, but we ultimately decided against it. In our hardware, vpop takes more cycles in the pipeline, causing snez to idle for too long. Perhaps the results vary depending on the hardware. As I mentioned in [#120405 (comment)](https://github.com/llvm/llvm-project/issues/120405#issuecomment-2569024111), another possible approach is to widen the type of vp.merge, or to retain the original vectorization method (i.e., still using select in the vector loop instead of the or operation) and choose between the two based on TTI.

Thanks for the clarification. It looks like we have a difference in microarchitectures then. On the BPI-F3, the loop in the PR description is about 10% faster (by cycle count) with an in-loop reduction than with an out-of-loop reduction:

<details><summary>Details</summary>
<p>

```
luke@bananapif3-16gb:~$ perf stat -r10 ./anyof_rdx_test.outofloop 102400000

 Performance counter stats for './anyof_rdx_test.outofloop 102400000' (10 runs):

            405.74 msec task-clock:u                     #    0.892 CPUs utilized            ( +-  7.88% )
                 0      context-switches:u               #    0.000 /sec                   
                 0      cpu-migrations:u                 #    0.000 /sec                   
           100,040      page-faults:u                    #  220.458 K/sec                  
       649,170,965      cycles:u                         #    1.431 GHz                      ( +-  7.88% )
       285,747,219      instructions:u                   #    0.39  insn per cycle           ( +-  0.02% )
         6,417,578      branches:u                       #   14.142 M/sec                    ( +-  0.00% )
             2,765      branch-misses:u                  #    0.04% of all branches          ( +-  2.09% )

            0.4550 +- 0.0320 seconds time elapsed  ( +-  7.03% )

luke@bananapif3-16gb:~$ perf stat -r10 ./anyof_rdx_test.inloop 102400000

 Performance counter stats for './anyof_rdx_test.inloop 102400000' (10 runs):

            361.55 msec task-clock:u                     #    0.995 CPUs utilized            ( +-  0.30% )
                 0      context-switches:u               #    0.000 /sec                   
                 0      cpu-migrations:u                 #    0.000 /sec                   
           100,040      page-faults:u                    #  276.211 K/sec                  
       578,469,553      cycles:u                         #    1.597 GHz                      ( +-  0.30% )
       234,494,827      instructions:u                   #    0.40  insn per cycle           ( +-  0.00% )
         6,417,578      branches:u                       #   17.719 M/sec                    ( +-  0.00% )
             2,529      branch-misses:u                  #    0.04% of all branches          ( +-  1.66% )

           0.36327 +- 0.00108 seconds time elapsed  ( +-  0.30% )
```

</p>
</details> 
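
For readers without the PR description at hand, here is a minimal scalar sketch of the difference between the two strategies being timed. It is not the loop from the PR and not what LLVM emits. The function names, the `needle` comparison and the fixed `VF` of 4 are all assumptions made purely for illustration, and each "vector" step is emulated with an inner lane loop.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical vectorization factor, chosen only for illustration.
constexpr std::size_t VF = 4;

// Scalar any-of reduction: true if any element equals `needle`.
bool anyof_scalar(const std::vector<int> &a, int needle) {
  bool found = false;
  for (int x : a)
    if (x == needle)
      found = true;
  return found;
}

// Out-of-loop style: carry a VF-wide mask accumulator across iterations
// and collapse it to a scalar once, after the vector loop.
bool anyof_out_of_loop(const std::vector<int> &a, int needle) {
  std::array<bool, VF> acc{}; // all lanes start as false
  std::size_t i = 0;
  for (; i + VF <= a.size(); i += VF)
    for (std::size_t lane = 0; lane < VF; ++lane)
      acc[lane] = acc[lane] || (a[i + lane] == needle);
  bool found = false;
  for (bool lane : acc) // single reduction after the loop
    found = found || lane;
  for (; i < a.size(); ++i) // scalar remainder
    found = found || (a[i] == needle);
  return found;
}

// In-loop style: reduce each iteration's compare mask to a scalar
// immediately and OR it into a scalar accumulator.
bool anyof_in_loop(const std::vector<int> &a, int needle) {
  bool found = false;
  std::size_t i = 0;
  for (; i + VF <= a.size(); i += VF) {
    bool any = false; // per-iteration reduction of the mask
    for (std::size_t lane = 0; lane < VF; ++lane)
      any = any || (a[i + lane] == needle);
    found = found || any;
  }
  for (; i < a.size(); ++i) // scalar remainder
    found = found || (a[i] == needle);
  return found;
}
```

All three functions return the same result for any input. The trade-off under discussion is only where the mask-to-scalar reduction happens: once after the loop, at the cost of carrying a mask vector across iterations, or on every iteration, at the cost of a per-iteration reduction.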

But since this PR only adds support, should we move the discussion on enabling it or adding a tuning flag to another PR?

https://github.com/llvm/llvm-project/pull/131830