[llvm] [ModuloSchedule] Implement modulo variable expansion for pipelining (PR #65609)

Fri Apr 26 02:03:07 PDT 2024

ytmukai wrote:

We have also evaluated other benchmarks and will share the results.

In a hotspot kernel in [ExaMiniMD](https://github.com/ECP-copa/ExaMiniMD), pipelining with MVE is effective.

ExaMiniMD is a molecular dynamics benchmark in [ECP Proxy Applications](https://proxyapps.exascaleproject.org/). The [kernel](https://github.com/ECP-copa/ExaMiniMD/blob/3264e29e28e7a5a4695a959c12df6bdf03bada34/src/force_types/force_lj_neigh_impl.h#L178) calculates the Lennard-Jones (LJ) potential, a kind of pair potential.

The measurement was taken on a single Neoverse V1 core. The kernel accounts for approximately 50% of the total execution time. The number of cycles spent in the kernel, measured with perf record, is as follows (the fastest II is selected for each configuration):

| NOSWP | SWP(II=25) | SWP+MVE(II=17) |
| --- | --- | --- |
| 15.2e9 | 15.3e9 | 14.2e9 |

(Compile flags: -Ofast -mrecip -g -mcpu=neoverse-v1 -mllvm -sve-gather-overhead=1 -mllvm -sve-scatter-overhead=1 -mllvm -aarch64-enable-pipeliner=1 -mllvm -pipeliner-max-stages=100 -mllvm -pipeliner-max-mii=100 -mllvm -pipeliner-enable-copytophi=0 -mllvm -pipeliner-mve-cg=0or1 -mllvm -pipeliner-force-ii=N -fno-unroll-loops)

Pipelining with MVE reduced the execution time of the kernel by 6.7% (3.4% of the total execution time).
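For reference, the relationship between the kernel-level and whole-program figures can be sketched from the table values. This is only a rough check: it assumes the baseline is the NOSWP column and that the kernel is exactly 50% of the run time (the measurement above says "approximately 50%"). Because the table values are rounded, it gives about 6.6% and 3.3%, close to the reported 6.7% and 3.4%.

```python
# Cycle counts copied from the table above (rounded values).
cycles = {
    "NOSWP": 15.2e9,
    "SWP(II=25)": 15.3e9,
    "SWP+MVE(II=17)": 14.2e9,
}


def reduction_pct(baseline: float, optimized: float) -> float:
    """Percentage reduction of `optimized` relative to `baseline`."""
    return (baseline - optimized) / baseline * 100.0


# Assumption: the reduction is measured against the NOSWP column.
kernel_gain = reduction_pct(cycles["NOSWP"], cycles["SWP+MVE(II=17)"])

# Assumption: the kernel is exactly half of the total run time.
kernel_share = 0.5
total_gain = kernel_gain * kernel_share

print(f"kernel: {kernel_gain:.1f}%  total: {total_gain:.1f}%")
```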

We also evaluated other benchmarks in ECP Proxy Applications, CORAL-2 Benchmarks, and SPEC, but found no other hotspot kernels for which pipelining with MVE was effective. This was mostly because pipelining was inherently ineffective for those kernels, for example because they contain only a small number of instructions. Some kernels could benefit if improvements to other optimizations (such as alias analysis, inline expansion of math functions, and template optimizations) made pipelining applicable.

https://github.com/llvm/llvm-project/pull/65609