[llvm] [ModuloSchedule] Implement modulo variable expansion for pipelining (PR #65609)

Fri Apr 12 08:14:06 PDT 2024

ytmukai wrote:

I have updated the patch with interface fixes and an AArch64 implementation. Could you please restart the review?

* Replace the parameter `RegMap` with `LastStage0Insts` in `createRemainingIterationsGreaterCondition()`. These maps are used to obtain the loop counter value at that point. We found that `RegMap` is not sufficient if the register is defined by a PHI, so it was changed to an instruction map.
* Implement the MVE interfaces for AArch64. The main part is determining whether the loop ends in the unrolled kernel. The implementation generates comparison instructions for each of the next iterations, up to the unroll count, and accumulates the results with CINC instructions. Although this generates many instructions, it can handle many forms of loops, and the overhead is not a significant problem for the large kernels where pipelining is effective.
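
To illustrate the idea in the second item, here is a minimal scalar model of the check, not the actual AArch64 code. It assumes a do-while loop of the form `do { ...; i += step; } while (i != bound);`. The function name and parameters are hypothetical. One comparison is made per upcoming iteration and the results are accumulated, which is the role CINC plays in the generated code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar model (not LLVM code). Assumed loop shape:
//   do { body; i += step; } while (i != bound);
// `counter` is the value of `i` before the next iteration runs.
// Returns true when more than `unrollCount` iterations remain.
bool remainingIterationsGreater(uint64_t counter, uint64_t step,
                                uint64_t bound, unsigned unrollCount) {
  assert(step != 0 && "the counter must change on every iteration");
  unsigned endCount = 0;   // accumulator, stands in for the CINC result
  uint64_t next = counter;
  for (unsigned k = 0; k < unrollCount; ++k) {
    next += step;                    // counter value after one more iteration
    bool loopEnds = (next == bound); // exit comparison for that iteration
    endCount += loopEnds ? 1u : 0u;  // increment only when the loop would end
  }
  // No exit among the next `unrollCount` iterations: more iterations remain.
  return endCount == 0;
}
```

Every intermediate comparison is kept because, with an equality-based exit condition, checking only the last counter value could miss an exit that happens earlier. For example, with `counter = 0`, `step = 1`, `bound = 5`, five iterations remain, so the check succeeds for an unroll count of 4 and fails for 5.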

I tested the performance with llvm-test-suite on a Neoverse V1 processor. The following table lists the test cases where pipelining was most effective. All results are uploaded as [swp.performance.llvm-test-suite.csv](https://github.com/llvm/llvm-project/files/14960352/swp.performance.llvm-test-suite.csv).

| test case                                                        | NOSWP |  SWP | SWP+MVE | SWP+faster | speedup by SWP(faster) | speedup by MVE |
| ---------------------------------------------------------------- | ----: | ---: | ------: | ---------: | ---------------------: | -------------: |
| SingleSource/Benchmarks/Misc/flops-3.test                        |  0.67 | 0.82 |    0.54 |       0.54 |                  24.9% |          53.2% |
| MultiSource/Benchmarks/TSVC/Recurrences-flt/Recurrences-flt.test |  2.59 | 2.61 |    2.24 |       2.24 |                  15.6% |          16.5% |
| MultiSource/Benchmarks/TSVC/Recurrences-dbl/Recurrences-dbl.test |  2.62 | 2.63 |    2.28 |       2.28 |                  14.7% |          15.1% |
| SingleSource/Benchmarks/Misc/flops-1.test                        |  0.54 | 0.68 |    0.47 |       0.47 |                  13.6% |          43.1% |
| SingleSource/Benchmarks/Adobe-C++/loop_unroll.test               |  0.46 | 0.41 |    0.41 |       0.41 |                  12.7% |           0.0% |
| SingleSource/Benchmarks/Misc/flops.test                          |  2.66 | 2.79 |    2.41 |       2.41 |                  10.3% |          15.7% |

Optimization flags:
* NOSWP: -O3 -mcpu=neoverse-v1
* SWP: NOSWP + -mllvm -aarch64-enable-pipeliner -mllvm -pipeliner-max-stages=100 -mllvm -pipeliner-max-mii=100 -mllvm -pipeliner-enable-copytophi=0 -mllvm -pipeliner-register-pressure
* SWP+MVE: SWP + -mllvm -pipeliner-mve-cg
* SWP+faster: the faster of SWP and SWP+MVE for each case.
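
The definition of the speedup columns is not stated above. A minimal sketch, assuming they are computed as `base / new - 1`: NOSWP against SWP+faster for "speedup by SWP(faster)", and SWP against SWP+MVE for "speedup by MVE". The function name is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Assumed definition (not confirmed by the table's author):
// relative speedup of `newTime` over `baseTime`, in percent.
double speedupPercent(double baseTime, double newTime) {
  assert(newTime > 0.0 && "the new measurement must be positive");
  return (baseTime / newTime - 1.0) * 100.0;
}
```

With the rounded values for Recurrences-flt, `speedupPercent(2.59, 2.24)` gives about 15.6, which matches the table. Some other rows differ by a few tenths of a percent, presumably because the percentages were computed from unrounded measurements.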

In many of the above cases, the effect of MVE is significant, and the CSV shows that it is better in general. I believe that MVE should be the default on AArch64.

Note: Cases with unstable runtimes and micro-benchmarks are excluded. The micro-benchmarks are mostly the same and are not suitable for pipelining because their kernels are very small.

https://github.com/llvm/llvm-project/pull/65609