[llvm] [ModuloSchedule] Implement modulo variable expansion for pipelining (PR #65609)

Tue Nov 21 05:38:37 PST 2023

ytmukai wrote:

To verify the effect of MVE in more detail, I tested it on a scientific computation loop where software pipelining should be beneficial. The tested code is the following loop from [IAMR](https://proxyapps.exascaleproject.org/app/iamr/), one of the ECP proxy applications, with some modifications.
https://github.com/AMReX-Codes/amrex/blob/9e35dc19489dc5d312e92781cb0471d282cf8370/Src/LinearSolvers/MLMG/AMReX_MLNodeLap_2D_K.H#L584
(The modifications include removing branches and adjusting the number of iterations so that accesses hit the L1 cache, which makes the effect of the optimization easier to check. As a result, the conditions may differ from those of the actual application. I am working on making the modified code publicly available.)

The modified code was measured on Graviton3 (Neoverse V1), and the results are as follows.

| -pipeliner-force-ii=N | cycles (no mve) | instructions (no mve) | cycles (mve) | instructions (mve) | cycles (no swpl) | instructions (no swpl) |
| --------------------: | --------------: | --------------------: | -----------: | -----------------: | ---------------: | ---------------------: |
|                     - |               - |                     - |            - |                  - |             18.0 |                   62.6 |
|              11 (MII) |            29.3 |                 134.4 |         19.6 |              100.7 |                - |                      - |
|                    12 |            22.1 |                 119.0 |         18.1 |               95.1 |                - |                      - |
|                    13 |            20.0 |                 103.8 |         17.6 |               89.8 |                - |                      - |
|                    14 |            17.8 |                  92.1 |         17.1 |               86.4 |                - |                      - |
|                    15 |            17.3 |                  85.3 |         16.6 |               80.4 |                - |                      - |
|                    16 |            18.4 |                  89.1 |         16.4 |               75.0 |                - |                      - |
|                    17 |            17.5 |                  80.3 |         15.6 |               67.6 |                - |                      - |
|                    18 |            17.2 |                  76.4 |         15.7 |               65.6 |                - |                      - |
|                    19 |            16.6 |                  70.4 |     __15.5__ |               64.1 |                - |                      - |
|                    20 |            16.5 |                  69.5 |         15.7 |               64.6 |                - |                      - |
|                    21 |            16.6 |                  69.5 |         15.7 |               64.6 |                - |                      - |
|                    22 |        __15.9__ |                  67.6 |         15.7 |               64.6 |                - |                      - |
|                    23 |            16.5 |                  67.5 |         16.1 |               64.6 |                - |                      - |
|                    24 |            16.5 |                  67.6 |         16.0 |               64.6 |                - |                      - |
|                    25 |            16.5 |                  66.5 |         16.2 |               64.6 |                - |                      - |

The table compares the number of cycles and instructions per iteration with and without MVE for each scheduled II. Comparing the fastest result of each configuration (15.9 cycles without MVE at II=22, 15.5 cycles with MVE at II=19), MVE reduces the number of cycles by about 3%. Compared to the case without pipelining (18.0 cycles), the reduction is about 14%.
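For reference, the percentages above can be re-derived from the table. This is only a minimal sketch: the helper name `reduction_pct` is made up for illustration, and the inputs are the best per-iteration cycle counts copied from the table.

```python
def reduction_pct(baseline: float, improved: float) -> float:
    """Relative reduction of `improved` versus `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline

# Best cycles per iteration, taken from the table above.
best_no_mve = 15.9  # pipelined without MVE, II=22
best_mve = 15.5     # pipelined with MVE, II=19
no_swpl = 18.0      # no software pipelining

mve_vs_no_mve = reduction_pct(best_no_mve, best_mve)
mve_vs_no_swpl = reduction_pct(no_swpl, best_mve)

print(f"MVE vs. no MVE:   {mve_vs_no_mve:.1f}%")
print(f"MVE vs. no swpl:  {mve_vs_no_swpl:.1f}%")
```

The two values round to the 3% and 14% quoted above.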

The number of floating-point instructions per iteration of the loop is 43 (division is replaced by a reciprocal approximation via -mrecip) and Neoverse V1 has four arithmetic units, so ResMII is 11. At small II, performance degrades because of spills caused by a shortage of registers.
This result indicates that the number of registers should also be considered when determining II. My colleague @kasuga-fj is working on this issue (https://discourse.llvm.org/t/considering-register-pressure-when-deciding-initiation-interval-in-machinepipeliner/74725).
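The ResMII value above is a ceiling division of the instruction count by the number of units. A minimal sketch, under the simplifying assumptions that all 43 floating-point instructions compete for the same four units and each occupies a unit for one cycle (the function name `res_mii` is hypothetical and this is not the MachinePipeliner implementation):

```python
import math

def res_mii(num_ops: int, num_units: int) -> int:
    """Lower bound on II imposed by a single class of functional units."""
    return math.ceil(num_ops / num_units)

# 43 floating-point instructions per iteration, 4 arithmetic units.
print(res_mii(43, 4))  # prints 11
```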

I would appreciate any advice for further verification.

https://github.com/llvm/llvm-project/pull/65609