[llvm] [LV][AArch64] Prefer Fixed over Scalable if cost-model is equal (Neoverse V2) (PR #95819)

Tue Jun 18 03:06:09 PDT 2024

sjoerdmeijer wrote:

This is actually quite a complicated story: it is a combination of a few micro-architectural factors and (SVE) codegen factors. To introduce the problem, we have a number of examples similar to this:

     for (int i = 0; i < 32000/2; i++) {
            a[i+k] = a[i] + b[i];
     }
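
For anyone who wants to reproduce the comparison, a self-contained version of such a loop might look like the sketch below. The function name, the parameter list, and the `float` element type are assumptions (the element type is inferred from the `.4s` and `.s` `fadd` forms in the listings); the original test case may be declared differently.

```c
#include <assert.h>

/* Hypothetical reproducer, not taken from the PR: a[] must hold at least
   n + k elements and b[] at least n elements. */
void add_shifted(float *a, const float *b, int n, int k) {
    assert(n >= 0 && k >= 0);
    for (int i = 0; i < n; i++) {
        /* The store is offset by k elements from the load of a[]. */
        a[i + k] = a[i] + b[i];
    }
}
```

Compiling this with something like `clang -O3 -mcpu=neoverse-v2 -S` should make it possible to inspect which vectorization strategy was chosen, although the exact output depends on the compiler version.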

This is GCC's output, and LLVM's output with this patch:

       .L3:
            ldr     q31, [x20, x0]
            ldr     q30, [x19, x0]
            fadd    v31.4s, v31.4s, v30.4s
            str     q31, [x21, x0]
            add     x0, x0, 16
            cmp     x0, x28
            bne     .L3

Without this patch, LLVM's output is something like this:

       .LBB0_3:
            add     x9, x19, x8, lsl #2
            add     x10, x20, x8, lsl #2
            ld1w    { z0.s }, p0/z, [x19, x8, lsl #2]
            ld1w    { z2.s }, p0/z, [x20, x8, lsl #2]
            add     x8, x8, x21
            ld1w    { z1.s }, p0/z, [x9, x28, lsl #2]
            ld1w    { z3.s }, p0/z, [x10, x28, lsl #2]
            add     x10, x9, x26
            cmp     x8, x22
            fadd    z0.s, z2.s, z0.s
            fadd    z1.s, z3.s, z1.s
            st1w    { z0.s }, p0, [x9, x23, lsl #2]
            st1w    { z1.s }, p0, [x10, x28, lsl #2]
            b.ne    .LBB0_3

There is nothing fundamentally wrong with LLVM's codegen, but it performs a lot worse. 

One of the micro-architectural reasons is documented in section "4.1 Dispatch constraints" of the Neoverse V2 Software Optimization Guide (SWOG):

> The dispatch stage can process up to 8 MOPs per cycle and dispatch up to 16 μOPs per cycle,

The smaller kernels fit within these dispatch constraints; the bigger ones don't, which results in significant performance differences.

Most of the performance can be clawed back by interleaving more, but then there is clearly a code-quality issue: the amount of code needed to get on par with the NEON kernel would be disproportionate.
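
To experiment with these trade-offs on a single loop, Clang's loop pragmas can request a fixed vector width and an interleave count. The sketch below is only an illustration: the function name and the chosen values (width 4, interleave count 2) are assumptions, not settings used in this patch. The pragma is a hint, so the vectorizer still performs its own legality checks, and other compilers will simply ignore it.

```c
#include <assert.h>

/* Hypothetical variant of the loop with per-loop vectorization hints.
   a[] must hold at least n + k elements and b[] at least n elements. */
void add_shifted_fixed(float *a, const float *b, int n, int k) {
    assert(n >= 0 && k >= 0);
    /* Ask for fixed-width (NEON-style) vectors of 4 floats, interleaved
       twice; raise interleave_count to try "interleaving more". */
#pragma clang loop vectorize_width(4, fixed) interleave_count(2)
    for (int i = 0; i < n; i++) {
        a[i + k] = a[i] + b[i];
    }
}
```

Replacing `fixed` with `scalable` in the pragma requests the scalable-vector form instead, which gives a quick way to compare the two code shapes for the same source.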

Two more subjective arguments:
- there is no need to go predicated for these kinds of examples,
- GCC prefers this codegen strategy for the same reasons.

Another factor is the slightly more complicated SVE addressing modes, which also result in more MOPs.

We are investigating other micro-architectural issues, but I cannot comment on these yet.

https://github.com/llvm/llvm-project/pull/95819