[llvm] [LV][AArch64] Prefer Fixed over Scalable if cost-model is equal (Neoverse V2) (PR #95819)

Tue Jun 18 03:23:13 PDT 2024

sjoerdmeijer wrote:

> > There is nothing fundamentally wrong with LLVM's codegen, but it performs a lot worse.
> 
> Does that in fact suggest for this case that we should be interleaving less in LLVM and be tuning that accordingly? GCC's output does use NEON, but there is no interleaving. Have you tried disabling interleaving when using SVE in LLVM so that it's a fairer comparison with GCC? I'd be interested to know how that performs.

Yes, we have also experimented with that. For example, we nano-benchmarked this:

    loop:
       ld1w  {z0.s}, p0/z, [x1, x9, lsl #2]
       ld1w  {z1.s}, p0/z, [x2, x9, lsl #2]
       incw  x9
       cmp   x9, x10
       fadd  z0.s, z1.s, z0.s
       st1w  {z0.s}, p0, [x3, x9, lsl #2]
       b.ne  loop

where we hoisted the adds out of the loop, and this also performs worse.
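
For readers less used to SVE assembly, the loop has roughly the shape of an element-wise float add over three arrays. The C sketch below is only an illustration of that shape. The function name `vec_add` and its signature are hypothetical, and it is an assumption that the benchmarked kernel looks like this at the source level:

```c
#include <stddef.h>

/* Hypothetical scalar form of the kernel: c[i] = a[i] + b[i].
 * 'restrict' tells the compiler the arrays do not overlap, which
 * is what lets a vectorizer turn this into NEON or SVE code. */
void vec_add(const float *restrict a, const float *restrict b,
             float *restrict c, size_t n) {
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

A loop this small has very little work per iteration, so loop overhead and micro-architectural details make up a large share of its run time.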

This is the different micro-architectural reason I was talking about. We do have a suspicion about what this could be, but cannot confirm it yet.

So, I hope I have given enough examples to demonstrate that unexpected performance results do happen. This is not necessarily SVE's fault, but there is a higher chance of hitting micro-architectural issues and secondary effects for this class of small kernels and loops. I think there's also an element of "keeping it simple" to it, if you see what I mean.

https://github.com/llvm/llvm-project/pull/95819