[llvm] [LV][AArch64] Prefer Fixed over Scalable if cost-model is equal (Neoverse V2) (PR #95819)
Sjoerd Meijer via llvm-commits
llvm-commits at lists.llvm.org
Sat Jul 6 01:25:26 PDT 2024
sjoerdmeijer wrote:
As we don't want to base codegen strategies on just one benchmark, here are two examples from production code, @paulwalker-arm, @david-arm, @davemgreen.
The first example is extracted from an HPC app; this is the SVE kernel before the patch:
39204: a5e8438b ld1d {z11.d}, p0/z, [x28, x8, lsl #3]
39208: a5e842cc ld1d {z12.d}, p0/z, [x22, x8, lsl #3]
3920c: a5e8426d ld1d {z13.d}, p0/z, [x19, x8, lsl #3]
39210: 65cb01ab fadd z11.d, z13.d, z11.d
39214: a5e843ce ld1d {z14.d}, p0/z, [x30, x8, lsl #3]
39218: a5e84349 ld1d {z9.d}, p0/z, [x26, x8, lsl #3]
3921c: a5e842aa ld1d {z10.d}, p0/z, [x21, x8, lsl #3]
39220: 65cc01cc fadd z12.d, z14.d, z12.d
39224: a5e8416d ld1d {z13.d}, p0/z, [x11, x8, lsl #3]
39228: a5e841ee ld1d {z14.d}, p0/z, [x15, x8, lsl #3]
3922c: a5e8413f ld1d {z31.d}, p0/z, [x9, x8, lsl #3]
39230: a5e840e8 ld1d {z8.d}, p0/z, [x7, x8, lsl #3]
39234: 65eda169 fmsb z9.d, p0/m, z11.d, z13.d
39238: 65eea18a fmsb z10.d, p0/m, z12.d, z14.d
3923c: 65fe013f fmla z31.d, p0/m, z9.d, z30.d
39240: 65fe0148 fmla z8.d, p0/m, z10.d, z30.d
39244: e5e8431f st1d {z31.d}, p0, [x24, x8, lsl #3]
39248: e5e841a8 st1d {z8.d}, p0, [x13, x8, lsl #3]
3924c: 04b0e3e8 incw x8
39250: eb08023f cmp x17, x8
39254: 54fffd81 b.ne 39204
And here's the NEON kernel after the patch, which is a case where LDPs/STPs help a lot:
39600: ad7fe4f8 ldp q24, q25, [x7, #-16]
39604: ad7fecba ldp q26, q27, [x5, #-16]
39608: f1001318 subs x24, x24, #0x4
3960c: 910080e7 add x7, x7, #0x20
39610: 910080a5 add x5, x5, #0x20
39614: 4e79d779 fadd v25.2d, v27.2d, v25.2d
39618: ad7fdeb6 ldp q22, q23, [x21, #-16]
3961c: ad7fd6f4 ldp q20, q21, [x23, #-16]
39620: 910082f7 add x23, x23, #0x20
39624: 910082b5 add x21, x21, #0x20
39628: 4e78d758 fadd v24.2d, v26.2d, v24.2d
3962c: ad7fee9a ldp q26, q27, [x20, #-16]
39630: 91008294 add x20, x20, #0x20
39634: 4ef7cf3b fmls v27.2d, v25.2d, v23.2d
39638: 4ef6cf1a fmls v26.2d, v24.2d, v22.2d
3963c: 4e7fcf75 fmla v21.2d, v27.2d, v31.2d
39640: 4e7fcf54 fmla v20.2d, v26.2d, v31.2d
39644: ad3fd6d4 stp q20, q21, [x22, #-16]
39648: 910082d6 add x22, x22, #0x20
3964c: 54fffda1 b.ne 39600
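To make the comparison easier to follow, here is a minimal C sketch of the loop shape both kernels implement. All names here are invented from the instruction mix, not taken from the actual HPC source, and the real kernels process two independent streams of this form per iteration:

    for (long i = 0; i < n; ++i) {
        double s = a[i] + b[i];        /* fadd                       */
        double t = d[i] - c[i] * s;    /* fmsb (SVE) / fmls (NEON)   */
        out[i] = e[i] + t * k;         /* fmla, k is loop-invariant  */
    }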
The second production code snippet is a dequantisation kernel:
void foo(unsigned char *__restrict__ A, const unsigned char *__restrict__ B, int N) {
  for (int i = 0; i < N; ++i) {
    A[i * 2] = (unsigned char)(B[i] & 0xf);
    A[i * 2 + 1] = ((unsigned char)(B[i] & 0xf0) >> 4);
  }
}
It's an example where, firstly, SVE predication is unnecessary and, secondly, it leads to code bloat:
.LBB0_9: // =>This Inner Loop Header: Depth=1
add x12, x1, x10
ld1b { z0.b }, p0/z, [x1, x10]
addvl x10, x10, #2
ld1b { z2.b }, p0/z, [x12, #1, mul vl]
lsr z1.b, z0.b, #4
and z0.b, z0.b, #0xf
lsr z3.b, z2.b, #4
and z2.b, z2.b, #0xf
st2b { z0.b, z1.b }, p0, [x11]
st2b { z2.b, z3.b }, p0, [x11, #2, mul vl]
addvl x11, x11, #4
cmp x9, x10
b.ne .LBB0_9
Even if the SVE codegen could be optimised here, it cannot compete with this NEON kernel in either code quality or performance:
.L4:
ldr q31, [x3], 16
and v30.16b, v31.16b, v29.16b
ushr v31.16b, v31.16b, 4
st2 {v30.16b - v31.16b}, [x4], 32
cmp x3, x5
bne .L4
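For reference, here's a minimal sketch of what this kernel looks like written with NEON intrinsics (hypothetical function name, tail handling omitted, assumes N is a multiple of 16); it maps directly onto the ldr/and/ushr/st2 sequence above:

    #include <arm_neon.h>

    void foo_neon(unsigned char *__restrict__ A,
                  const unsigned char *__restrict__ B, int N) {
      uint8x16_t mask = vdupq_n_u8(0x0f);
      for (int i = 0; i < N; i += 16) {
        uint8x16_t b = vld1q_u8(B + i);
        uint8x16x2_t out;
        out.val[0] = vandq_u8(b, mask);  /* low nibble  -> A[2*i]   */
        out.val[1] = vshrq_n_u8(b, 4);   /* high nibble -> A[2*i+1] */
        vst2q_u8(A + 2 * i, out);        /* interleaved store (st2) */
      }
    }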
https://github.com/llvm/llvm-project/pull/95819