<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/71524>71524</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[AArch64] VLA slower than VLS (tsvc, s1111)
</td>
</tr>
<tr>
<th>Labels</th>
<td>
backend:AArch64,
vectorization
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
sjoerdmeijer
</td>
</tr>
</table>
<pre>
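On Grace, Clang-generated code for kernel s1111 in TSVC is about 25% slower than the code generated by GCC 12. The difference seems to be related to Clang generating a VLA (vector-length-agnostic, SVE) loop with scatter stores, while GCC generates a simpler VLS (fixed-length, NEON) loop with per-lane stores.
Compile this input with `-O3 -mcpu=neoverse-v2 -ffast-math`: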
```
__attribute__((aligned(64))) float a[32000],b[32000],c[32000],d[32000],e[32000],
                                   aa[256][256],bb[256][256],cc[256][256],tt[256][256];
int dummy(float[32000], float[32000], float[32000], float[32000], float[32000], float[256][256], float[256][256], float[256][256], float);
float s1111()
{
  for (int nl = 0; nl < 2*100000; nl++) {
    for (int i = 0; i < 32000/2; i++) {
      a[2*i] = c[i] * b[i] + d[i] * b[i] + c[i] * c[i] + d[i] * b[i] + d[i] * c[i];
    }
    dummy(a, b, c, d, e, aa, bb, cc, 0.);
  }
}
```
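One possible way to test the VLA-vs-VLS hypothesis from the Clang side is to ask the loop vectorizer for a fixed-width loop with a loop pragma. The following is only a sketch under assumptions: the array size `N`, the helper names (`init`, `kernel_fixed`, `kernel_plain`, `count_mismatches`) and the requested width of 4 are illustrative, and it is not confirmed that this reproduces GCC's code or recovers the performance.
```
/* Reduced, hypothetical version of the s1111 loop; N and all helper
   names below are illustrative and not part of TSVC. */
enum { N = 64 };
static float a[2 * N], ref[2 * N], b[N], c[N], d[N];

/* Small integer-valued inputs so results are exact even with -ffast-math. */
static void init(void) {
    for (int i = 0; i < N; i++) {
        b[i] = (float)(i % 7);
        c[i] = (float)(i % 5);
        d[i] = (float)(i % 3);
    }
    for (int i = 0; i < 2 * N; i++) {
        a[i] = -1.0f;
        ref[i] = -1.0f;
    }
}

/* Same expression as s1111, with a hint asking for fixed-width
   (non-scalable) vectorization. The pragma is a hint, not a guarantee. */
static void kernel_fixed(void) {
#if defined(__clang__)
#pragma clang loop vectorize_width(4, fixed)
#endif
    for (int i = 0; i < N; i++) {
        a[2 * i] = c[i] * b[i] + d[i] * b[i] + c[i] * c[i] + d[i] * b[i] + d[i] * c[i];
    }
}

/* Same loop without any hint, used as the reference result. */
static void kernel_plain(void) {
    for (int i = 0; i < N; i++) {
        ref[2 * i] = c[i] * b[i] + d[i] * b[i] + c[i] * c[i] + d[i] * b[i] + d[i] * c[i];
    }
}

/* Runs both kernels and returns the number of differing elements. */
int count_mismatches(void) {
    int bad = 0;
    init();
    kernel_fixed();
    kernel_plain();
    for (int i = 0; i < 2 * N; i++) {
        if (a[i] != ref[i]) {
            bad++;
        }
    }
    return bad;
}
```
Comparing the assembly of `kernel_fixed` with the VLA loop would show whether the stride-2 store is still emitted as a scatter when a fixed width is requested.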
Clang's codegen:
```
.LBB0_3: // Parent Loop BB0_2 Depth=1
ld1w { z4.s }, p0/z, [x19, x8, lsl #2]
ld1w { z2.s }, p0/z, [x21, x8, lsl #2]
ld1w { z3.s }, p0/z, [x20, x8, lsl #2]
add x8, x8, x23
cmp x28, x8
fadd z5.s, z4.s, z4.s
fmul z5.s, z3.s, z5.s
fadd z3.s, z3.s, z2.s
fadd z3.s, z3.s, z4.s
lsl z4.d, z0.d, #1
add z0.d, z0.d, z6.d
fmad z2.s, p0/m, z3.s, z5.s
lsl z3.d, z1.d, #1
add z1.d, z1.d, z6.d
uunpklo z5.d, z2.s
uunpkhi z2.d, z2.s
st1w { z5.d }, p1, [x22, z4.d, lsl #2]
st1w { z2.d }, p1, [x22, z3.d, lsl #2]
b.ne .LBB0_3
```
vs. GCC's codegen:
```
.L3:
ldr q29, [x25, x0]
mov x7, x4
add x6, x4, 16
add x5, x4, 24
add x4, x4, 32
ldr q30, [x24, x0]
ldr q28, [x23, x0]
add x0, x0, 16
mov v31.16b, v29.16b
fmla v31.4s, v30.4s, v27.4s
fadd v30.4s, v30.4s, v29.4s
fmul v31.4s, v31.4s, v28.4s
fmla v31.4s, v30.4s, v29.4s
str s31, [x7], 8
st1 {v31.s}[1], [x7]
st1 {v31.s}[2], [x6]
st1 {v31.s}[3], [x5]
cmp x0, x26
bne .L3
```
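Reading the two loops back into C suggests that both compilers refactor the expression under `-ffast-math` into five floating-point instructions per vector, so the arithmetic looks comparable and the main difference is in how the stride-2 store is done. The sketch below is a reading of the assembly, not compiler output: the mapping of registers to `b`, `c` and `d` is inferred (the expression is symmetric in `b` and `d`), `v27` is assumed to hold 2.0 because its initialization is outside the loop shown, and the function names are hypothetical.
```
/* Expression as written in the source. */
static float form_source(float b, float c, float d) {
    return c * b + d * b + c * c + d * b + d * c;
}

/* Apparent Clang form: c * (b + c + d) + b * (2 * d). */
static float form_clang(float b, float c, float d) {
    float t = d + d;        /* fadd z5, z4, z4 */
    t = b * t;              /* fmul z5, z3, z5 */
    float s = b + c;        /* fadd z3, z3, z2 */
    s = s + d;              /* fadd z3, z3, z4 */
    return c * s + t;       /* fmad z2, z3, z5 */
}

/* Apparent GCC form: (c + 2 * d) * b + (d + c) * c. */
static float form_gcc(float b, float c, float d) {
    float t = c + d * 2.0f; /* mov + fmla, assuming v27 == 2.0 */
    float s = d + c;        /* fadd v30, v30, v29 */
    t = t * b;              /* fmul v31, v31, v28 */
    return t + s * c;       /* fmla v31, v30, v29 */
}

/* Checks the three forms on small integer inputs, where every
   intermediate value is exactly representable. Returns 1 if they agree. */
int forms_agree(void) {
    for (int b = 0; b < 8; b++) {
        for (int c = 0; c < 8; c++) {
            for (int d = 0; d < 8; d++) {
                float x = form_source((float)b, (float)c, (float)d);
                float y = form_clang((float)b, (float)c, (float)d);
                float z = form_gcc((float)b, (float)c, (float)d);
                if (x != y || x != z) {
                    return 0;
                }
            }
        }
    }
    return 1;
}
```
If this reading is right, the extra work in the Clang loop is the index arithmetic (`lsl`, `add` on `z0`/`z1`), the `uunpklo`/`uunpkhi` widening and the two scatter stores, versus four per-lane stores in the GCC loop.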
See also:
https://godbolt.org/z/fnfa9eb3E
TODO:
Root cause analysis.
</pre>