<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href="https://github.com/llvm/llvm-project/issues/71523">71523</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [AArch64] VLA slower than VLS (tsvc, s176)
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            backend:AArch64,
            vectorization
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          sjoerdmeijer
      </td>
    </tr>
</table>

<pre>
    Clang generates a VLA-style (vector-length-agnostic) vector loop for this kernel, while GCC generates a VLS (vector-length-specific) one. Clang's code appears to be about 50% slower as a result. To reproduce, compile this input with `-O3 -mcpu=neoverse-v2 -ffast-math`:


```
__attribute__((aligned(64))) float a[32000],b[32000],c[32000],d[32000],e[32000],
 aa[256][256],bb[256][256],cc[256][256],tt[256][256];

int dummy(float[32000], float[32000], float[32000], float[32000], float[32000], float[256][256], float[256][256], float[256][256], float);


float s176()
{
    int m = 32000/2;
    for (int nl = 0; nl < 4*(100000/32000); nl++) {
        for (int j = 0; j < (32000/2); j++) {
            for (int i = 0; i < m; i++) {
                a[i] += b[i+m-j-1] * c[j];
            }
        }
        dummy(a, b, c, d, e, aa, bb, cc, 0.);
    }

}
```

Clang's codegen:

```
.LBB0_5:                                //   Parent Loop BB0_2 Depth=1
        ld1w    { z2.s }, p0/z, [x12]
        ld1b    { z3.b }, p1/z, [x12, x28]
        ld1w    { z4.s }, p0/z, [x11]
        ld1b    { z5.b }, p1/z, [x11, x28]
        add     x12, x12, x23
        subs    x10, x10, x22
        fmad    z2.s, p0/m, z1.s, z4.s
        fmad    z3.s, p0/m, z1.s, z5.s
        st1w    { z2.s }, p0, [x11]
        st1b    { z3.b }, p1, [x11, x28]
        add     x11, x11, x23
        b.ne    .LBB0_5
```

vs. GCC's codegen:

```
.L3:
        ldr     q29, [x8, x0]
        ldr     q31, [x19, x0]
        ldr     q30, [x9, x0]
        fmla    v31.4s, v28.4s, v29.4s
        fmla    v31.4s, v27.4s, v30.4s
        str     q31, [x19, x0]
        add     x0, x0, 16
        cmp     x0, x22
        bne     .L3
```
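
For readers less familiar with the two styles, here is a minimal sketch in plain C of how the loop structure differs. It is a scalar model only, not what either compiler emits: `update_vls`, `update_vla`, and the `vl` parameter are hypothetical names, with `vl` standing in for the hardware vector length that SVE code reads at run time. The lane mask in `update_vla` is one common way to handle the tail and is an assumption here; the Clang loop above may handle it differently.

```c
/* Scalar model of a VLS loop: the width is a compile-time constant
   (4 floats, i.e. one 128-bit NEON register), so the step is fixed. */
static void update_vls(float *a, const float *b, float c, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4)          /* full 4-lane "vectors" */
        for (int k = 0; k < 4; k++)
            a[i + k] += b[i + k] * c;
    for (; i < n; i++)                  /* scalar tail */
        a[i] += b[i] * c;
}

/* Scalar model of a VLA loop: the width vl (hypothetical parameter)
   is only known at run time, so the step is a runtime value. */
static void update_vla(float *a, const float *b, float c, int n, int vl)
{
    for (int i = 0; i < n; i += vl)
        for (int k = 0; k < vl; k++)
            if (i + k < n)              /* assumed lane mask for the tail */
                a[i + k] += b[i + k] * c;
}
```

Both functions compute the same result for any positive `vl`; only the step and the tail handling differ.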

See also:
https://godbolt.org/z/64nhv1o6z

TODO:
Root cause analysis.
</pre>