<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href="https://github.com/llvm/llvm-project/issues/71511">71511</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [AArch64]  Missed vectorisation opportunity (tsvc, s112)
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            backend:AArch64,
            vectorization
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          sjoerdmeijer
      </td>
    </tr>
</table>

<pre>
    With Clang top of tree, we are about 35% behind GCC 12 on our AArch64 platform for the s112 kernel from TSVC: GCC vectorises the kernel, Clang doesn't. Clang seems to consider vectorising this input not worthwhile with -O3 -ffast-math -mcpu=neoverse-v2:

```
__attribute__((aligned(64))) float x[32000];

__attribute__((aligned(64))) float a[32000],b[32000],c[32000],d[32000],e[32000],
 aa[256][256],bb[256][256],cc[256][256],tt[256][256];

int dummy(float[32000], float[32000], float[32000], float[32000], float[32000], float[256][256], float[256][256], float[256][256], float);

float s112()
{
    for (int nl = 0; nl < 3*100000; nl++) {
        for (int i = 32000 - 2; i >= 0; i--) {
            a[i+1] = a[i] + b[i];
        }
        dummy(a, b, c, d, e, aa, bb, cc, 0.);
    }
}
```

Clang's codegen:

```
.LBB0_3: //   Parent Loop BB0_2 Depth=1
        add     x10, x19, x8
        ldr     s0, [x9, #-8]!
        sub     x8, x8, #8
        ldur    s1, [x10, #-4]
        fadd    s2, s1, s0
        ldur    s0, [x9, #-4]
        ldr     s1, [x19, x8]
        fadd    s0, s1, s0
        stp     s0, s2, [x9]
        cbnz    x8, .LBB0_3
```

whereas GCC generates:

```
.L4:
        ldr     q30, [x27, x0]
        ldr     q28, [x20, x0]
        mov     v31.16b, v30.16b
        mov     v29.16b, v28.16b
        tbl     v30.16b, {v30.16b - v31.16b}, v27.16b
        tbl     v28.16b, {v28.16b - v29.16b}, v27.16b
        fadd    v30.4s, v30.4s, v28.4s
        mov     v31.16b, v30.16b
        tbl     v30.16b, {v30.16b - v31.16b}, v27.16b
        str     q30, [x19, x0]
        sub     x0, x0, #16
        cmp     x0, x28
        bne     .L4
```

See https://godbolt.org/z/6cfK9sbj5 for the reproducer (and this codegen).
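
For what it's worth, the inner loop appears to carry only a write-after-read dependence: iteration i reads a[i], and that element is not overwritten until the later iteration i-1. So vectorising should be legal as long as each chunk does all of its loads before any of its stores. A minimal hand-written sketch of that idea in C follows. The function names and the chunk width of 4 are assumptions made for illustration, and this is not what either compiler actually emits:

```c
/* Scalar reference with the same shape as the inner loop of s112. */
void s112_scalar(float *a, const float *b, int n)
{
    for (int i = n - 2; i >= 0; i--)
        a[i + 1] = a[i] + b[i];
}

/* Hypothetical hand-chunked variant: 4 elements per step, high to low.
 * All four sums are computed from loaded values before any store, so the
 * store to a[i], a[i-1], a[i-2] cannot clobber an element this chunk
 * still has to read. */
void s112_chunked(float *a, const float *b, int n)
{
    int i = n - 2;
    for (; i >= 3; i -= 4) {
        float t0 = a[i]     + b[i];
        float t1 = a[i - 1] + b[i - 1];
        float t2 = a[i - 2] + b[i - 2];
        float t3 = a[i - 3] + b[i - 3];
        a[i + 1] = t0;
        a[i]     = t1;
        a[i - 1] = t2;
        a[i - 2] = t3;
    }
    /* Scalar tail for the remaining low elements. */
    for (; i >= 0; i--)
        a[i + 1] = a[i] + b[i];
}
```

Both functions should produce identical contents in a[] for the same inputs, which is what a reversed 4-lane vector load, add, and store would rely on.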


</pre>