<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/71515>71515</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [AArch64] Missed vectorisation opportunity (tsvc, s122)
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            backend:AArch64,
            vectorization
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          sjoerdmeijer
      </td>
    </tr>
</table>

<pre>
    Clang does not vectorise kernel s122 from TSVC, whereas GCC does. As a result, the kernel is about 2x slower when built with Clang. To reproduce, compile this input with `-O3 -ffast-math -mcpu=neoverse-v2`:

```
__attribute__((aligned(64))) float x[32000];

__attribute__((aligned(64))) float a[32000],b[32000],c[32000],d[32000],e[32000],
 aa[256][256],bb[256][256],cc[256][256],tt[256][256];

int dummy(float[32000], float[32000], float[32000], float[32000], float[32000], float[256][256], float[256][256], float[256][256], float);

float s122(int xa, int xb)
{
    int n1 = xa;
    int n3 = xb;
    int j, k;
    for (int nl = 0; nl < 100000; nl++) {
        j = 1;
        k = 0;
        for (int i = n1-1; i < 32000; i += n3) {
            k += j;
            a[i] += b[32000 - k];
        }
        dummy(a, b, c, d, e, aa, bb, cc, 0.);
    }
    return 0.0f; /* value unused; avoids falling off the end of a non-void function */
}
```

Clang's codegen:

```
.LBB0_3:                                //   Parent Loop BB0_2 Depth=1
        ldr     s0, [x8], #-4
        ldr     s1, [x19, x9, lsl #2]
        fadd    s0, s1, s0
        str     s0, [x19, x9, lsl #2]
        add     x9, x9, x21
        cmp     x9, x22
        b.lt    .LBB0_3
```

GCC's codegen:

```
.L6:
        ldr     q30, [x1], -16
        ldr     q29, [x0]
        mov     v31.16b, v30.16b
        tbl     v30.16b, {v30.16b - v31.16b}, v28.16b
        fadd    v31.4s, v29.4s, v30.4s
        str     q31, [x0], 16
        cmp     x22, x0
        bne     .L6
```
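
Reading the GCC loop above, it appears to load four floats of `b` from a descending pointer, permute them with `tbl` (presumably a lane reversal, given the reversed index `32000 - k`), and add them to four consecutive floats of `a`. The sketch below writes that shape out in plain C so the two access patterns can be compared. It rests on assumptions that are not confirmed by the report: that GCC's vector loop is a version specialised for `n3 == 1` (the store post-increments by 16 bytes), and that `v28` holds a reversing permute mask. The names `LEN`, `s122_scalar`, `s122_stride1_by4` and `s122_versions_agree` are hypothetical and exist only for this illustration.

```c
enum { LEN = 32000 };

/* Scalar form of the s122 inner loop: j is always 1, k starts at 0. */
static void s122_scalar(float *a, const float *b, int n1, int n3)
{
    int k = 0;
    for (int i = n1 - 1; i < LEN; i += n3) {
        k += 1;
        a[i] += b[LEN - k];
    }
}

/* Hypothetical unit-stride version (only valid when n3 == 1, n1 >= 1).
 * Each block reads 4 contiguous floats of b and applies them to a in
 * reversed order, which is what a vector load plus a lane-reversing
 * permute would do. */
static void s122_stride1_by4(float *a, const float *b, int n1)
{
    int i = n1 - 1;
    int k = 0; /* number of elements already processed */

    for (; i + 4 <= LEN; i += 4, k += 4) {
        const float *src = &b[LEN - k - 4]; /* lowest address of the block */
        a[i + 0] += src[3];
        a[i + 1] += src[2];
        a[i + 2] += src[1];
        a[i + 3] += src[0];
    }

    /* Scalar remainder for the last (LEN - n1 + 1) % 4 elements. */
    for (; i < LEN; i++) {
        k += 1;
        a[i] += b[LEN - k];
    }
}

/* Returns 1 if both versions produce identical results for n3 == 1. */
static int s122_versions_agree(int n1)
{
    static float a_ref[LEN], a_blk[LEN], b[LEN];

    for (int i = 0; i < LEN; i++) {
        a_ref[i] = a_blk[i] = (float)i * 0.5f;
        b[i] = (float)(i % 97);
    }

    s122_scalar(a_ref, b, n1, 1);
    s122_stride1_by4(a_blk, b, n1);

    for (int i = 0; i < LEN; i++) {
        if (a_ref[i] != a_blk[i])
            return 0;
    }
    return 1;
}
```

Each element of `a` still receives exactly one addition, so the blocked form does not depend on `-ffast-math` reassociation. What it does need is a runtime check that `n3 == 1`, with a scalar fallback for every other stride.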

See also: 
https://godbolt.org/z/7zzrPazM7

TODO: root cause analysis.
</pre>