<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href="https://github.com/llvm/llvm-project/issues/71519">71519</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [AArch64] Missed fadd vectorisation opportunity (tsvc, s231)
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            backend:AArch64,
            vectorization
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          sjoerdmeijer
      </td>
    </tr>
</table>

<pre>
It looks like we are 1400% (?!) behind GCC on kernel s231 in TSVC.
Compile this code with `-O3 -mcpu=neoverse-v2 -ffast-math`:

```
__attribute__((aligned(64))) float a[32000],b[32000],c[32000],d[32000],e[32000],
                                   aa[256][256],bb[256][256],cc[256][256],tt[256][256];

int dummy(float[32000], float[32000], float[32000], float[32000], float[32000], float[256][256], float[256][256], float[256][256], float);

float s231()
{
    for (int nl = 0; nl < 100*(100000/256); nl++) {
        for (int i = 0; i < 256; ++i) {
            for (int j = 1; j < 256; j++) {
                aa[j][i] = aa[j - 1][i] + bb[j][i];
            }
        }
        dummy(a, b, c, d, e, aa, bb, cc, 0.);
    }
}
```

Clang's codegen:

```
.LBB44_3: //   Parent Loop BB44_1 Depth=1
          //     Parent Loop BB44_2 Depth=2
          // =>    This Inner Loop Header: Depth=3
        add     x12, x21, x10
        add     x13, x20, x10
        subs    x11, x11, #5
        add     x10, x10, x19
        ldr     s1, [x12, #1024]
        fadd    s0, s1, s0
        ldr     s1, [x12, #2048]
        str     s0, [x13, #1024]
        fadd    s0, s1, s0
        ldr     s1, [x12, #3072]
        str     s0, [x13, #2048]
        fadd    s0, s1, s0
        ldr     s1, [x12, #4096]
        str     s0, [x13, #3072]
        fadd    s0, s1, s0
        ldr     s1, [x12, #5120]
        str     s0, [x13, #4096]
        fadd    s0, s1, s0
        str     s0, [x13, #5120]
        b.ne    .LBB44_3
```

vs. GCC's codegen:

```
.L521:
        ldr     q0, [x8, x0]
        ldr     q1, [x2, x0]
        fadd    v0.4s, v0.4s, v1.4s
        str     q0, [x1, x0]
        add     x0, x0, 16
        cmp     x0, 1024
        bne     .L521
```
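
GCC's inner loop steps 16 bytes at a time up to 1024 bytes, i.e. one contiguous row of 256 floats, so it looks as if GCC runs `i` as the innermost loop. That would be consistent with a loop interchange, but this is an assumption and not a confirmed root cause. A minimal sketch of the two loop orders follows; the names `N`, `s231_original`, `s231_interchanged` and `s231_orders_match` are hypothetical and only serve this illustration:

```c
#define N 256

/* Loop order from the report: the inner j loop carries a dependence
   (aa[j][i] needs aa[j-1][i]) and strides by a whole row of N floats. */
static void s231_original(float aa[N][N], float bb[N][N])
{
    for (int i = 0; i < N; ++i)
        for (int j = 1; j < N; j++)
            aa[j][i] = aa[j - 1][i] + bb[j][i];
}

/* Interchanged order (assumed, not confirmed, to be what GCC does):
   for a fixed j the inner i loop is unit-stride and has no
   loop-carried dependence, so it maps onto 4-lane vector fadds. */
static void s231_interchanged(float aa[N][N], float bb[N][N])
{
    for (int j = 1; j < N; j++)
        for (int i = 0; i < N; ++i)
            aa[j][i] = aa[j - 1][i] + bb[j][i];
}

/* Returns 1 if both loop orders produce identical arrays, else 0. */
int s231_orders_match(void)
{
    static float x[N][N], y[N][N], b[N][N];

    /* Fill both copies of aa and the shared bb with the same data. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; ++i) {
            x[j][i] = y[j][i] = (float)(i % 7) * 0.5f;
            b[j][i] = (float)((i + j) % 5) * 0.25f;
        }

    s231_original(x, b);
    s231_interchanged(y, b);

    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; ++i)
            if (x[j][i] != y[j][i])
                return 0;
    return 1;
}
```

The interchange is legal here because the only dependence is from row `j - 1` to row `j` within the same column `i`, and the interchanged nest still computes row `j - 1` before row `j`. Each element is produced by the same single addition in both orders, so the results should match exactly even without `-ffast-math`.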

See also:
https://godbolt.org/z/jr9WKW95v

TODO: root cause analysis.
</pre>