<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href="https://github.com/llvm/llvm-project/issues/71522">71522</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [AArch64] Missed doubly nested loop fmla vectorisation (tsvc, s235)
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            backend:AArch64,
            vectorization
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          sjoerdmeijer
      </td>
    </tr>
</table>

<pre>
GCC 12 vectorises the statements in both the outer and the inner loop, whereas Clang does not vectorise this loop nest at all. As a result, Clang is about 90% behind GCC on kernel s235 in TSVC.

Compile this input with `-O3 -mcpu=neoverse-v2 -ffast-math`:

```
__attribute__((aligned(64))) float a[32000],b[32000],c[32000],d[32000],e[32000],
    aa[256][256],bb[256][256],cc[256][256],tt[256][256];

int dummy(float[32000], float[32000], float[32000], float[32000], float[32000], float[256][256], float[256][256], float[256][256], float);

float s235()
{
    for (int nl = 0; nl < 200*(100000/256); nl++) {
        for (int i = 0; i < 256; i++) {
            a[i] += b[i] * c[i];
            for (int j = 1; j < 256; j++) {
                aa[j][i] = aa[j-1][i] + bb[j][i] * a[i];
            }
        }
        dummy(a, b, c, d, e, aa, bb, cc, 0.);
    }
    return aa[1][2];
}
```

Clang's scalar codegen:

```
.LBB0_2:                                // Parent Loop BB0_1 Depth=1
        ldr     s0, [x21, x8, lsl #2]
        ldr     s1, [x22, x8, lsl #2]
        ldr     s2, [x23, x8, lsl #2]
        mov     w11, #255                       // =0xff
        mov     x12, x9
        mov     x13, x10
        fmadd   s0, s1, s0, s2
        ldr     s1, [x20, x8, lsl #2]
        str     s0, [x23, x8, lsl #2]
.LBB0_3:                                //   Parent Loop BB0_1 Depth=1
        ldr     s2, [x13, #1024]
        subs    x11, x11, #3
        fmadd   s1, s2, s0, s1
        ldr     s2, [x13, #2048]
        str     s1, [x12, #1024]
        fmadd   s1, s2, s0, s1
        ldr     s2, [x13, #3072]
        add     x13, x13, #3072
        str     s1, [x12, #2048]
        fmadd   s1, s2, s0, s1
        str     s1, [x12, #3072]
        add     x12, x12, #3072
        b.ne    .LBB0_3
        add     x8, x8, #1
        add     x10, x10, #4
        add     x9, x9, #4
        cmp     x8, #256
        b.ne    .LBB0_2
```

vs. GCC's vector code:

```
.L4:
        add     x10, x22, x11
        sub     x9, x8, #1024
        ldr     q29, [x21, x11]
        mov     x0, 0
        ldr     q30, [x2, x11]
        ldr     q31, [x28, x11]
        fmla    v29.4s, v30.4s, v31.4s
        str     q29, [x21, x11]
.L3:
        ldr     q30, [x10, x0]
        ldr     q31, [x9, x0]
        fmla    v31.4s, v30.4s, v29.4s
        str     q31, [x8, x0]
        add     x0, x0, 1024
        cmp     x0, x19
        bne     .L3
        add     x11, x11, 16
        add     x8, x8, 16
        cmp     x11, 1024
        bne     .L4
```
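
Reading the GCC output, `.L4` advances `x11` by 16 bytes (four floats) and `.L3` strides by 1024 bytes (one 256-float row), so the vector lanes appear to run across `i` while the `j` recurrence stays sequential within each lane. A rough C sketch of that loop shape is below. It only illustrates the shape and is not a claim about how either compiler performs the transform; `s235_blocked`, `N` and `VF` are names made up for this sketch, and it covers a single `nl` iteration without the `dummy` call.

```c
#define N  256
#define VF 4   /* assumed vector factor: four floats per 128-bit register */

/* Hypothetical helper: one nl iteration of s235, blocked over i in groups of VF.
   Each "for l" loop stands in for a single 4-lane vector operation. */
static void s235_blocked(float a[N], const float b[N], const float c[N],
                         float aa[N][N], float bb[N][N])
{
    for (int i = 0; i < N; i += VF) {
        float av[VF];

        /* Outer-loop statement: a[i..i+3] += b[i..i+3] * c[i..i+3] */
        for (int l = 0; l < VF; l++) {
            a[i + l] += b[i + l] * c[i + l];
            av[l] = a[i + l];          /* kept live across the inner loop */
        }

        /* Inner loop: the j recurrence is sequential, but the VF lanes
           (different i) are independent of each other. */
        for (int j = 1; j < N; j++)
            for (int l = 0; l < VF; l++)
                aa[j][i + l] = aa[j - 1][i + l] + bb[j][i + l] * av[l];
    }
}
```

Each element still sees the same operations in the same order as in the scalar loop nest, so the dependence of `aa[j][i]` on `aa[j-1][i]` is preserved; only the iteration order over `i` changes.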

See also:
https://godbolt.org/z/5fG1bffqz

TODO:
Root cause analysis.
</pre>