<table border="1" cellspacing="0" cellpadding="8">

    <tr>

        <th>Issue</th>

        <td>

            <a href=https://github.com/llvm/llvm-project/issues/58467>58467</a>

        </td>

    </tr>

    <tr>

        <th>Summary</th>

        <td>

            [AArch64] Fold the mul and add into mla

        </td>

    </tr>

    <tr>

      <th>Labels</th>

      <td>

            new issue

      </td>

    </tr>

    <tr>

      <th>Assignees</th>

      <td>

      </td>

    </tr>

    <tr>

      <th>Reporter</th>

      <td>

          vfdff

      </td>

    </tr>

</table>

<pre>

    test case: https://gcc.godbolt.org/z/Yaa1sj9qc

```

void f(int * __restrict__ a, int * __restrict__ b, int * __restrict__ r) {

  for (int m = 0; m < 64; m++) {

    int c = 0;

    // #pragma unroll

    for (int i = 0; i < 32; i++) {

      c += a[i] * b[m * 32 + i];

    }

    r[m] = c;

  }

}

```

This case was first reported in https://bugs.llvm.org/show_bug.cgi?id=35448. Both gcc and llvm can now apply SLP vectorization to the loops, but clang's output still has fewer mla instructions in the kernel body.

**gcc: 7 mla** vs. **clang: 4 mla**
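
For readers less familiar with the fold being requested, here is a minimal portable sketch of why a separate multiply followed by an add can be replaced by a single multiply-accumulate. The `v4i32` type and the `mul_then_add` and `mla_model` helpers are hypothetical names that model a 4-lane vector in plain C. They are not NEON intrinsics and not compiler output, and the inputs are assumed small enough that no lane overflows.

```c
#include <stdint.h>

/* Hypothetical 4-lane model of a 128-bit vector register (not a real NEON type). */
typedef struct { int32_t lane[4]; } v4i32;

/* Models the unfused sequence: one lane-wise multiply, then one lane-wise add. */
static v4i32 mul_then_add(v4i32 acc, v4i32 x, v4i32 y) {
    v4i32 prod, out;
    for (int i = 0; i < 4; i++)
        prod.lane[i] = x.lane[i] * y.lane[i];      /* the "mul" step */
    for (int i = 0; i < 4; i++)
        out.lane[i] = acc.lane[i] + prod.lane[i];  /* the "add" step */
    return out;
}

/* Models the fused form: acc + x * y per lane in a single operation. */
static v4i32 mla_model(v4i32 acc, v4i32 x, v4i32 y) {
    v4i32 out;
    for (int i = 0; i < 4; i++)
        out.lane[i] = acc.lane[i] + x.lane[i] * y.lane[i];
    return out;
}
```

Both helpers compute the same lanes, so each mul/add pair left in the kernel body is, in principle, a candidate for one mla.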

</pre>
