<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/58467>58467</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [AArch64] Fold the mul and add into mla
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          vfdff
      </td>
    </tr>
</table>

<pre>
    Test case: https://gcc.godbolt.org/z/Yaa1sj9qc
```
void f(int * __restrict__ a, int * __restrict__ b, int * __restrict__ r) {
  for (int m = 0; m < 64; m++) {
    int c = 0;
    // #pragma unroll
    for (int i = 0; i < 32; i++) {
      c += a[i] * b[m * 32 + i];
    }
    r[m] = c;
  }
}
```
This case was first reported in https://bugs.llvm.org/show_bug.cgi?id=35448. Both gcc and llvm can now SLP-vectorize the loops, but clang's version still has fewer mla instructions in the kernel body.
**gcc: 7 mla** vs. **clang: 4 mla**
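
For reference, the fold being asked for can be modelled in portable C. This is only a rough scalar sketch, not actual NEON code and not the compiler output from the godbolt link. The helper names (`mul_then_add`, `mla_model`, `dot32`) and the 4-lane width are assumptions made for illustration, and `uint32_t` is used so that wraparound is well defined. It shows that a separate vector `mul` followed by `add` into the accumulator gives the same result as a single `mla` (accumulator += a * b per lane), which is why the pair can be folded.

```c
#include <assert.h>
#include <stdint.h>

enum { LANES = 4, K = 32 }; /* assumed: 4 x 32-bit lanes, inner trip count 32 */

/* Models the unfolded form: `mul tmp, a, b` then `add acc, acc, tmp`. */
static void mul_then_add(uint32_t acc[LANES], const uint32_t *a, const uint32_t *b) {
  uint32_t tmp[LANES];
  for (int l = 0; l < LANES; l++) tmp[l] = a[l] * b[l];
  for (int l = 0; l < LANES; l++) acc[l] += tmp[l];
}

/* Models the folded form: `mla acc, a, b`, i.e. acc += a * b per lane. */
static void mla_model(uint32_t acc[LANES], const uint32_t *a, const uint32_t *b) {
  for (int l = 0; l < LANES; l++) acc[l] += a[l] * b[l];
}

/* One row of the kernel: 32-element dot product using 4-lane accumulators. */
static uint32_t dot32(const uint32_t *a, const uint32_t *b, int use_mla) {
  uint32_t acc[LANES] = {0, 0, 0, 0};
  for (int i = 0; i < K; i += LANES) {
    if (use_mla)
      mla_model(acc, a + i, b + i);
    else
      mul_then_add(acc, a + i, b + i);
  }
  /* Horizontal reduction of the lanes, done once after the loop. */
  return acc[0] + acc[1] + acc[2] + acc[3];
}

/* Checks that both forms agree on a small fixed input. */
static void dot32_self_check(void) {
  uint32_t a[K], b[K];
  for (int i = 0; i < K; i++) {
    a[i] = (uint32_t)i + 1u;
    b[i] = 3u;
  }
  assert(dot32(a, b, 0) == dot32(a, b, 1));
}
```

Calling `dot32_self_check()` exercises both paths. The two forms differ only in instruction count, not in the value computed.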
</pre>