<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/61047>61047</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[SLP][AArch64] Over-eager SLP vectorisation
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
sjoerdmeijer
</td>
</tr>
</table>
<pre>
I am opening this issue to discuss possible approaches, as the problem seems to have been identified already: my motivating case is very similar to the pre-committed test case in 3c5e24a51ce072fd0083396dcf0ea107b1858d11.
Taking the very first test case, SLP vectorisation is indeed not profitable, which you could probably guess by just eyeballing the codegen:
https://godbolt.org/z/E64TMKsx9
But there are actually quite a few things going on here, and I don't think it is only related to whether or not fmas are generated. To illustrate this, here is the timeline view from MCA for the SLP-vectorised code:
```
Timeline view:
0123456789 01
Index 0123456789 0123456789
[0,0] DeeER. . . . . .. mov x8, x1
[0,1] DeeeeeeER . . . . .. ldr s0, [x0]
[0,2] D==eeeeeeeeER . . . .. ld1r { v1.2s }, [x8], #4
[0,3] D==========eeeeeeER . . .. ldr d2, [x8]
[0,4] D================eeeER . .. fmul v1.2s, v1.2s, v2.2s
[0,5] D================eeeER . .. fmul v0.2s, v2.2s, v0.s[0]
[0,6] D===================eeER . .. rev64 v1.2s, v1.2s
[0,7] D=====================eeER .. fsub v3.2s, v0.2s, v1.2s
[0,8] .D====================eeER .. fadd v0.2s, v0.2s, v1.2s
[0,9] .D======================eeER .. mov v3.s[1], v0.s[1]
[0,10] .D========================eeER.. str d3, [x0]
[0,11] .D========================eeeeER st1 { v2.s }[1], [x1]
```
Versus the scalar code:
```
Timeline view:
01234567
Index 0123456789
[0,0] DeeeeeeER . . . ldp s0, s2, [x1]
[0,1] DeeeeeeER . . . ldr s1, [x1, #8]
[0,2] DeeeeeeER . . . ldr s4, [x0]
[0,3] D======eeeER . . fmul s3, s1, s0
[0,4] D======eeeER . . fmul s0, s0, s2
[0,5] D=========eeeeER . fnmsub s2, s2, s4, s3
[0,6] D=========eeeeER . fmadd s0, s1, s4, s0
[0,7] D=============eeER stp s2, s0, [x0]
[0,8] .D============eeER str s1, [x1]
```
So to me this looks like a combination of problems:
- We emit higher-latency instructions, e.g. `LD1R` and `ST1`.
- We emit more instructions, e.g. `REV64` and `MOV` for the shuffle and extract.
More instructions don't necessarily mean slower code, but in this case they create longer dependency chains (critical paths), and there is not enough parallelism to make this profitable.
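As a rough way to put a number on the critical-path difference, the two timelines can be compared by the column at which their last instruction retires. The following is only a minimal sketch, assuming the timeline rows have been saved as plain text; the helper names `retire_cycle` and `last_retire` are hypothetical and not part of llvm-mca:

```python
def retire_cycle(row):
    """Return the timeline column (cycle) at which the row's instruction retires."""
    # Drop the "[iteration,index]" prefix; the timeline field starts right after it.
    body = row.split("]", 1)[1].lstrip()
    # The first 'R' in the timeline field marks retirement; the column index is the cycle.
    # str.index raises ValueError if the instruction never retires in the shown window.
    return body.index("R")


def last_retire(rows):
    """Return the latest retire cycle over all instruction rows of one timeline."""
    # Only rows starting with '[' are instruction rows; header lines are skipped.
    return max(retire_cycle(r) for r in rows if r.lstrip().startswith("["))


if __name__ == "__main__":
    # Synthetic rows in the same layout as the timelines above (assumption: one row per line).
    demo = [
        "Index 0123456789",
        "[0,0] DeeER. . mov x8, x1",
        "[0,1] .D==eeER fadd s0, s0, s1",
    ]
    print("last retire cycle:", last_retire(demo))
```

Running `last_retire` on each of the two timelines above gives a single figure per version, which makes it easier to see how much of the gap comes from the longer dependency chain rather than from the instruction count alone.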
To me, it looks like a lot of the cost-modelling of insertelement, shufflevector, and extractelement has gone wrong here, so I am going to look into that. But if you have other ideas about this, @fhahn or @davemgreen, then I am open to suggestions of course.
</pre>