<table border="1" cellspacing="0" cellpadding="8">

    <tr>

        <th>Issue</th>

        <td>

            <a href=https://github.com/llvm/llvm-project/issues/61047>61047</a>

        </td>

    </tr>

    <tr>

        <th>Summary</th>

        <td>

            [SLP][AArch64] Over-eager SLP vectorisation

        </td>

    </tr>

    <tr>

      <th>Labels</th>

      <td>

            new issue

      </td>

    </tr>

    <tr>

      <th>Assignees</th>

      <td>

      </td>

    </tr>

    <tr>

      <th>Reporter</th>

      <td>

          sjoerdmeijer

      </td>

    </tr>

</table>

<pre>

    I am opening this issue to discuss possible approaches, as the problem seems to have been identified already: my motivating case is very similar to the pre-committed test case in 3c5e24a51ce072fd0083396dcf0ea107b1858d11.

Taking the very first test case, SLP vectorisation is indeed not profitable, which you could probably guess by just eyeballing the codegen: 

https://godbolt.org/z/E64TMKsx9

But there are actually quite a few things going on here; I don't think it is only about whether FMAs are generated or not. To illustrate this, here is the timeline view from llvm-mca for the SLP-vectorised code:

```
Timeline view:
                    0123456789          01
Index     0123456789          0123456789

[0,0]     DeeER.    .    .    .    .    ..   mov     x8, x1
[0,1]     DeeeeeeER .    .    .    .    ..   ldr     s0, [x0]
[0,2]     D==eeeeeeeeER  .    .    .    ..   ld1r    { v1.2s }, [x8], #4
[0,3]     D==========eeeeeeER .    .    ..   ldr     d2, [x8]
[0,4]     D================eeeER   .    ..   fmul    v1.2s, v1.2s, v2.2s
[0,5]     D================eeeER   .    ..   fmul    v0.2s, v2.2s, v0.s[0]
[0,6]     D===================eeER .    ..   rev64   v1.2s, v1.2s
[0,7]     D=====================eeER    ..   fsub    v3.2s, v0.2s, v1.2s
[0,8]     .D====================eeER    ..   fadd    v0.2s, v0.2s, v1.2s
[0,9]     .D======================eeER  ..   mov     v3.s[1], v0.s[1]
[0,10]    .D========================eeER..   str     d3, [x0]
[0,11]    .D========================eeeeER   st1     { v2.s }[1], [x1]
```

Versus the timeline for the scalar code:

```
Timeline view:
                    01234567
Index     0123456789

[0,0]     DeeeeeeER .    . .   ldp     s0, s2, [x1]
[0,1]     DeeeeeeER .    . .   ldr     s1, [x1, #8]
[0,2]     DeeeeeeER .    . .   ldr     s4, [x0]
[0,3]     D======eeeER   . .   fmul    s3, s1, s0
[0,4]     D======eeeER   . .   fmul    s0, s0, s2
[0,5]     D=========eeeeER .   fnmsub  s2, s2, s4, s3
[0,6]     D=========eeeeER .   fmadd   s0, s1, s4, s0
[0,7]     D=============eeER   stp     s2, s0, [x0]
[0,8]     .D============eeER   str     s1, [x1]
```

So to me this looks like a combination of problems:

- We emit higher-latency instructions, e.g. `LD1R` and `ST1`.

- We emit more instructions, e.g. `REV64` and `MOV` for the shuffle and extract.

More instructions does not have to mean slower code, but in this case they create long dependency chains (critical paths), and there is not enough parallelism for the vectorisation to be profitable.
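
To make the critical-path argument a bit more concrete, here is a rough sketch that computes the longest data-dependency chain in each sequence. The latencies are simply the number of `e` cycles read off the two timelines above, and the dependency lists are my reading of the register operands; the names `critical_path`, `VECTORISED` and `SCALAR` are made up for illustration. It ignores dispatch width, resource pressure and store ordering, so it is only an approximation of what MCA models.

```python
# Each entry: (instruction, latency in cycles, indices of the producers it waits on).
# Latencies are the 'e' counts from the timelines; the dependency lists are
# inferred from register operands and are an assumption of this sketch.
VECTORISED = [
    ("mov x8, x1", 2, []),
    ("ldr s0, [x0]", 6, []),
    ("ld1r { v1.2s }, [x8], #4", 8, [0]),
    ("ldr d2, [x8]", 6, [2]),
    ("fmul v1.2s, v1.2s, v2.2s", 3, [2, 3]),
    ("fmul v0.2s, v2.2s, v0.s[0]", 3, [1, 3]),
    ("rev64 v1.2s, v1.2s", 2, [4]),
    ("fsub v3.2s, v0.2s, v1.2s", 2, [5, 6]),
    ("fadd v0.2s, v0.2s, v1.2s", 2, [5, 6]),
    ("mov v3.s[1], v0.s[1]", 2, [7, 8]),
    ("str d3, [x0]", 2, [9]),
    ("st1 { v2.s }[1], [x1]", 4, [3]),
]

SCALAR = [
    ("ldp s0, s2, [x1]", 6, []),
    ("ldr s1, [x1, #8]", 6, []),
    ("ldr s4, [x0]", 6, []),
    ("fmul s3, s1, s0", 3, [0, 1]),
    ("fmul s0, s0, s2", 3, [0]),
    ("fnmsub s2, s2, s4, s3", 4, [0, 2, 3]),
    ("fmadd s0, s1, s4, s0", 4, [1, 2, 4]),
    ("stp s2, s0, [x0]", 2, [5, 6]),
    ("str s1, [x1]", 2, [1]),
]


def critical_path(instrs):
    """Length in cycles of the longest dependency chain.

    Instructions are listed in program order, so every producer index
    refers to an earlier entry whose finish time is already known.
    """
    finish = []
    for _name, latency, deps in instrs:
        # An instruction can start once all of its producers have finished.
        start = max((finish[d] for d in deps), default=0)
        finish.append(start + latency)
    return max(finish)


if __name__ == "__main__":
    print("vectorised:", critical_path(VECTORISED), "cycles")
    print("scalar:", critical_path(SCALAR), "cycles")
```

Under these assumptions the sketch reports 27 cycles for the vectorised sequence and 15 for the scalar one. The vectorised chain runs `mov` -> `ld1r` -> `ldr` -> `fmul` -> `rev64` -> `fsub`/`fadd` -> `mov` -> `str`, which is in line with the gap visible in the two timelines.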

To me, it looks like the cost-modelling of `insertelement`, `shufflevector`, and `extractelement` has gone wrong here, so I am going to look into that. But if you have other ideas about this, @fhahn or @davemgreen, I am open to suggestions of course.

</pre>
