<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/60515>60515</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            Inefficient code generated for `__builtin_shufflevector` with runtime mask on ARM
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          lawben
      </td>
    </tr>
</table>

<pre>
    I'm looking at using `__builtin_shufflevector(vec, mask)` in a project on an M1 MacBook, and I found that the generated code is worse than a) the x86 version and b) what it could be. Note: this is the variant with a runtime mask, not the one with a constant mask. The runtime-mask form is technically not documented, but it is implemented. There is an [open issue](https://github.com/llvm/llvm-project/issues/59678) for the documentation side.

Here is a [godbolt link](https://godbolt.org/z/4fPKaKz8M) for the code example.

```cpp
using VecT __attribute__((vector_size(16))) = uint8_t;

VecT shuffle_builtin(VecT vec, VecT mask) {
    return __builtin_shufflevector(vec, mask);
}
```

This code is compiled to two instructions on x86 (`vpand` + `vpshufb`). The `vpand` is not strictly necessary, but it gets created in the frontend and is not removed during instruction selection. On ARM with NEON, however, the same code is compiled to 66 instructions, logically consisting of 16x "extract element at i and insert it". The same applies to vectors with 2/4/8 elements [(godbolt)](https://godbolt.org/z/3aeqeqPv6).
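
To make the "extract element at i and insert it" expansion concrete, here is a minimal scalar sketch of what I understand the runtime-mask shuffle to compute. `ByteVec16` and `shuffle_reference` are names made up for this illustration, and the `& 15` models the index masking that shows up as `vpand` on x86; this is my reading of the behaviour, not documented semantics.

```cpp
// Illustrative 16-byte vector type (GCC/Clang vector extension).
typedef unsigned char ByteVec16 __attribute__((vector_size(16)));

// Scalar model of the runtime-mask shuffle: one extract + insert per lane.
ByteVec16 shuffle_reference(ByteVec16 vec, ByteVec16 mask) {
    ByteVec16 result = {};
    for (int i = 0; i != 16; ++i) {
        // Assumption: indices are wrapped into 0..15 before the lookup.
        result[i] = vec[mask[i] & 15];
    }
    return result;
}
```

Under this model, a mask of 15, 14, ..., 0 reverses the bytes, and an index of 16 wraps around to element 0.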

I'm currently looking into creating a patch for the 16-element case, as it translates directly to a NEON `TBL` instruction. As this pattern is detected during instruction selection on x86, I guess it can also be detected for aarch64. Handling 2/4/8 elements requires some additional thought, but I assume this can also be modeled in a similar way.
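
As a rough sketch of the code I would hope to get for 16 elements, here is a hand-written version using the NEON intrinsics `vandq_u8`, `vdupq_n_u8`, and `vqtbl1q_u8`. `shuffle_tbl` is a hypothetical name, and the explicit AND assumes that the index masking seen on x86 is part of the builtin's behaviour on aarch64 as well; whether it can be dropped is a separate question.

```cpp
#if defined(__aarch64__)
#include "arm_neon.h"  // system NEON intrinsics header

// Hypothetical hand-written equivalent of the 16-element runtime shuffle.
uint8x16_t shuffle_tbl(uint8x16_t vec, uint8x16_t mask) {
    // TBL yields 0 for indices outside 0..15, so wrap them first to match
    // the masking assumed for the builtin.
    uint8x16_t idx = vandq_u8(mask, vdupq_n_u8(15));
    // One table lookup: result[i] = vec[idx[i]] for all 16 lanes.
    return vqtbl1q_u8(vec, idx);
}
#endif
```

This is an AND plus a single `TBL`, which mirrors the `vpand` + `vpshufb` pair on x86.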

 
</pre>