<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/96754>96754</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[X86][Aarch64][RISCV] Dynamic vector shuffle idiom is not always recognized and optimally lowered
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
Hendiadyoin1
</td>
</tr>
</table>
<pre>
The following pattern (C++):
```c++
i32x4{num[idx[0]],
num[idx[1]],
num[idx[2]],
num[idx[3]]}
```
is not optimized to a native vector shuffle on every architecture:
- X86 with SSSE3 does a good job, generating a `pshufb` with some index adjustments.
- Arm only seems to use the equivalent `tbl` instruction in the byte-sized case. It fails to apply the index-folding trick x86 uses and instead goes through the stack to achieve dynamic indexing.
- RISC-V fails to emit a `vrgather` in all cases, instead constructing the result vector element by element.
Compare here: https://godbolt.org/z/osPhv1KT4
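For reference, a self-contained form of the unchecked pattern, as a minimal sketch assuming the GCC/Clang `vector_size` extension. The function name `dynamic_shuffle` is illustrative and not taken from the linked Godbolt:

```c++
// Assumption: GCC/Clang vector extension, four 32-bit ints in one 128-bit vector.
typedef int i32x4 __attribute__((vector_size(16)));

// Hypothetical name. Every idx[i] is expected to be in [0, 3];
// anything else is an out-of-bounds element access.
i32x4 dynamic_shuffle(i32x4 num, i32x4 idx) {
    i32x4 out = {num[idx[0]], num[idx[1]], num[idx[2]], num[idx[3]]};
    return out;
}
```

Ideally this whole function would lower to a single native shuffle plus whatever index adjustment the target needs.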
This especially applies to the range-checked version of that code:
```c++
i32x4{idx[0] >= 0 && idx[0] < 4 ? num[idx[0]] : 0,
idx[1] >= 0 && idx[1] < 4 ? num[idx[1]] : 0,
idx[2] >= 0 && idx[2] < 4 ? num[idx[2]] : 0,
idx[3] >= 0 && idx[3] < 4 ? num[idx[3]] : 0}
```
which all architectures support natively in some way,
but no tested backend seems to generate the ideal code.
x86 again does quite well with its index folding, but fails to recognize and leverage the zeroing behavior of `pshufb`, even after being given a hint on how a masked approach would look:
```c++
idx = idx | ~((u32x4)idx < 4);
return i32x4{
(idx[0] != ~0 ? num[idx[0]] : 0),
(idx[1] != ~0 ? num[idx[1]] : 0),
(idx[2] != ~0 ? num[idx[2]] : 0),
(idx[3] != ~0 ? num[idx[3]] : 0),
};
```
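A self-contained version of that masked hint might look like the following. This is a sketch assuming the GCC/Clang `vector_size` extension and its lane-wise comparison semantics (true lanes become -1, false lanes 0); `checked_shuffle` is a hypothetical name:

```c++
// Assumption: GCC/Clang vector extensions.
typedef int i32x4 __attribute__((vector_size(16)));
typedef unsigned u32x4 __attribute__((vector_size(16)));

// Hypothetical name. Out-of-range indices yield 0 in the result lane.
i32x4 checked_shuffle(i32x4 num, i32x4 idx) {
    // Unsigned compare: negative indices become huge values,
    // so a single test covers both bounds.
    u32x4 limit = {4, 4, 4, 4};
    i32x4 in_range = (u32x4)idx < limit;  // -1 per lane if in range, else 0
    idx = idx | ~in_range;                // out-of-range lanes become all-ones
    i32x4 out = {
        idx[0] != ~0 ? num[idx[0]] : 0,
        idx[1] != ~0 ? num[idx[1]] : 0,
        idx[2] != ~0 ? num[idx[2]] : 0,
        idx[3] != ~0 ? num[idx[3]] : 0,
    };
    return out;
}
```

After the OR, every byte of an out-of-range index lane has its top bit set, which is exactly the form a byte shuffle with zeroing semantics could consume directly.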
A more comprehensive list of functions in multiple versions and possible optimized versions can be found here:
https://godbolt.org/z/Ej4hYsvPr
Important notes from the code:
```
// Note: x86-SSSE3's `pshufb` inserts a 0 when the most significant bit of the
// index element is set, so masking it off with 0xFF for oob elements works
// fine [...]
// Note: ARMs `tbl` inserts a 0 when the index element is out of range [...]
// Note: RiscV-Vs vrgather.vv, similar to arm, sets a 0 when the index element
// is out of bounds,
// but additionally it acts with the current element width, making it not
// need any additional index manipulation or masking
```
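To make the `pshufb` note and the index-folding trick concrete, here is a portable scalar model. It is only an illustration under stated assumptions (little-endian byte layout, 32-bit lanes), not what any backend actually emits, and the names `pshufb_model` and `fold_indices` are hypothetical:

```c++
// Scalar model of a 128-bit pshufb: a control byte with its top bit set
// produces 0, otherwise its low 4 bits select a source byte.
void pshufb_model(const unsigned char src[16], const unsigned char ctl[16],
                  unsigned char out[16]) {
    for (int i = 0; i < 16; ++i)
        out[i] = (ctl[i] & 0x80) ? 0 : src[ctl[i] & 0x0F];
}

// Index folding: expand four 32-bit lane indices into 16 byte controls.
// Lane k of the source occupies bytes 4*k .. 4*k+3, so result lane j reads
// bytes 4*idx[j] + 0..3. Out-of-range lanes get 0xFF, which zeroes them.
void fold_indices(const int idx[4], unsigned char ctl[16]) {
    for (int lane = 0; lane < 4; ++lane) {
        bool in_range = idx[lane] >= 0 && idx[lane] < 4;
        for (int b = 0; b < 4; ++b)
            ctl[4 * lane + b] =
                in_range ? (unsigned char)(4 * idx[lane] + b) : 0xFF;
    }
}
```

Feeding the result of `fold_indices` into `pshufb_model` reproduces the range-checked 32-bit shuffle byte by byte, which is why no separate select should be needed on x86.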
This behavior is likely caused in part by those code snippets not being canonicalized to a single IR instruction.
Note that [`llvm.vp.gather.*`](https://llvm.org/docs/LangRef.html#llvm-vp-gather-intrinsic) only seems to handle loads from memory
and [`shufflevector`](https://llvm.org/docs/LangRef.html#shufflevector-instruction) seems to only allow shuffling by constants.
Also note that, in comparison to GCC, Clang does not expose `__builtin_shuffle (vec, mask)` or `__builtin_shuffle (vec0, vec1, mask)`, which represent the first case.
----
Just to compare, GCC on x86 fails to do any native shuffles and falls back to stack loads, and even gets branchy in the checked case, but it has `__builtin_shuffle`, which works quite well.
Side note: UBSan destroys the second case even on x86, although it does not contain any possible UB as far as I can tell.
</pre>