<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href="https://github.com/llvm/llvm-project/issues/56100">56100</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
Loading an `i128` and bit-casting to `<2xi64>` in a loop does not select `vmovdqu`
</td>
</tr>
<tr>
<th>Labels</th>
<td>
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
mcy
</td>
</tr>
</table>
<pre>
Consider the following C++: https://godbolt.org/z/TKWWETf9o
```cc
#include <wmmintrin.h>
#include <string.h>
#include <array>
template <typename T>
T ManyAeses(const char* input, int times) {
T x, y;
for (int i = 0; i < times; ++i) {
memcpy(&y, input, sizeof(y));
if (i == 0) x = y;
__m128i xm, ym;
memcpy(&xm, &x, sizeof(xm));
memcpy(&ym, &y, sizeof(ym));
xm = _mm_aesenc_si128(xm, ym);
memcpy(&x, &xm, sizeof(x));
}
return x;
}
template std::array<size_t, 2> ManyAeses<std::array<size_t, 2>>(const char*, int);
template __uint128_t ManyAeses<__uint128_t>(const char*, int);
```
Naively, we would expect identical codegen for both instantiations, since `i128` is just a funny `[2xi64]` on amd64. However, the second instantiation ends up with some very silly codegen:
```S
unsigned __int128 ManyAeses<unsigned __int128>(char const*, int):
xor edx, edx
test esi, esi
jle .LBB1_1
mov r8, qword ptr [rdi]
mov rcx, qword ptr [rdi + 8]
vmovdqu xmm0, xmmword ptr [rdi]
.LBB1_5:
test edx, edx
cmove rax, r8
cmove rdi, rcx
vmovq xmm1, rdi
vmovq xmm2, rax
vpunpcklqdq xmm1, xmm2, xmm1
vaesenc xmm1, xmm1, xmm0
vpextrq rdi, xmm1, 1
vmovq rax, xmm1
inc edx
cmp esi, edx
jne .LBB1_5
vpextrq rdx, xmm1, 1
ret
.LBB1_1:
ret
```
`y` is loaded here using a `cmov; cmov; vmovq; vmovq; vpunpcklqdq` sequence, when a single memory-to-xmm `vmovdqu` would have worked just fine (which is what the other instantiation generates).
I can't get this to repro without an un-unrollable loop. There may be architectural details I'm missing here, but my default assumption is that some pass is failing to collapse the `i128*` loads into `<2xi64>*` loads. This seems to be an `i128`-specific problem, because `[2xi64]` and `{i64,i64}` optimize just fine. The fact that `i128 -> <2xi64>` bitcasts are present in the optimized IR makes me suspect that this is not actually an instruction selection issue.
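For reference, the claim that `i128` and a pair of `i64`s share the same 16-byte layout can be checked with a small standalone sketch. It assumes a little-endian target such as amd64 and a compiler that provides `__uint128_t` (GCC or Clang). The helper name `SameBytesAsPair` is hypothetical and not part of the repro above.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical helper, not part of the original repro.
// Builds a 128-bit integer from two 64-bit halves and compares its object
// representation against std::array<uint64_t, 2>{lo, hi}.
// Assumption: little-endian target (e.g. amd64), so the low half comes first.
bool SameBytesAsPair(std::uint64_t lo, std::uint64_t hi) {
  static_assert(sizeof(__uint128_t) == sizeof(std::array<std::uint64_t, 2>),
                "both types are expected to occupy 16 bytes");
  __uint128_t wide = (static_cast<__uint128_t>(hi) << 64) | lo;
  std::array<std::uint64_t, 2> pair = {lo, hi};
  // memcmp compares the raw bytes, which is what the memcpy calls in
  // ManyAeses ultimately copy into the __m128i values.
  return std::memcmp(&wide, &pair, sizeof(wide)) == 0;
}
```

If this returns true for arbitrary inputs, both instantiations feed byte-identical values to `_mm_aesenc_si128`, so any difference in the generated code comes from how the optimizer treats the types, not from the data.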
</pre>