<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href="https://github.com/llvm/llvm-project/issues/56100">56100</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            Loading an `i128` and bit-casting to `<2xi64>` in a loop does not select `vmovdqu`
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          mcy
      </td>
    </tr>
</table>

<pre>
    Consider the following C++: https://godbolt.org/z/TKWWETf9o

```cc
#include <wmmintrin.h>
#include <string.h>
#include <array>

template <typename T>
T ManyAeses(const char* input, int times) {
  T x, y;
  for (int i = 0; i < times; ++i) {
    memcpy(&y, input, sizeof(y));
    if (i == 0) x = y;
    __m128i xm, ym;
    memcpy(&xm, &x, sizeof(xm));
    memcpy(&ym, &y, sizeof(ym));
    xm = _mm_aesenc_si128(xm, ym);
    memcpy(&x, &xm, sizeof(x));
  }
  return x;
}

template std::array<size_t, 2> ManyAeses<std::array<size_t, 2>>(const char*, int);
template __uint128_t ManyAeses<__uint128_t>(const char*, int);
```

Naively, we'd expect identical codegen for both instantiations, since `i128` is just a funny `[2xi64]` on amd64. However, the second (`__uint128_t`) instantiation ends up with some very silly codegen:

```S
unsigned __int128 ManyAeses<unsigned __int128>(char const*, int):
        xor     edx, edx
        test    esi, esi
        jle     .LBB1_1
        mov     r8, qword ptr [rdi]
        mov     rcx, qword ptr [rdi + 8]
        vmovdqu xmm0, xmmword ptr [rdi]
.LBB1_5:
        test    edx, edx
        cmove   rax, r8
        cmove   rdi, rcx
        vmovq   xmm1, rdi
        vmovq   xmm2, rax
        vpunpcklqdq     xmm1, xmm2, xmm1
        vaesenc xmm1, xmm1, xmm0
        vpextrq rdi, xmm1, 1
        vmovq   rax, xmm1
        inc     edx
        cmp     esi, edx
        jne     .LBB1_5
        vpextrq rdx, xmm1, 1
        ret
.LBB1_1:
        ret
```

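Here the `i == 0` select between `x` and `y` happens in GPRs: `y` is loaded into `r8`/`rcx`, and each iteration rebuilds the XMM operand with a `cmov; cmov; vmovq; vmovq; vpunpcklqdq` sequence, when a single mem2xmm `vmovdqu` would have worked just fine (which is what the other instantiation generates).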
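
For reference, a minimal sketch of the two load paths being compared. `LoadDirect`, `LoadViaU128`, and `SameLoad` are hypothetical helper names, not part of the reproducer, and this only shows that both paths yield the same 16 bytes. It does not claim that writing the load this way changes the codegen inside the loop.

```cc
#include <emmintrin.h>  // SSE2: _mm_loadu_si128
#include <cstring>

// Hypothetical helper: loads 16 unaligned bytes straight into an XMM value.
inline __m128i LoadDirect(const char* input) {
  return _mm_loadu_si128(reinterpret_cast<const __m128i*>(input));
}

// Mirrors the reproducer: the bytes pass through a __uint128_t first.
inline __m128i LoadViaU128(const char* input) {
  __uint128_t y;
  std::memcpy(&y, input, sizeof(y));
  __m128i ym;
  std::memcpy(&ym, &y, sizeof(ym));
  return ym;
}

// Returns true if both paths produce the same 16 bytes.
inline bool SameLoad(const char* input) {
  __m128i a = LoadDirect(input);
  __m128i b = LoadViaU128(input);
  return std::memcmp(&a, &b, sizeof(a)) == 0;
}
```

Since the two helpers are semantically equivalent, the expectation in this report is that both should be able to lower to one unaligned vector load.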

I can't get this to repro without a loop that can't be fully unrolled. There may be architectural details I'm missing here, but my default assumption is that some pass is failing to collapse the `i128*` loads into `<2xi64>*` loads. This seems to be an `i128`-specific problem because `[2xi64]` and `{i64,i64}` optimize just fine, and the fact that `i128 -> <2xi64>` bitcasts are present in the optimized IR makes me suspect that this is not actually an instruction selection issue.
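
To make the "same bytes, different type" point concrete, here is a small sketch showing that the three types carry identical 16-byte representations. `Pair`, `LoadAs`, and `SameBytes` are hypothetical names introduced for illustration (with `Pair` standing in for the `{i64,i64}` case), and the sketch assumes a target where `__uint128_t` is available.

```cc
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the `{i64,i64}` case.
struct Pair {
  uint64_t lo, hi;
};

// All three types are 16 bytes, so the same input can back any of them.
static_assert(sizeof(__uint128_t) == 16, "i128 is 16 bytes");
static_assert(sizeof(std::array<uint64_t, 2>) == 16, "array is 16 bytes");
static_assert(sizeof(Pair) == 16, "struct is 16 bytes");

// Copies sizeof(T) bytes from `input` into a T, like the reproducer's memcpy.
template <typename T>
T LoadAs(const char* input) {
  T value;
  std::memcpy(&value, input, sizeof(value));
  return value;
}

// Returns true if the three views of `input` are byte-for-byte identical.
inline bool SameBytes(const char* input) {
  auto wide = LoadAs<__uint128_t>(input);
  auto arr = LoadAs<std::array<uint64_t, 2>>(input);
  auto pair = LoadAs<Pair>(input);
  return std::memcmp(&wide, &arr, 16) == 0 &&
         std::memcmp(&wide, &pair, 16) == 0;
}
```

Adding `template Pair ManyAeses<Pair>(const char*, int);` to the reproducer would be one way to compare the struct case side by side with the other two instantiations.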
</pre>