<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/136519>136519</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [X86] Use `vpinsrq` in building 2-element vector of 64-bit int loads
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          dzaima
      </td>
    </tr>
</table>

<pre>
    To build a vector of two 64-bit elements from memory, clang currently emits two separate loads and then packs them together, e.g. for this code:

```c
#include <stdint.h>
#include <immintrin.h>

typedef uint64_t u64x2 __attribute__((vector_size(16)));
u64x2 generic_int(uint64_t* a, uint64_t* b) {
    return (u64x2){*a, *b};
}

__m128i intrinsics(uint64_t* a, uint64_t* b) {
    __m128i lo = _mm_loadu_si64(a);
    return _mm_insert_epi64(lo, *b, 1);
}

__m128i intrinsics_int_domain(uint64_t* a, uint64_t* b) {
    __m128i lo = _mm_loadu_si64(a);
    __m128i t = _mm_insert_epi64(lo, *b, 1);
    return _mm_add_epi64(t, t);
}
```
With `-O3 -march=haswell`, this compiles to:
```asm
generic_int:
        vmovsd          xmm0, qword ptr [rsi]
        vmovsd          xmm1, qword ptr [rdi]
        vmovlhps        xmm0, xmm1, xmm0
        ret

intrinsics:
        vmovsd          xmm0, qword ptr [rsi]
        vmovsd          xmm1, qword ptr [rdi]
        vmovlhps        xmm0, xmm1, xmm0
        ret

intrinsics_int_domain:
        vmovq           xmm0, qword ptr [rsi]
        vmovq           xmm1, qword ptr [rdi]
        vpunpcklqdq     xmm0, xmm1, xmm0
        vpaddq          xmm0, xmm0, xmm0
        ret
```

even though the load of `b` could be folded into the packing instruction: `vpinsrq` for the integer domain, or `vmovhps` for an unspecified domain if preferring the float domain is desired, i.e.:
```asm
vmovq  xmm0, qword ptr [rdi]
vpinsrq xmm0, xmm0, qword ptr [rsi], 1
```
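
As a rough illustration of the semantics only (not of the codegen), the load-then-insert form can be sketched with the same vector extension used above. The function names `build_pair` and `build_insert` are hypothetical and exist just for this comparison; whether a compiler lowers `build_insert` to `vmovq` + `vpinsrq` is exactly what this issue is about, so no particular output is assumed here.

```c
#include <stdint.h>

typedef uint64_t u64x2 __attribute__((vector_size(16)));

// Reference form from the report: both lanes built in one expression.
u64x2 build_pair(const uint64_t* a, const uint64_t* b) {
    return (u64x2){*a, *b};
}

// Hypothetical helper: load the low lane with the high lane zeroed
// (the effect of `vmovq xmm0, [mem]`), then overwrite lane 1 directly
// from memory (the effect of `vpinsrq xmm0, xmm0, [mem], 1`).
u64x2 build_insert(const uint64_t* a, const uint64_t* b) {
    u64x2 v = {*a, 0};
    v[1] = *b;
    return v;
}
```

Both functions should return the same vector for any pair of inputs, which is why the two-instruction sequence is a valid replacement for the three-instruction one.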

Additionally, per uops.info data, `vpinsrq` has higher throughput than `vmovhps` post-Ice Lake, and in some local microbenchmarking on Haswell I don't see any domain-crossing penalties for either instruction in any direction, so it could make sense to always use `vpinsrq` and never `vmovhps` (or at least to do so on the applicable targets).
</pre>