<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/136519>136519</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[X86] Use `vpinsrq` in building 2-element vector of 64-bit int loads
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
dzaima
</td>
</tr>
</table>
<pre>
When building a vector of two 64-bit elements, clang currently loads each element separately and then packs them together, e.g. this code:
```c
#include <stdint.h>
#include <immintrin.h>

typedef uint64_t u64x2 __attribute__((vector_size(16)));
u64x2 generic_int(uint64_t* a, uint64_t* b) {
return (u64x2){*a, *b};
}
__m128i intrinsics(uint64_t* a, uint64_t* b) {
__m128i lo = _mm_loadu_si64(a);
return _mm_insert_epi64(lo, *b, 1);
}
__m128i intrinsics_int_domain(uint64_t* a, uint64_t* b) {
__m128i lo = _mm_loadu_si64(a);
__m128i t = _mm_insert_epi64(lo, *b, 1);
return _mm_add_epi64(t, t);
}
```
via `-O3 -march=haswell` compiles to:
```asm
generic_int:
vmovsd xmm0, qword ptr [rsi]
vmovsd xmm1, qword ptr [rdi]
vmovlhps xmm0, xmm1, xmm0
ret
intrinsics:
vmovsd xmm0, qword ptr [rsi]
vmovsd xmm1, qword ptr [rdi]
vmovlhps xmm0, xmm1, xmm0
ret
intrinsics_int_domain:
vmovq xmm0, qword ptr [rsi]
vmovq xmm1, qword ptr [rdi]
vpunpcklqdq xmm0, xmm1, xmm0
vpaddq xmm0, xmm0, xmm0
ret
```
even though the load of `b` could be folded into the packing instruction: `vpinsrq` for the integer domain, or `vmovhps` for an unspecified domain if preferring the float domain is desired, i.e.:
```asm
vmovq xmm0, qword ptr [rdi]
vpinsrq xmm0, xmm0, qword ptr [rsi], 1
```
Additionally, per uops.info data, post-icelake, `vpinsrq` has higher throughput than `vmovhps`, and in some local microbenchmarking on Haswell I don't see any domain-crossing penalties for either in any direction, so it could make sense to always use `vpinsrq` and never `vmovhps` (or at least on the applicable targets).
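
For comparison, the float-domain form can also be written explicitly at the source level. The following is a minimal sketch, not part of the original report: `loadh_variant` and `same_result` are hypothetical names, and it uses `_mm_loadh_pi` (the intrinsic documented as corresponding to `movhps`) with casts to and from `__m128`. It only demonstrates that the result matches the generic form; which instructions a given compiler selects for it is not claimed here.

```c
#include <stdint.h>
#include <string.h>
#include <emmintrin.h> /* SSE2; also provides the SSE _mm_loadh_pi */

typedef uint64_t u64x2 __attribute__((vector_size(16)));

/* Reference: the generic form from the report. */
static u64x2 generic_int(const uint64_t* a, const uint64_t* b) {
  return (u64x2){*a, *b};
}

/* Hypothetical helper: load the low lane (upper lane zeroed), then
   load the high lane straight from memory via _mm_loadh_pi. */
static __m128i loadh_variant(const uint64_t* a, const uint64_t* b) {
  __m128i lo = _mm_loadl_epi64((const __m128i*)a);
  __m128 both = _mm_loadh_pi(_mm_castsi128_ps(lo), (const __m64*)b);
  return _mm_castps_si128(both);
}

/* Returns 1 if both forms produce bit-identical 128-bit results. */
static int same_result(uint64_t a, uint64_t b) {
  u64x2 g = generic_int(&a, &b);
  __m128i v = loadh_variant(&a, &b);
  return memcmp(&g, &v, sizeof g) == 0;
}
```

Calling `same_result` with a few pairs of values is a quick way to confirm the two forms agree before comparing their generated assembly.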
</pre>