<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/103707>103707</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[AVX-512] Consider writing `k` registers to memory when vectorizing procedures on their results?
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
Validark
</td>
</tr>
</table>
<pre>
I have some code that extracts a set of bitstrings from k-registers, and LLVM ends up auto-vectorizing the code that collects them. However, auto-vectorizing data that lives in k-registers turns out to be quite expensive:
```zig
export fn foo(str: @Vector(64, u8)) @Vector(8, u64) {
const V = @Vector(64, u8);
const a: u64 = @bitCast(str == @as(V, @splat(16)));
const b: u64 = @bitCast(str == @as(V, @splat(17)));
const c: u64 = @bitCast(str == @as(V, @splat(18)));
const d: u64 = @bitCast(str == @as(V, @splat(19)));
const e: u64 = @bitCast(str == @as(V, @splat(20)));
const f: u64 = @bitCast(str == @as(V, @splat(21)));
const g: u64 = @bitCast(str == @as(V, @splat(22)));
const h: u64 = @bitCast(str == @as(V, @splat(23)));
return .{ a, b, c, d, e, f, g, h };
}
```
The emitted assembly for `foo` gets a bit unwieldy (Zen 4):
```asm
.LCPI0_0:
.zero 64,16
.LCPI0_1:
.zero 64,17
.LCPI0_2:
.zero 64,18
.LCPI0_3:
.zero 64,19
.LCPI0_4:
.zero 64,20
.LCPI0_5:
.zero 64,21
.LCPI0_6:
.zero 64,22
.LCPI0_7:
.zero 64,23
foo:
vpcmpeqb k0, zmm0, zmmword ptr [rip + .LCPI0_0]
vpcmpeqb k1, zmm0, zmmword ptr [rip + .LCPI0_7]
kmovq rax, k0
vpcmpeqb k0, zmm0, zmmword ptr [rip + .LCPI0_1]
kmovq r10, k1
vmovq xmm3, rax
kmovq rcx, k0
vpcmpeqb k0, zmm0, zmmword ptr [rip + .LCPI0_2]
kmovq rdx, k0
vpcmpeqb k0, zmm0, zmmword ptr [rip + .LCPI0_3]
kmovq rsi, k0
vpcmpeqb k0, zmm0, zmmword ptr [rip + .LCPI0_4]
kmovq rdi, k0
vpcmpeqb k0, zmm0, zmmword ptr [rip + .LCPI0_5]
vmovq xmm2, rdi
kmovq r8, k0
vpcmpeqb k0, zmm0, zmmword ptr [rip + .LCPI0_6]
vmovq xmm0, r10
kmovq r9, k0
vmovq xmm1, r9
vpunpcklqdq xmm0, xmm1, xmm0
vmovq xmm1, r8
vpunpcklqdq xmm1, xmm2, xmm1
vmovq xmm2, rdx
vinserti128 ymm0, ymm1, xmm0, 1
vmovq xmm1, rsi
vpunpcklqdq xmm1, xmm2, xmm1
vmovq xmm2, rcx
vpunpcklqdq xmm2, xmm3, xmm2
vinserti128 ymm1, ymm2, xmm1, 1
vinserti64x4 zmm0, zmm1, ymm0, 1
ret
```
The problem is that k-registers have to be moved to general-purpose registers before they can be moved to vector registers, and the values then have to be combined within the vector registers.
It looks like we can reduce the instruction count by doing a round trip through memory instead:
```zig
export fn bar(str: @Vector(64, u8), dest: *[8]u64) @Vector(8, u64) {
const V = @Vector(64, u8);
dest[0] = @bitCast(str == @as(V, @splat(16)));
dest[1] = @bitCast(str == @as(V, @splat(17)));
dest[2] = @bitCast(str == @as(V, @splat(18)));
dest[3] = @bitCast(str == @as(V, @splat(19)));
dest[4] = @bitCast(str == @as(V, @splat(20)));
dest[5] = @bitCast(str == @as(V, @splat(21)));
dest[6] = @bitCast(str == @as(V, @splat(22)));
dest[7] = @bitCast(str == @as(V, @splat(23)));
return dest.*;
}
```
```asm
.LCPI0_0:
.zero 64,16
.LCPI0_1:
.zero 64,17
.LCPI0_2:
.zero 64,18
.LCPI0_3:
.zero 64,19
.LCPI0_4:
.zero 64,20
.LCPI0_5:
.zero 64,21
.LCPI0_6:
.zero 64,22
.LCPI0_7:
.zero 64,23
bar:
vpcmpeqb k0, zmm0, zmmword ptr [rip + .LCPI0_0]
vpcmpeqb k1, zmm0, zmmword ptr [rip + .LCPI0_7]
kmovq qword ptr [rdi], k0
vpcmpeqb k0, zmm0, zmmword ptr [rip + .LCPI0_1]
kmovq qword ptr [rdi + 8], k0
vpcmpeqb k0, zmm0, zmmword ptr [rip + .LCPI0_2]
kmovq qword ptr [rdi + 16], k0
vpcmpeqb k0, zmm0, zmmword ptr [rip + .LCPI0_3]
kmovq qword ptr [rdi + 24], k0
vpcmpeqb k0, zmm0, zmmword ptr [rip + .LCPI0_4]
kmovq qword ptr [rdi + 32], k0
vpcmpeqb k0, zmm0, zmmword ptr [rip + .LCPI0_5]
kmovq qword ptr [rdi + 40], k0
vpcmpeqb k0, zmm0, zmmword ptr [rip + .LCPI0_6]
kmovq qword ptr [rdi + 48], k0
kmovq qword ptr [rdi + 56], k1
vmovups zmm0, zmmword ptr [rdi]
ret
```
This is definitely a win in terms of instruction count (18 instructions for `bar` versus 32 for `foo`, not counting the constant pool), but is it actually faster?
I don't know, but it's something to consider.
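
For anyone who wants to benchmark the two versions, it may help to first confirm that they compute the same result. The sketch below is not part of the original report: `scalarReference` is a hypothetical helper, `foo` and `bar` are copied from above without `export`, and the bit order assumes a little-endian target where vector element 0 maps to the least significant bit of the mask.
```zig
const std = @import("std");

const V = @Vector(64, u8);

// Copied from the report (minus `export`): masks combined in registers.
fn foo(str: V) @Vector(8, u64) {
    const a: u64 = @bitCast(str == @as(V, @splat(16)));
    const b: u64 = @bitCast(str == @as(V, @splat(17)));
    const c: u64 = @bitCast(str == @as(V, @splat(18)));
    const d: u64 = @bitCast(str == @as(V, @splat(19)));
    const e: u64 = @bitCast(str == @as(V, @splat(20)));
    const f: u64 = @bitCast(str == @as(V, @splat(21)));
    const g: u64 = @bitCast(str == @as(V, @splat(22)));
    const h: u64 = @bitCast(str == @as(V, @splat(23)));
    return .{ a, b, c, d, e, f, g, h };
}

// Copied from the report (minus `export`): masks written to memory first.
fn bar(str: V, dest: *[8]u64) @Vector(8, u64) {
    dest[0] = @bitCast(str == @as(V, @splat(16)));
    dest[1] = @bitCast(str == @as(V, @splat(17)));
    dest[2] = @bitCast(str == @as(V, @splat(18)));
    dest[3] = @bitCast(str == @as(V, @splat(19)));
    dest[4] = @bitCast(str == @as(V, @splat(20)));
    dest[5] = @bitCast(str == @as(V, @splat(21)));
    dest[6] = @bitCast(str == @as(V, @splat(22)));
    dest[7] = @bitCast(str == @as(V, @splat(23)));
    return dest.*;
}

// Hypothetical scalar reference: bit i of lane j is set when str[i] == 16 + j.
// Assumes element 0 of the compare mask is the least significant bit.
fn scalarReference(str: [64]u8) [8]u64 {
    var out = [_]u64{0} ** 8;
    for (str, 0..) |byte, i| {
        if (byte >= 16 and byte <= 23) {
            out[byte - 16] |= @as(u64, 1) << @intCast(i);
        }
    }
    return out;
}

test "foo, bar and the scalar reference agree" {
    // Bytes 0..31 repeated twice, so each of 16..23 appears at i and i + 32.
    var bytes: [64]u8 = undefined;
    for (&bytes, 0..) |*b, i| b.* = @intCast(i % 32);

    const expected = scalarReference(bytes);
    var scratch: [8]u64 = undefined;
    const from_foo: [8]u64 = foo(bytes);
    const from_bar: [8]u64 = bar(bytes, &scratch);

    // Value 16 sits at indices 16 and 48, so lane 0 has exactly those bits set.
    try std.testing.expectEqual(@as(u64, (1 << 16) | (1 << 48)), expected[0]);
    try std.testing.expectEqualSlices(u64, &expected, &from_foo);
    try std.testing.expectEqualSlices(u64, &expected, &from_bar);
    try std.testing.expectEqualSlices(u64, &expected, &scratch);
}
```
This should run with `zig test` on a recent Zig release (the `for (x, 0..)` syntax needs 0.11 or later). It only checks equivalence and says nothing about which version is faster.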
</pre>