<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/91302>91302</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
SIMD instructions that write directly to `k` registers on AVX512-enabled machines are slower than the equivalent AVX/AVX2 routines
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
Validark
</td>
</tr>
</table>
<pre>
I wrote some test code:
```zig
const Chunk = @Vector(32, u8);
export fn foo(vec1: Chunk, vec2: Chunk, vec3: Chunk, vec4: Chunk) Chunk {
const true_vec = @as(Chunk, @splat(0xFF));
const false_vec = @as(Chunk, @splat(0));
return @select(u8, vec1 == vec2, true_vec, false_vec) |
@select(u8, vec3 == vec4, true_vec, false_vec);
}
```
```llvm
define dso_local <32 x i8> @foo(<32 x i8> %0, <32 x i8> %1, <32 x i8> %2, <32 x i8> %3) local_unnamed_addr {
Entry:
%4 = icmp eq <32 x i8> %0, %1
%5 = icmp eq <32 x i8> %2, %3
%6 = or <32 x i1> %5, %4
%7 = sext <32 x i1> %6 to <32 x i8>
ret <32 x i8> %7
}
declare void @llvm.dbg.value(metadata, metadata, metadata)
```
For Zen 3, the following assembly is emitted:
```asm
foo:
vpcmpeqb ymm0, ymm0, ymm1
vpcmpeqb ymm2, ymm2, ymm3
vpor ymm0, ymm2, ymm0
ret
```
For Zen 4, the following assembly is emitted:
```asm
foo:
vpcmpeqb k0, ymm0, ymm1
vpcmpeqb k1, ymm2, ymm3
kord k0, k1, k0
vpmovm2b ymm0, k0
ret
```
[Godbolt link](https://zig.godbolt.org/z/dKGo9n9KW)
On the one hand, I can see how the LLVM IR maps trivially onto the AVX512 way of doing things. On the other hand, I wonder whether this is a step backwards in terms of performance. On Zen 4, [according to uops.info](https://uops.info/table.html?search=vpcmpeqb&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_ADLP=on&cb_ZEN4=on&cb_measurements=on&cb_doc=on&cb_base=on&cb_aes=on&cb_avx=on&cb_avx2=on&cb_avx512=on), [VPCMPEQB (YMM, YMM, YMM)](https://uops.info/html-instr/VPCMPEQB_YMM_YMM_YMM.html) has a latency of 1 cycle and a throughput of 0.25 cycles per instruction, whereas [VPCMPEQB_EVEX (K, YMM, YMM)](https://uops.info/html-instr/VPCMPEQB_EVEX_K_YMM_YMM.html) has a latency of 4 cycles and a throughput of 0.5 cycles per instruction. The `vpor` and `kord` instructions both have a latency of 1 cycle, and [VPMOVM2B (YMM, K)](https://uops.info/html-instr/VPMOVM2B_YMM_K.html) also has a latency of 1 cycle on Zen 4.
That means that for the Zen 3 code, both `vpcmpeqb` instructions can execute in parallel in 1 cycle, and the `vpor` runs in the next cycle, for 2 cycles in total. For the Zen 4 code, both `vpcmpeqb` instructions can also execute in parallel, but they take 4 cycles, and the remaining 2 instructions take 1 cycle each, for a total of 6 cycles. So the AVX/AVX2 code should be ~3 times faster, based on an analysis that considers only latency and throughput. The AVX512 code is also more constrained in terms of port usage.
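The cycle counts above can be reproduced with a small critical-path calculation. This is only a sketch: the `Instr` struct, `criticalPath`, and the two instruction tables are hypothetical names made up for illustration, the latencies are the uops.info figures quoted above, and the model assumes unlimited execution ports and ignores throughput entirely. It also assumes a Zig version with multi-object `for` loops (0.11 or later).

```zig
const std = @import("std");

// One instruction in a straight-line dependency graph (hypothetical model).
const Instr = struct {
    name: []const u8,
    latency: u32, // cycles, taken from the uops.info figures quoted above
    deps: []const usize = &.{}, // indices of earlier instructions it waits on
};

// Longest latency chain through the sequence, assuming unlimited ports.
fn criticalPath(instrs: []const Instr) u32 {
    var finish: [8]u32 = undefined; // large enough for these short sequences
    var longest: u32 = 0;
    for (instrs, 0..) |ins, i| {
        var start: u32 = 0;
        for (ins.deps) |d| start = @max(start, finish[d]);
        finish[i] = start + ins.latency;
        longest = @max(longest, finish[i]);
    }
    return longest;
}

// Zen 3 (AVX2) output: two independent compares feeding one vpor.
const avx2_seq = [_]Instr{
    .{ .name = "vpcmpeqb ymm0, ymm0, ymm1", .latency = 1 },
    .{ .name = "vpcmpeqb ymm2, ymm2, ymm3", .latency = 1 },
    .{ .name = "vpor ymm0, ymm2, ymm0", .latency = 1, .deps = &.{ 0, 1 } },
};

// Zen 4 (AVX512) output: compares into k registers, kord, then vpmovm2b.
const avx512_seq = [_]Instr{
    .{ .name = "vpcmpeqb k0, ymm0, ymm1", .latency = 4 },
    .{ .name = "vpcmpeqb k1, ymm2, ymm3", .latency = 4 },
    .{ .name = "kord k0, k1, k0", .latency = 1, .deps = &.{ 0, 1 } },
    .{ .name = "vpmovm2b ymm0, k0", .latency = 1, .deps = &.{2} },
};

pub fn main() void {
    std.debug.print("AVX2: {d} cycles, AVX512: {d} cycles\n", .{
        criticalPath(&avx2_seq),
        criticalPath(&avx512_seq),
    });
}
```

Under these assumptions the AVX2 sequence comes out at 2 cycles and the AVX512 sequence at 6. A real scheduler model (for example, one that accounts for port pressure) could shift the numbers, so this should be read as a lower bound on latency rather than a measurement.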
</pre>