<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href="https://github.com/llvm/llvm-project/issues/89600">89600</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
vpand deferral eliminates earlier vpand elimination
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
Validark
</td>
</tr>
</table>
<pre>
In my code, I have the following function:
```zig
const std = @import("std");

export fn expand8xu8To16xu4AsByteVector(vec: @Vector(8, u8)) @Vector(16, u8) {
return std.simd.interlace(.{ vec, vec >> @splat(4) }) & @as(@Vector(16, u8), @splat(0xF));
}
```
Here is the Zen 3 assembly:
```asm
.LCPI0_0:
.zero 16,15
expand8xu8To16xu4AsByteVector:
vpsrlw xmm1, xmm0, 4
vpunpcklbw xmm0, xmm0, xmm1
vpand xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
ret
```
This implementation avoids the extraneous `vpand` that LLVM unfortunately inserts for an implementation like:
```zig
const std = @import("std");

export fn expand8xu8To16xu4AsByteVectorBad(vec: @Vector(8, u8)) @Vector(16, u8) {
return std.simd.interlace(.{ vec & @as(@Vector(8, u8), @splat(0xF)), vec >> @splat(4) });
}
```
```asm
.LCPI0_0:
.byte 15
.byte 15
.byte 15
.byte 15
.byte 15
.byte 15
.byte 15
.byte 15
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
.LCPI0_1:
.zero 16,15
expand8xu8To16xu4AsByteVectorBad:
vpand xmm1, xmm0, xmmword ptr [rip + .LCPI0_0]
vpsrlw xmm0, xmm0, 4
vpand xmm0, xmm0, xmmword ptr [rip + .LCPI0_1]
vpunpcklbw xmm0, xmm1, xmm0
ret
```
The optimization in my first version is helpful because x86 does not have a `vpsrlb` instruction, only `vpsrlw`. As a result, the lower byte in each 2-byte pair has 4 bits shifted in from the upper byte, requiring a `vpand` to zero out the bits we don't want. But in this case, because we have to `vpand` anyway to get the lowest 4 bits of the other vector we are interleaving, we may as well interleave first, and then isolate the lowest 4 bits of all the bytes simultaneously with a single `vpand`.
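To make the bit-leakage argument concrete, here is a scalar model (my own sketch, not the reported code) showing that one trailing `vpand` both cleans up the bits `vpsrlw` shifts across byte boundaries and isolates the low nibbles:

```python
# Scalar model: a single vpand after the interleave reproduces the
# per-byte semantics of interlace(.{ vec, vec >> 4 }) & 0xF, even
# though vpsrlw leaks bits across byte boundaries within 16-bit lanes.

def vpsrlw4(data):
    """Model `vpsrlw xmm, xmm, 4`: shift each 16-bit lane
    (little-endian byte pairs) right by 4."""
    out = []
    for i in range(0, len(data), 2):
        lane = data[i] | (data[i + 1] << 8)
        lane >>= 4
        out += [lane & 0xFF, (lane >> 8) & 0xFF]
    return out

def vpunpcklbw(a, b):
    """Model `vpunpcklbw`: interleave the low 8 bytes of each input."""
    out = []
    for x, y in zip(a[:8], b[:8]):
        out += [x, y]
    return out

vec = [0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0]

# What the Zig source asks for: the pair [v & 0xF, v >> 4] per byte.
semantic = []
for v in vec:
    semantic += [v & 0xF, v >> 4]

# What the good codegen does: vpsrlw, vpunpcklbw, then one vpand.
machine = [x & 0xF for x in vpunpcklbw(vec, vpsrlw4(vec))]

assert machine == semantic
```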
Next, I define this function:
```zig
const std = @import("std");

fn foo(vec: @Vector(16, u8)) [2]@Vector(16, u8) {
const vec2 = vec + vec;
const vec3 = vec2 | @as(@Vector(16, u8), @splat(1));
return @bitCast(std.simd.interlace(.{ vec2, vec3 }));
}
```
Zen 3 emit:
```asm
.LCPI1_0:
.zero 16,1
foo:
vpaddb xmm0, xmm0, xmm0; equivalent to multiply by 2 or shift left by 1
vpor xmm1, xmm0, xmmword ptr [rip + .LCPI1_0]
vpunpckhbw xmm2, xmm0, xmm1
vpunpcklbw xmm0, xmm0, xmm1
```
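As a scalar sketch (my own model, assuming wrapping u8 arithmetic), `foo` maps each input byte `b` to the adjacent output pair `(2*b mod 256, (2*b mod 256) | 1)`:

```python
# Scalar model of foo (my sketch): each input byte b becomes the
# adjacent output pair (2*b mod 256, (2*b mod 256) | 1).
def foo_model(vec):
    out = []
    for b in vec:
        d = (b + b) & 0xFF   # vpaddb xmm0, xmm0, xmm0 (wrapping double)
        out += [d, d | 1]    # vpor with splat(1), then byte interleave
    return out

assert foo_model([0x12, 0x80, 0xFF]) == [0x24, 0x25, 0x00, 0x01, 0xFE, 0xFF]
```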
The problem comes when I compose the two aforementioned functions:
```zig
export fn baz(x: u64) [2]@Vector(16, u8) {
return foo(expand8xu8To16xu4AsByteVector(@bitCast(x)));
}
```
```asm
.LCPI2_0:
.zero 16,15
.LCPI2_1:
.zero 16,30
.LCPI2_2:
.zero 16,1
baz:
vmovq xmm0, rdi
vpsrlw xmm1, xmm0, 4
vpand xmm1, xmm1, xmmword ptr [rip + .LCPI2_0]; Unnecessary! We wanted to avoid this instruction!
vpunpcklbw xmm0, xmm0, xmm1
vpaddb xmm0, xmm0, xmm0
vpand xmm0, xmm0, xmmword ptr [rip + .LCPI2_1]; LLVM tries to be cool here and use `0xF << 1` after `xmm0 + xmm0`
vpor xmm1, xmm0, xmmword ptr [rip + .LCPI2_2]
vpunpckhbw xmm2, xmm0, xmm1
vpunpcklbw xmm0, xmm0, xmm1
```
Expected emit:
```asm
.LCPI2_1:
.zero 16,30
.LCPI2_2:
.zero 16,1
baz:
vmovq xmm0, rdi
vpsrlw xmm1, xmm0, 4
vpunpcklbw xmm0, xmm0, xmm1
vpaddb xmm0, xmm0, xmm0
vpand xmm0, xmm0, xmmword ptr [rip + .LCPI2_1]
vpor xmm1, xmm0, xmmword ptr [rip + .LCPI2_2]
vpunpckhbw xmm2, xmm0, xmm1
vpunpcklbw xmm0, xmm0, xmm1
```
Alternatively, a less cool but straightforward concatenation of the implementations of `expand8xu8To16xu4AsByteVector` and `foo`, where we `vpand` with `0xF` before the `vpaddb` rather than with `0xF << 1` after it:
```asm
.LCPI2_1:
.zero 16,15
.LCPI2_2:
.zero 16,1
baz:
vmovq xmm0, rdi
vpsrlw xmm1, xmm0, 4
vpunpcklbw xmm0, xmm0, xmm1
vpand xmm0, xmm0, xmmword ptr [rip + .LCPI2_1]
vpaddb xmm0, xmm0, xmm0
vpor xmm1, xmm0, xmmword ptr [rip + .LCPI2_2]
vpunpckhbw xmm2, xmm0, xmm1
vpunpcklbw xmm0, xmm0, xmm1
```
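For what it's worth, the `0xF << 1` trick LLVM applies in `baz` is value-preserving, so either expected emit would be correct. An exhaustive check over all byte values (my own sketch):

```python
# Exhaustive check over all byte values that
#   (x & 0xF) + (x & 0xF)        (vpand with 0xF before the vpaddb)
# equals
#   ((x + x) & 0xFF) & 0x1E      (vpand with 0xF << 1 after the vpaddb)
for x in range(256):
    assert (x & 0xF) + (x & 0xF) == ((x + x) & 0xFF) & 0x1E
```

Shifting the mask past the doubling is safe because `(x & 0xF) * 2` never exceeds `0x1E`, so no wraparound bits are lost.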
</pre>