<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/107700>107700</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[Aarch64] `bitcast i16 to <16 x i1>` + `sext <16 x i1> to <16 x i8>` should not go bit-by-bit
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
Validark
</td>
</tr>
</table>
<pre>
I can get LLVM to do a `bitcast i16 to <16 x i1>` + `sext <16 x i1> to <16 x i8>` like so:
```zig
export fn unmovemask16(x: u16) @Vector(16, u8) {
return @select(u8, @as(@Vector(16, u1), @bitCast(x)) == @as(@Vector(16, u1), @splat(1)),
@as(@Vector(16, u8), @splat(0xff)),
@as(@Vector(16, u8), @splat(0)));
}
export fn unmovemask64(x: u64) @Vector(64, u8) {
return @select(u8, @as(@Vector(64, u1), @bitCast(x)) == @as(@Vector(64, u1), @splat(1)),
@as(@Vector(64, u8), @splat(0xff)),
@as(@Vector(64, u8), @splat(0)));
}
```
Unfortunately, LLVM doesn't currently do anything smart for this. Here is the 16-bit version (compiled for Apple M2):
```asm
unmovemask_normal: // @unmovemask_normal
sub sp, sp, #16
sbfx w8, w0, #1, #1
sbfx w9, w0, #0, #1
fmov s0, w9
mov v0.b[1], w8
sbfx w8, w0, #2, #1
mov v0.b[2], w8
sbfx w8, w0, #3, #1
mov v0.b[3], w8
sbfx w8, w0, #4, #1
mov v0.b[4], w8
sbfx w8, w0, #5, #1
mov v0.b[5], w8
sbfx w8, w0, #6, #1
mov v0.b[6], w8
sbfx w8, w0, #7, #1
mov v0.b[7], w8
sbfx w8, w0, #8, #1
mov v0.b[8], w8
sbfx w8, w0, #9, #1
mov v0.b[9], w8
sbfx w8, w0, #10, #1
mov v0.b[10], w8
sbfx w8, w0, #11, #1
mov v0.b[11], w8
sbfx w8, w0, #12, #1
mov v0.b[12], w8
sbfx w8, w0, #13, #1
mov v0.b[13], w8
sbfx w8, w0, #14, #1
mov v0.b[14], w8
sbfx w8, w0, #15, #1
mov v0.b[15], w8
add sp, sp, #16
ret
```
Here is the output for the 16-bit and 64-bit versions compiled for x86-64 Westmere:
```asm
.LCPI0_0:
.byte 0
.byte 0
.byte 0
.byte 0
.byte 0
.byte 0
.byte 0
.byte 0
.byte 1
.byte 1
.byte 1
.byte 1
.byte 1
.byte 1
.byte 1
.byte 1
.LCPI0_1:
.byte 1
.byte 2
.byte 4
.byte 8
.byte 16
.byte 32
.byte 64
.byte 128
.byte 1
.byte 2
.byte 4
.byte 8
.byte 16
.byte 32
.byte 64
.byte 128
unmovemask16:
movd xmm0, edi
pshufb xmm0, xmmword ptr [rip + .LCPI0_0]
movdqa xmm1, xmmword ptr [rip + .LCPI0_1]
pand xmm0, xmm1
pcmpeqb xmm0, xmm1
ret
.LCPI1_0:
.byte 1
.byte 2
.byte 4
.byte 8
.byte 16
.byte 32
.byte 64
.byte 128
.byte 1
.byte 2
.byte 4
.byte 8
.byte 16
.byte 32
.byte 64
.byte 128
unmovemask64:
movq xmm3, rdi
punpcklbw xmm3, xmm3
pshuflw xmm0, xmm3, 80
pshufd xmm0, xmm0, 80
movdqa xmm4, xmmword ptr [rip + .LCPI1_0]
pand xmm0, xmm4
pcmpeqb xmm0, xmm4
pshuflw xmm1, xmm3, 250
pshufd xmm1, xmm1, 80
pand xmm1, xmm4
pcmpeqb xmm1, xmm4
pshufhw xmm2, xmm3, 80
pshufd xmm2, xmm2, 250
pand xmm2, xmm4
pcmpeqb xmm2, xmm4
pshufhw xmm3, xmm3, 250
pshufd xmm3, xmm3, 250
pand xmm3, xmm4
pcmpeqb xmm3, xmm4
ret
```
Here we have two strategies for accomplishing this task for 16-byte vectors. In the first, we use a `pshufb`, which in ARM-land could be replaced by `tbl`, to broadcast the first byte to the first 8 bytes, and the second byte to the second 8 bytes. Then we load up a mask, and `pand`+`pcmpeqb` to turn each bit into a byte. On ARM we have equivalents of both of those instructions, but we also have `cmtst`, which can do both of those steps in one.
The second strategy applies `punpcklbw` to the 8-byte bitstring, interleaving it with itself, which is equivalent to `zip1` on ARM. Then it uses `pshuflw xmm, xmm, 80`+`pshufd xmm0, xmm0, 80`. I don't think we have support for `tbl`-with-constant on ARM, so I think we have to just use the first strategy.
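To make the first strategy concrete, here is a rough Zig model of it written with generic vector builtins. This is only a sketch under stated assumptions: `unmovemask16Model` is a made-up name, `@shuffle` stands in for `pshufb`/`tbl`, and the `&` plus `!=` pair stands in for `pand`+`pcmpeqb` (or a single `cmtst`). Whether LLVM lowers this form to `tbl`+`cmtst` on AArch64 is not verified here.

```zig
const std = @import("std");

// Hypothetical model of strategy one for a 16-bit mask (the name is illustrative).
fn unmovemask16Model(x: u16) @Vector(16, u8) {
    // The two mask bytes, with the low byte in lane 0.
    const lo: u8 = @truncate(x);
    const hi: u8 = @truncate(x >> 8);
    const src = @Vector(2, u8){ lo, hi };

    // Step 1 (pshufb / tbl): broadcast byte 0 to lanes 0..7 and byte 1 to lanes 8..15.
    const idx = @Vector(16, i32){ 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1 };
    const bytes = @shuffle(u8, src, undefined, idx);

    // Step 2 (pand + pcmpeqb, or cmtst): each lane tests its own bit of the broadcast byte.
    const bit = @Vector(16, u8){ 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128 };
    const zero: @Vector(16, u8) = @splat(0);
    const ones: @Vector(16, u8) = @splat(0xff);
    return @select(u8, (bytes & bit) != zero, ones, zero);
}

test "bit 0 and bit 15 map to lanes 0 and 15" {
    const got: [16]u8 = unmovemask16Model(0x8001);
    var want = [_]u8{0} ** 16;
    want[0] = 0xff;
    want[15] = 0xff;
    try std.testing.expectEqualSlices(u8, &want, &got);
}
```

The bit-mask vector repeats `1 2 4 ... 128` twice because each half of the shuffled vector holds eight copies of one source byte, so lane `i` only needs to test bit `i % 8` of its byte.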
In straight-line (non-loop) code, to produce the second, third, and fourth vectors, we might want to reuse the same index vector (`0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1`) as the `tbl` input, so we could shift the scalar input right by 16, 32, and 48 bits (so that the bytes of interest land in bytes 0 and 1) and move each result into a vector register separately. Alternatively, one could use `ext` once the data is already in a vector register.
In a loop, however, it might make more sense to have a vector that holds each of the following:
```
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7
```
That way, we can `tbl` with each of them in parallel, and run 4 `cmtst`s in parallel afterwards.
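As a rough sketch of the four-index-vector idea, the Zig model below shuffles the 8 mask bytes once per index vector and then applies the same bit test to each result. The names `idx0`..`idx3`, `expandChunk`, and `unmovemask64Model` are invented for illustration, `@shuffle` is only a stand-in for `tbl`, and nothing here claims to show what LLVM emits. The source bytes are built with shifts so the model does not depend on host endianness.

```zig
const std = @import("std");

// The four index vectors listed above (names are illustrative).
const idx0 = @Vector(16, i32){ 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1 };
const idx1 = @Vector(16, i32){ 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3 };
const idx2 = @Vector(16, i32){ 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5 };
const idx3 = @Vector(16, i32){ 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7 };

// One tbl-like shuffle followed by one cmtst-like bit test.
fn expandChunk(src: @Vector(8, u8), comptime idx: @Vector(16, i32)) @Vector(16, u8) {
    const bit = @Vector(16, u8){ 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128 };
    const zero: @Vector(16, u8) = @splat(0);
    const ones: @Vector(16, u8) = @splat(0xff);
    const bytes = @shuffle(u8, src, undefined, idx);
    return @select(u8, (bytes & bit) != zero, ones, zero);
}

// Hypothetical 64-bit model: four independent shuffles, then four independent tests.
fn unmovemask64Model(x: u64) [4]@Vector(16, u8) {
    // Byte i of x goes to lane i, regardless of host endianness.
    var bytes: [8]u8 = undefined;
    for (&bytes, 0..) |*b, i| {
        b.* = @truncate(x >> @as(u6, @intCast(8 * i)));
    }
    const src: @Vector(8, u8) = bytes;
    return [4]@Vector(16, u8){
        expandChunk(src, idx0),
        expandChunk(src, idx1),
        expandChunk(src, idx2),
        expandChunk(src, idx3),
    };
}

test "lowest and highest bits land in the first and last lanes" {
    const out = unmovemask64Model(0x8000_0000_0000_0001);
    const first: [16]u8 = out[0];
    const last: [16]u8 = out[3];
    try std.testing.expectEqual(@as(u8, 0xff), first[0]);
    try std.testing.expectEqual(@as(u8, 0), first[1]);
    try std.testing.expectEqual(@as(u8, 0xff), last[15]);
}
```

Because the four `expandChunk` calls share one source vector and do not depend on each other, they model the "4 `tbl`s, then 4 `cmtst`s in parallel" shape described above.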
</pre>
EuDOttppmdwD0646NLJi7-R-Dn--KgyEUv6IV6otbtDTzG1Hc-XSrJR2KySNX3G_8uyAa0tmoVSiMG2h-3pDWCkh1NZVx43jW_vx3gp5pxCGLRq21imGYZsNW-uUwLDNh23k6-CbS3PL9p0G3Q5zkokv9F72bmHBraEQKA6rrhsJcX-zMn0vYCuLest0YaZPxTIqFtGCPeGSzsPZnM5jmjyVS1zEbBXlNAvyVZpRmuBqvgiKImNZHs5j9sSXYRDGwSKY02iWzJJphOmCLXKWJXkW5XFB4gArxsVUiE01VXr9xI1pcEmD-TwIngTLUBj_20UYStyC7yVhSGZvT3rpBk2yZm1IHAhurDnBWG4FLt2T1jPTeZnEZPb2-3_QMKVXvlQW1srtwJNsP8m4fWq0WJbW1sbJzd-Nrrktm2yaq4qE7y7M7mVSa_WXvyJ_98kZEr532W-W4f8CAAD__0IC9-g">