<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/66159>66159</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
Missed optimization on aarch64: decompose 64 byte vector into 4 16 byte vectors automatically
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
Validark
</td>
</tr>
</table>
<pre>
I wrote this code for [simdjzon](https://github.com/travisstaloch/simdjzon/):
```zig
fn is_ascii(input: u8x64) bool {
    return 0 == @as(u64, @bitCast(input >= @as(u8x64, @splat(0x80))));
}
```
For those who can't read Zig, I will explain what it is doing piece by piece. First, we create a vector of 64 bytes in which every byte is `0x80`, i.e. `128`. Then we compare each byte of the 64-byte input vector against it, checking whether that byte is greater than or equal to `0x80`. The comparison produces a 64-bit bitmap with a 1 in every position where the input byte was greater than or equal to `0x80`. We return true if that bitmap is 0; otherwise we return false.
Here is my attempt at an ASCII diagram depicting this, albeit just for an 8-byte vector:
```
┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐
│  0x61  │  0x62  │  0x63  │  0xF0  │  0x9F  │  0x98  │  0x8E  │  0x64  │
├────────┼────────┼────────┼────────┼────────┼────────┼────────┼────────┤
│ >=0x80 │ >=0x80 │ >=0x80 │ >=0x80 │ >=0x80 │ >=0x80 │ >=0x80 │ >=0x80 │
├──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┤
│   0    │   0    │   0    │   1    │   1    │   1    │   1    │   0    │
└───┬────┴───┬────┴───┬────┴───┬────┴───┬────┴───┬────┴───┬────┴───┬────┘
    │┌───────┘        │        │        │        │        │        │
    ││┌───────────────┘        │        │        │        │        │
    │││┌───────────────────────┘        │        │        │        │
    ││││┌───────────────────────────────┘        │        │        │
    │││││┌───────────────────────────────────────┘        │        │
    ││││││┌───────────────────────────────────────────────┘        │
    │││││││┌───────────────────────────────────────────────────────┘
    00011110 == 0
```
In effect, if ANY byte is `0x80` or above, we get false. If all bytes are less than `0x80`, we get true.
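For reference, the same check can be sketched as a plain scalar loop. This is only an illustration: `is_ascii_scalar` is a hypothetical name, `u8x64` is assumed to be an alias for `@Vector(64, u8)`, and the one-liner is repeated so the block compiles on its own.

```zig
const std = @import("std");

// Assumed definition of the alias used throughout this issue.
const u8x64 = @Vector(64, u8);

// The one-liner from above, repeated so this block is self-contained.
fn is_ascii(input: u8x64) bool {
    return 0 == @as(u64, @bitCast(input >= @as(u8x64, @splat(0x80))));
}

// Hypothetical scalar reference: reject as soon as any byte is 0x80 or above.
fn is_ascii_scalar(input: [64]u8) bool {
    for (input) |byte| {
        if (byte >= 0x80) return false;
    }
    return true;
}

test "scalar reference agrees with the vector one-liner" {
    var bytes = [_]u8{'a'} ** 64;
    try std.testing.expect(is_ascii(bytes) and is_ascii_scalar(bytes));

    bytes[37] = 0xF0; // a single non-ASCII byte anywhere flips the result
    try std.testing.expect(!is_ascii(bytes) and !is_ascii_scalar(bytes));
}
```

The scalar loop is what the vector code is supposed to be a faster spelling of, which makes it a handy oracle when comparing hand-decomposed variants.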
On Zen 4, we get this output: :+1:
```asm
vmovdqa64 zmm3, zmmword ptr [rbx]
vpmovb2m k0, zmm3
kortestq k0, k0
jne .LBB4_811
```
`vpmovb2m` does exactly what we are looking for. It grabs the upper bit from each byte in our 64 byte vector and produces a bitmask which we test with `kortestq`.
On Zen 3, we get this output: :+1:
```asm
vmovdqa ymm4, ymmword ptr [rax]
vmovdqa ymm3, ymmword ptr [rax + 32]
vpor ymm5, ymm4, ymm3
vpmovmskb eax, ymm5
test eax, eax
jne .LBB4_938
```
Zen 3 does not have 64-byte vectors, so we read in two 32-byte vectors and OR them together element-wise first. This is a great optimization because all we care about is whether ANY of the highest bits in each byte were set; we don't care which one. `vpmovmskb`+`test` are basically the 32-byte equivalent of `vpmovb2m`+`kortestq`.
In Zig code, the Zen 3 behavior could be asked for explicitly, via:
```zig
fn is_ascii(input: u8x64) bool {
    const bytes: [64]u8 = input;
    const a: u8x32 = bytes[0..32].*;
    const b: u8x32 = bytes[32..64].*;
    const non_ascii_mask: u8x32 = @splat(0x80);
    const mask: u32 = @bitCast((a | b) >= non_ascii_mask);
    return mask == 0;
}
```
Unfortunately, the aarch64 backend cannot do a similar transformation automatically. When compiling the one-line implementation, it produces this monstrosity: :-1:
```asm
cmlt v7.16b, v4.16b, #0
umov w9, v7.b[1]
umov w10, v7.b[0]
and w10, w10, #0x1
bfi w10, w9, #1, #1
umov w9, v7.b[2]
bfi w10, w9, #2, #1
umov w9, v7.b[3]
bfi w10, w9, #3, #1
umov w9, v7.b[4]
bfi w10, w9, #4, #1
umov w9, v7.b[5]
bfi w10, w9, #5, #1
umov w9, v7.b[6]
and w9, w9, #0x1
orr w9, w10, w9, lsl #6
umov w10, v7.b[7]
and w10, w10, #0x1
umov w12, v7.b[8]
orr w9, w9, w10, lsl #7
and w10, w12, #0x1
orr w9, w9, w10, lsl #8
ldp q19, q17, [x8, #32]
cmlt v20.16b, v17.16b, #0
umov w8, v20.b[15]
umov w10, v7.b[15]
umov w12, v7.b[9]
orr w8, w10, w8
and w10, w12, #0x1
orr w9, w9, w10, lsl #9
umov w10, v7.b[10]
and w10, w10, #0x1
orr w9, w9, w10, lsl #10
umov w10, v7.b[11]
and w10, w10, #0x1
umov w12, v7.b[12]
orr w9, w9, w10, lsl #11
and w10, w12, #0x1
orr w9, w9, w10, lsl #12
umov w10, v7.b[13]
and w10, w10, #0x1
orr w9, w9, w10, lsl #13
umov w10, v7.b[14]
and w10, w10, #0x1
orr w9, w9, w10, lsl #14
orr w8, w9, w8, lsl #15
umov w9, v20.b[1]
umov w10, v20.b[0]
and w10, w10, #0x1
bfi w10, w9, #1, #1
umov w9, v20.b[2]
bfi w10, w9, #2, #1
umov w9, v20.b[3]
bfi w10, w9, #3, #1
umov w9, v20.b[4]
bfi w10, w9, #4, #1
umov w9, v20.b[5]
bfi w10, w9, #5, #1
umov w9, v20.b[6]
and w9, w9, #0x1
orr w9, w10, w9, lsl #6
umov w10, v20.b[7]
and w10, w10, #0x1
orr w9, w9, w10, lsl #7
umov w10, v20.b[8]
and w10, w10, #0x1
orr w9, w9, w10, lsl #8
umov w10, v20.b[9]
and w10, w10, #0x1
orr w9, w9, w10, lsl #9
umov w10, v20.b[10]
and w10, w10, #0x1
orr w9, w9, w10, lsl #10
umov w10, v20.b[11]
and w10, w10, #0x1
orr w9, w9, w10, lsl #11
umov w10, v20.b[12]
and w10, w10, #0x1
orr w9, w9, w10, lsl #12
umov w10, v20.b[13]
and w10, w10, #0x1
orr w9, w9, w10, lsl #13
umov w10, v20.b[14]
and w10, w10, #0x1
orr w9, w9, w10, lsl #14
cmlt v7.16b, v19.16b, #0
umov w10, v7.b[15]
cmlt v20.16b, v6.16b, #0
umov w12, v20.b[15]
orr w8, w8, w9
orr w9, w12, w10
umov w10, v20.b[1]
umov w12, v20.b[0]
and w12, w12, #0x1
bfi w12, w10, #1, #1
umov w10, v20.b[2]
umov w13, v20.b[3]
bfi w12, w10, #2, #1
bfi w12, w13, #3, #1
umov w10, v20.b[4]
umov w13, v20.b[5]
bfi w12, w10, #4, #1
bfi w12, w13, #5, #1
umov w10, v20.b[6]
and w10, w10, #0x1
orr w10, w12, w10, lsl #6
umov w12, v20.b[7]
and w12, w12, #0x1
orr w10, w10, w12, lsl #7
umov w12, v20.b[8]
and w12, w12, #0x1
orr w10, w10, w12, lsl #8
umov w12, v20.b[9]
and w12, w12, #0x1
orr w10, w10, w12, lsl #9
umov w12, v20.b[10]
and w12, w12, #0x1
orr w10, w10, w12, lsl #10
umov w12, v20.b[11]
and w12, w12, #0x1
orr w10, w10, w12, lsl #11
umov w12, v20.b[12]
and w12, w12, #0x1
orr w10, w10, w12, lsl #12
umov w12, v20.b[13]
and w12, w12, #0x1
orr w10, w10, w12, lsl #13
umov w12, v20.b[14]
and w12, w12, #0x1
orr w10, w10, w12, lsl #14
umov w12, v7.b[1]
umov w13, v7.b[0]
orr w9, w10, w9, lsl #15
and w10, w13, #0x1
bfi w10, w12, #1, #1
umov w12, v7.b[2]
bfi w10, w12, #2, #1
umov w12, v7.b[3]
bfi w10, w12, #3, #1
umov w12, v7.b[4]
bfi w10, w12, #4, #1
umov w12, v7.b[5]
bfi w10, w12, #5, #1
umov w12, v7.b[6]
and w12, w12, #0x1
orr w10, w10, w12, lsl #6
umov w12, v7.b[7]
and w12, w12, #0x1
umov w13, v7.b[8]
orr w10, w10, w12, lsl #7
and w12, w13, #0x1
orr w10, w10, w12, lsl #8
umov w12, v7.b[9]
and w12, w12, #0x1
orr w10, w10, w12, lsl #9
umov w12, v7.b[10]
and w12, w12, #0x1
umov w13, v7.b[11]
orr w10, w10, w12, lsl #10
and w12, w13, #0x1
orr w10, w10, w12, lsl #11
umov w12, v7.b[12]
and w12, w12, #0x1
orr w10, w10, w12, lsl #12
umov w12, v7.b[13]
and w12, w12, #0x1
umov w13, v7.b[14]
orr w10, w10, w12, lsl #13
and w12, w13, #0x1
orr w10, w10, w12, lsl #14
orr w9, w9, w10
orr w8, w9, w8
tst w8, #0xffff
```
However, if we explicitly ask the compiler for a similar optimization on aarch64:
```zig
fn is_ascii(input: u8x64) bool {
    const bytes: [64]u8 = input;
    const a: u8x16 = bytes[0..16].*;
    const b: u8x16 = bytes[16..32].*;
    const c: u8x16 = bytes[32..48].*;
    const d: u8x16 = bytes[48..64].*;
    const non_ascii_mask: u8x16 = @splat(0x80);
    const mask: u16 = @bitCast((a | b | c | d) >= non_ascii_mask);
    return mask == 0;
}
```
We get it: :+1:
```asm
orr v7.16b, v19.16b, v17.16b
orr v20.16b, v6.16b, v4.16b
orr v7.16b, v7.16b, v20.16b
cmlt v7.16b, v7.16b, #0
umaxv b7, v7.16b
fmov w8, s7
stp x10, x13, [sp, #160]
str x9, [sp, #152]
tbnz w8, #0, .LBB4_725
```
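This looks like much better output to me. I see the three bitwise ORs. As far as I can tell, `cmlt ... #0` is not a direct equivalent of `vpmovb2m`/`vpmovmskb`: it does not build a bitmask, it sets every byte lane whose top bit was set to all ones and every other lane to zero. `umaxv` then takes the maximum across the 16 lanes, so the result is non-zero exactly when some lane matched, and `fmov` + `tbnz` move that into a general-purpose register and branch on it. The `stp` and `str` in between look like unrelated stack stores from the surrounding code. If anyone can confirm or correct this, I would love to hear it.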
Regardless, I would like the compiler to do this optimization automatically, as it does for x86_64.
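For what it's worth, the intent ("is the top bit set in ANY byte?") can also be written without going through a bitmask at all, by OR-reducing the whole vector. This is only a sketch: `is_ascii_reduce` is a hypothetical name, `u8x64` is assumed to be an alias for `@Vector(64, u8)`, and what either backend emits for this form is not shown here.

```zig
const std = @import("std");

// Assumed definition of the alias used throughout this issue.
const u8x64 = @Vector(64, u8);

// OR all 64 bytes together. The top bit of the result is set
// if and only if it was set in at least one input byte.
fn is_ascii_reduce(input: u8x64) bool {
    return @reduce(.Or, input) < 0x80;
}

test "OR-reduction detects a non-ASCII byte" {
    var bytes = [_]u8{'a'} ** 64;
    try std.testing.expect(is_ascii_reduce(bytes));

    bytes[63] = 0x8E; // UTF-8 continuation byte in the last lane
    try std.testing.expect(!is_ascii_reduce(bytes));
}
```

This is the same OR-then-test idea as the hand-split versions above, just without naming the 16-byte or 32-byte pieces explicitly.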
[Here is a godbolt link with the entire file and context](https://zig.godbolt.org/z/3xP1fTW6G). The `is_ascii` function begins on line 1547. (Ctrl+G is the jump-to-line command)
Thank you to all LLVM contributors!
‒ Validark
</pre>