<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/79690>79690</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
Vector Saturating Subtractions should be flipped around when the result is AND'ed
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
Validark
</td>
</tr>
</table>
<pre>
Sorry about that title, not sure what to call this.
In simdjzon, a Zig port of simdjson, we have code like this:
```zig
fn must_be_2_3_continuation1(prev2: Chunk, prev3: Chunk) Chunk {
const is_third_byte: Chunk = @bitCast(prev2 -| @as(Chunk, @splat(0b11100000 - 1)));
const is_fourth_byte: Chunk = @bitCast(prev3 -| @as(Chunk, @splat(0b11110000 - 1)));
const i1xchunk_len = @Vector(chunk_len, i1);
const result = @as(i1xchunk_len, @bitCast((is_third_byte | is_fourth_byte) > @as(@Vector(chunk_len, u8), @splat(0))));
return @as(Chunk, @bitCast(@as(IChunk, result))) & @as(Chunk, @splat(0x80));
}
```
The x86 codegen was:
```asm
.LCPI0_0:
.zero 16,223
.LCPI0_1:
.zero 16,239
.LCPI0_2:
.zero 16,128
must_be_2_3_continuation1:
vpsubusb xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
vpsubusb xmm1, xmm1, xmmword ptr [rip + .LCPI0_1]
vpor xmm0, xmm1, xmm0
vpxor xmm1, xmm1, xmm1
vpcmpeqb xmm0, xmm0, xmm1
vpandn xmm0, xmm0, xmmword ptr [rip + .LCPI0_2]
ret
```
The aarch64 codegen was:
```asm
must_be_2_3_continuation1:
movi v2.16b, #223
uqsub v0.16b, v0.16b, v2.16b
movi v2.16b, #239
uqsub v1.16b, v1.16b, v2.16b
orr v0.16b, v1.16b, v0.16b
cmeq v0.16b, v0.16b, #0
movi v1.16b, #128
bic v0.16b, v1.16b, v0.16b
ret
```
I tried cleaning it up like so:
```zig
fn must_be_2_3_continuation2(prev2: Chunk, prev3: Chunk) Chunk {
const is_third_byte = @select(u8, prev2 >= @as(Chunk, @splat(0b11100000)), @as(Chunk, @splat(0x80)), @as(Chunk, @splat(0)));
const is_fourth_byte = @select(u8, prev3 >= @as(Chunk, @splat(0b11110000)), @as(Chunk, @splat(0x80)), @as(Chunk, @splat(0)));
return (is_third_byte | is_fourth_byte);
}
```
Unfortunately, LLVM still did not see the optimization here. x86 emit:
```asm
.LCPI0_0:
.zero 16,224
.LCPI0_1:
.zero 16,240
.LCPI0_2:
.zero 16,128
must_be_2_3_continuation2:
vpmaxub xmm2, xmm0, xmmword ptr [rip + .LCPI0_0]
vpmaxub xmm3, xmm1, xmmword ptr [rip + .LCPI0_1]
vpcmpeqb xmm0, xmm0, xmm2
vpcmpeqb xmm1, xmm1, xmm3
vpor xmm0, xmm0, xmm1
vpsllw xmm0, xmm0, 7
vpand xmm0, xmm0, xmmword ptr [rip + .LCPI0_2]
ret
```
aarch64 emit:
```asm
must_be_2_3_continuation2:
movi v2.16b, #223
cmhi v0.16b, v0.16b, v2.16b
movi v2.16b, #239
cmhi v1.16b, v1.16b, v2.16b
orr v0.16b, v0.16b, v1.16b
movi v1.16b, #128
and v0.16b, v0.16b, v1.16b
ret
```
(This is actually one instruction shorter than the first aarch64 emit, so it may be better. I am not familiar with how expensive each instruction is.)
Then I tried:
```zig
fn must_be_2_3_continuation3(prev2: Chunk, prev3: Chunk) Chunk {
const is_third_byte: std.meta.Int(.unsigned, chunk_len) = @bitCast(prev2 >= @as(Chunk, @splat(0b11100000)));
const is_fourth_byte: std.meta.Int(.unsigned, chunk_len) = @bitCast(prev3 >= @as(Chunk, @splat(0b11110000)));
return @select(u8, @as(@Vector(chunk_len, bool), @bitCast(is_third_byte | is_fourth_byte)), @as(Chunk, @splat(0x80)), @as(Chunk, @splat(0)));
}
```
This compiled to almost the same code as implementation 2 on both x86 and aarch64. The only difference is that the register allocator flipped the operand order of the vector OR, so there is no relevant difference. I have included it just as an extra test case.
Lastly, I wrote this implementation.
```zig
export fn must_be_2_3_continuation4(prev2: Chunk, prev3: Chunk) Chunk {
const is_third_byte: Chunk = prev2 -| @as(Chunk, @splat(0b11100000 - 0x80));
const is_fourth_byte: Chunk = prev3 -| @as(Chunk, @splat(0b11110000 - 0x80));
return (is_third_byte | is_fourth_byte) & @as(Chunk, @splat(0x80));
}
```
x86 emit:
```asm
.LCPI0_0:
.zero 16,96
.LCPI0_1:
.zero 16,112
.LCPI0_2:
.zero 16,128
must_be_2_3_continuation4:
vpsubusb xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
vpsubusb xmm1, xmm1, xmmword ptr [rip + .LCPI0_1]
vpor xmm0, xmm1, xmm0
vpand xmm0, xmm0, xmmword ptr [rip + .LCPI0_2]
ret
```
aarch64 emit:
```asm
must_be_2_3_continuation4:
movi v2.16b, #96
uqsub v0.16b, v0.16b, v2.16b
movi v2.16b, #112
uqsub v1.16b, v1.16b, v2.16b
orr v0.16b, v1.16b, v0.16b
movi v1.16b, #128
and v0.16b, v0.16b, v1.16b
ret
```
As you can see, the last implementation finally gets optimal x86 codegen. The emit for implementation 1 and implementation 4 differs in the same way on x86 and aarch64: the zero-compare (`vpxor` + `vpcmpeqb` on x86, `cmeq` on aarch64) disappears and the and-not (`vpandn` / `bic`) becomes a plain AND, so the rewrite eliminates instructions on both platforms. However, implementations 2 and 3 (identical emit) have the same number of instructions on aarch64 as implementation 4, so if the instructions being swapped have identical cost, the emit from implementations 2/3 or 4 would be equally acceptable on aarch64.
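For reference, here is a minimal scalar sketch of why implementations 1 and 4 should be interchangeable. It models a single `u8` lane and checks both shapes exhaustively for every threshold with the top bit set. The helper names `lane_compare` and `lane_mask` are hypothetical and are not part of simdjzon. Because the final AND with `0x80` distributes over the OR, checking one operand at a time should be enough.
```zig
const std = @import("std");

// Hypothetical scalar models of one u8 lane (names are illustrative only).

// Shape of implementation 1: saturating-subtract (threshold - 1),
// test the result for nonzero, then select 0x80 or 0.
fn lane_compare(x: u8, threshold: u8) u8 {
    const diff = x -| (threshold - 1);
    return if (diff != 0) @as(u8, 0x80) else 0;
}

// Shape of implementation 4: saturating-subtract (threshold - 0x80),
// then keep only the top bit. Assumes threshold >= 0x80.
fn lane_mask(x: u8, threshold: u8) u8 {
    return (x -| (threshold - 0x80)) & 0x80;
}

test "both shapes agree for every byte and every threshold >= 0x80" {
    var t: u16 = 0x80;
    while (t <= 0xFF) : (t += 1) {
        var x: u16 = 0;
        while (x <= 0xFF) : (x += 1) {
            const xb: u8 = @intCast(x);
            const tb: u8 = @intCast(t);
            try std.testing.expectEqual(lane_compare(xb, tb), lane_mask(xb, tb));
        }
    }
}
```
Running this with `zig test` covers the two thresholds used above (`0b11100000` and `0b11110000`) along with every other constant of at least `0x80`, which suggests the transform could be applied generally whenever the compared constant has its top bit set and only the top bit of the result is kept.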
</pre>