<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/79690>79690</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            Vector Saturating Subtractions should be flipped around when the result is AND'ed
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          Validark
      </td>
    </tr>
</table>

<pre>
Sorry about the title; I was not sure what to call this.

In simdjzon, a Zig port of simdjson, we have code like this:

```zig
fn must_be_2_3_continuation1(prev2: Chunk, prev3: Chunk) Chunk {
    const is_third_byte: Chunk = @bitCast(prev2 -| @as(Chunk, @splat(0b11100000 - 1)));
    const is_fourth_byte: Chunk = @bitCast(prev3 -| @as(Chunk, @splat(0b11110000 - 1)));
    const i1xchunk_len = @Vector(chunk_len, i1);
    const result = @as(i1xchunk_len, @bitCast((is_third_byte | is_fourth_byte) > @as(@Vector(chunk_len, u8), @splat(0))));
    return @as(Chunk, @bitCast(@as(IChunk, result))) & @as(Chunk, @splat(0x80));
}
```

The x86 codegen was:

```asm
.LCPI0_0:
        .zero   16,223
.LCPI0_1:
        .zero   16,239
.LCPI0_2:
        .zero   16,128
must_be_2_3_continuation1:
        vpsubusb        xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
        vpsubusb        xmm1, xmm1, xmmword ptr [rip + .LCPI0_1]
        vpor    xmm0, xmm1, xmm0
        vpxor   xmm1, xmm1, xmm1
        vpcmpeqb        xmm0, xmm0, xmm1
        vpandn  xmm0, xmm0, xmmword ptr [rip + .LCPI0_2]
        ret
```

The aarch64 codegen was:

```asm
must_be_2_3_continuation1:
        movi    v2.16b, #223
        uqsub   v0.16b, v0.16b, v2.16b
        movi    v2.16b, #239
        uqsub   v1.16b, v1.16b, v2.16b
        orr     v0.16b, v1.16b, v0.16b
        cmeq    v0.16b, v0.16b, #0
        movi    v1.16b, #128
        bic     v0.16b, v1.16b, v0.16b
        ret
```

I tried cleaning it up like so:

```zig
fn must_be_2_3_continuation2(prev2: Chunk, prev3: Chunk) Chunk {
    const is_third_byte  = @select(u8, prev2 >= @as(Chunk, @splat(0b11100000)), @as(Chunk, @splat(0x80)), @as(Chunk, @splat(0)));
    const is_fourth_byte = @select(u8, prev3 >= @as(Chunk, @splat(0b11110000)), @as(Chunk, @splat(0x80)), @as(Chunk, @splat(0)));
    return (is_third_byte | is_fourth_byte);
}
```

Unfortunately, LLVM still did not see the optimization here. x86 emit:

```asm
.LCPI0_0:
        .zero   16,224
.LCPI0_1:
        .zero   16,240
.LCPI0_2:
        .zero   16,128
must_be_2_3_continuation2:
        vpmaxub xmm2, xmm0, xmmword ptr [rip + .LCPI0_0]
        vpmaxub xmm3, xmm1, xmmword ptr [rip + .LCPI0_1]
        vpcmpeqb        xmm0, xmm0, xmm2
        vpcmpeqb        xmm1, xmm1, xmm3
        vpor    xmm0, xmm0, xmm1
        vpsllw  xmm0, xmm0, 7
        vpand   xmm0, xmm0, xmmword ptr [rip + .LCPI0_2]
        ret
```

aarch64 emit:

```asm
must_be_2_3_continuation2:
        movi    v2.16b, #223
        cmhi    v0.16b, v0.16b, v2.16b
        movi    v2.16b, #239
        cmhi    v1.16b, v1.16b, v2.16b
        orr     v0.16b, v0.16b, v1.16b
        movi    v1.16b, #128
        and     v0.16b, v0.16b, v1.16b
        ret
```

(This is actually one instruction shorter than the first aarch64 emit, so it may be the better choice there. I am not familiar with how expensive each instruction is.)

Then I tried:

```zig
fn must_be_2_3_continuation3(prev2: Chunk, prev3: Chunk) Chunk {
    const is_third_byte: std.meta.Int(.unsigned, chunk_len)  = @bitCast(prev2 >= @as(Chunk, @splat(0b11100000)));
    const is_fourth_byte: std.meta.Int(.unsigned, chunk_len) = @bitCast(prev3 >= @as(Chunk, @splat(0b11110000)));
    return @select(u8, @as(@Vector(chunk_len, bool), @bitCast(is_third_byte | is_fourth_byte)), @as(Chunk, @splat(0x80)), @as(Chunk, @splat(0)));
}
```

This produced almost identical output to implementation 2 on both x86 and aarch64. The only difference is that the register allocator swapped the operand order of the vector OR, which is not a meaningful change. I have included it only as an extra test case.

Lastly, I wrote this implementation:

```zig
export fn must_be_2_3_continuation4(prev2: Chunk, prev3: Chunk) Chunk {
    const is_third_byte: Chunk = prev2 -| @as(Chunk, @splat(0b11100000 - 0x80));
    const is_fourth_byte: Chunk = prev3 -| @as(Chunk, @splat(0b11110000 - 0x80));
    return (is_third_byte | is_fourth_byte) & @as(Chunk, @splat(0x80));
}
```

x86 emit:

```asm
.LCPI0_0:
        .zero   16,96
.LCPI0_1:
        .zero   16,112
.LCPI0_2:
        .zero   16,128
must_be_2_3_continuation4:
        vpsubusb        xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
        vpsubusb        xmm1, xmm1, xmmword ptr [rip + .LCPI0_1]
        vpor    xmm0, xmm1, xmm0
        vpand   xmm0, xmm0, xmmword ptr [rip + .LCPI0_2]
        ret
```

aarch64 emit:

```asm
must_be_2_3_continuation4:
        movi    v2.16b, #96
        uqsub   v0.16b, v0.16b, v2.16b
        movi    v2.16b, #112
        uqsub   v1.16b, v1.16b, v2.16b
        orr     v0.16b, v1.16b, v0.16b
        movi    v1.16b, #128
        and     v0.16b, v0.16b, v1.16b
        ret
```

As you can see, with the last implementation we finally got optimal x86 codegen! Compared with the 1st implementation, the emit changes in the same way on x86 and aarch64: the compare against zero disappears and the and-not (`vpandn` / `bic`) becomes a plain AND, so this optimization eliminates instructions on both platforms. However, implementations 2 and 3 (identical emit) have the same number of instructions on aarch64 as implementation 4, so if the instructions being swapped have identical cost, the emit from implementations 2/3 or 4 would be equally acceptable on aarch64.
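
For anyone who wants to double-check that the rewrite is sound, here is a minimal sketch that models a single `u8` lane of each form and compares them exhaustively. The function names (`viaCompare`, `viaNonzeroSatSub`, `viaSatSubHighBit`) are made up for this illustration and are not part of simdjzon, and it assumes a Zig version that matches the snippets above (0.11 or later). The idea is that `x -| 96` has its top bit set exactly when `x >= 224`, and `x -| 112` has its top bit set exactly when `x >= 240`, so masking with `0x80` gives the same answer as the compare.

```zig
const std = @import("std");

// One lane of implementations 2/3: unsigned compare, then select 0x80 or 0.
fn viaCompare(prev2: u8, prev3: u8) u8 {
    return if (prev2 >= 0b11100000 or prev3 >= 0b11110000) 0x80 else 0;
}

// One lane of implementation 1: saturating subtract, test for nonzero, mask.
fn viaNonzeroSatSub(prev2: u8, prev3: u8) u8 {
    const third: u8 = prev2 -| (0b11100000 - 1);
    const fourth: u8 = prev3 -| (0b11110000 - 1);
    return if ((third | fourth) > 0) 0x80 else 0;
}

// One lane of implementation 4: saturating subtract so that the answer
// lands directly in bit 7, then mask that bit out.
fn viaSatSubHighBit(prev2: u8, prev3: u8) u8 {
    const third: u8 = prev2 -| (0b11100000 - 0x80);
    const fourth: u8 = prev3 -| (0b11110000 - 0x80);
    return (third | fourth) & 0x80;
}

test "all three forms agree for every pair of bytes" {
    for (0..256) |a| {
        for (0..256) |b| {
            const p2: u8 = @intCast(a);
            const p3: u8 = @intCast(b);
            const expected = viaCompare(p2, p3);
            try std.testing.expectEqual(expected, viaNonzeroSatSub(p2, p3));
            try std.testing.expectEqual(expected, viaSatSubHighBit(p2, p3));
        }
    }
}
```

This can be run with `zig test` on the file. It only checks the scalar arithmetic, not what LLVM does with the vector versions.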
</pre>