<table border="1" cellspacing="0" cellpadding="8">

    <tr>

        <th>Issue</th>

        <td>

            <a href=https://github.com/llvm/llvm-project/issues/79690>79690</a>

        </td>

    </tr>

    <tr>

        <th>Summary</th>

        <td>

            Vector Saturating Subtractions should be flipped around when the result is AND'ed

        </td>

    </tr>

    <tr>

      <th>Labels</th>

      <td>

            new issue

      </td>

    </tr>

    <tr>

      <th>Assignees</th>

      <td>

      </td>

    </tr>

    <tr>

      <th>Reporter</th>

      <td>

          Validark

      </td>

    </tr>

</table>

<pre>

Sorry about the title; I was not sure what to call this.

In simdjzon, a Zig port of simdjson, we have code like this:

```zig

fn must_be_2_3_continuation1(prev2: Chunk, prev3: Chunk) Chunk {

    const is_third_byte: Chunk = @bitCast(prev2 -| @as(Chunk, @splat(0b11100000 - 1)));

    const is_fourth_byte: Chunk = @bitCast(prev3 -| @as(Chunk, @splat(0b11110000 - 1)));

    const i1xchunk_len = @Vector(chunk_len, i1);

    const result = @as(i1xchunk_len, @bitCast((is_third_byte | is_fourth_byte) > @as(@Vector(chunk_len, u8), @splat(0))));

    return @as(Chunk, @bitCast(@as(IChunk, result))) & @as(Chunk, @splat(0x80));

}

```
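
The snippets in this report use `chunk_len`, `Chunk`, and `IChunk` without showing their definitions. For anyone reproducing the listings, here is a minimal sketch of what they are assumed to be. The 16-lane width is inferred from the 128-bit `xmm` / `v.16b` registers in the emit; the exact declarations in simdjzon may differ.

```zig
const std = @import("std");

// Assumed definitions (not shown in the report). 16 lanes matches the
// 128-bit xmm / v.16b registers that appear in the assembly listings.
const chunk_len = 16;
const Chunk = @Vector(chunk_len, u8); // unsigned byte lanes
const IChunk = @Vector(chunk_len, i8); // signed byte lanes

test "a Chunk occupies one 128-bit register" {
    try std.testing.expectEqual(@as(usize, 16), @sizeOf(Chunk));
}
```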

The x86 codegen was:

```asm

.LCPI0_0:

        .zero   16,223

.LCPI0_1:

        .zero   16,239

.LCPI0_2:

        .zero 16,128

must_be_2_3_continuation1:

        vpsubusb        xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]

        vpsubusb        xmm1, xmm1, xmmword ptr [rip + .LCPI0_1]

        vpor    xmm0, xmm1, xmm0

        vpxor xmm1, xmm1, xmm1

        vpcmpeqb        xmm0, xmm0, xmm1

        vpandn  xmm0, xmm0, xmmword ptr [rip + .LCPI0_2]

        ret

```

The aarch64 codegen was:

```asm

must_be_2_3_continuation1:

        movi v2.16b, #223

        uqsub   v0.16b, v0.16b, v2.16b

        movi v2.16b, #239

        uqsub   v1.16b, v1.16b, v2.16b

        orr v0.16b, v1.16b, v0.16b

        cmeq    v0.16b, v0.16b, #0

        movi v1.16b, #128

        bic     v0.16b, v1.16b, v0.16b

        ret

```

I tried cleaning it up like so:

```zig

fn must_be_2_3_continuation2(prev2: Chunk, prev3: Chunk) Chunk {

    const is_third_byte  = @select(u8, prev2 >= @as(Chunk, @splat(0b11100000)), @as(Chunk, @splat(0x80)), @as(Chunk, @splat(0)));

    const is_fourth_byte = @select(u8, prev3 >= @as(Chunk, @splat(0b11110000)), @as(Chunk, @splat(0x80)), @as(Chunk, @splat(0)));

    return (is_third_byte | is_fourth_byte);

}

```

Unfortunately, LLVM still did not see the optimization here. x86 emit:

```asm

.LCPI0_0:

        .zero 16,224

.LCPI0_1:

        .zero   16,240

.LCPI0_2:

        .zero 16,128

must_be_2_3_continuation2:

        vpmaxub xmm2, xmm0, xmmword ptr [rip + .LCPI0_0]

        vpmaxub xmm3, xmm1, xmmword ptr [rip + .LCPI0_1]

        vpcmpeqb        xmm0, xmm0, xmm2

        vpcmpeqb xmm1, xmm1, xmm3

        vpor    xmm0, xmm0, xmm1

        vpsllw xmm0, xmm0, 7

        vpand   xmm0, xmm0, xmmword ptr [rip + .LCPI0_2]

        ret

```

aarch64 emit:

```asm

must_be_2_3_continuation2:

        movi    v2.16b, #223

        cmhi    v0.16b, v0.16b, v2.16b

        movi    v2.16b, #239

        cmhi    v1.16b, v1.16b, v2.16b

        orr     v0.16b, v0.16b, v1.16b

        movi    v1.16b, #128

        and     v0.16b, v0.16b, v1.16b

        ret

```

(This is actually one instruction shorter than the first aarch64 emit, so maybe it is better? I am not familiar with how expensive each instruction is.)

Then I tried:

```zig

fn must_be_2_3_continuation3(prev2: Chunk, prev3: Chunk) Chunk {

    const is_third_byte: std.meta.Int(.unsigned, chunk_len)  = @bitCast(prev2 >= @as(Chunk, @splat(0b11100000)));

    const is_fourth_byte: std.meta.Int(.unsigned, chunk_len) = @bitCast(prev3 >= @as(Chunk, @splat(0b11110000)));

    return @select(u8, @as(@Vector(chunk_len, bool), @bitCast(is_third_byte | is_fourth_byte)), @as(Chunk, @splat(0x80)), @as(Chunk, @splat(0)));

}

```

This produced almost identical output to implementation 2 on both targets. The only difference is that the register allocator flipped the operand order of the vector OR, which is not a meaningful difference. I have included it only as an extra test case.

Lastly, I wrote this implementation:

```zig

export fn must_be_2_3_continuation4(prev2: Chunk, prev3: Chunk) Chunk {

    const is_third_byte: Chunk = prev2 -| @as(Chunk, @splat(0b11100000 - 0x80));

    const is_fourth_byte: Chunk = prev3 -| @as(Chunk, @splat(0b11110000 - 0x80));

    return (is_third_byte | is_fourth_byte) & @as(Chunk, @splat(0x80));

}

```

x86 emit:

```asm

.LCPI0_0:

        .zero   16,96

.LCPI0_1:

        .zero   16,112

.LCPI0_2:

        .zero 16,128

must_be_2_3_continuation4:

        vpsubusb        xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]

        vpsubusb        xmm1, xmm1, xmmword ptr [rip + .LCPI0_1]

        vpor    xmm0, xmm1, xmm0

        vpand xmm0, xmm0, xmmword ptr [rip + .LCPI0_2]

        ret

```

aarch64 emit:

```asm

must_be_2_3_continuation4:

        movi    v2.16b, #96

        uqsub   v0.16b, v0.16b, v2.16b

        movi    v2.16b, #112

        uqsub   v1.16b, v1.16b, v2.16b

        orr     v0.16b, v1.16b, v0.16b

        movi    v1.16b, #128

        and     v0.16b, v0.16b, v1.16b

        ret

```

As you can see, the last implementation finally gets optimal x86 codegen. The emit for implementations 1 and 4 differs in the same way on x86 and aarch64: the compare against zero disappears and the inverted AND (`vpandn` / `bic`) becomes a plain AND, which eliminates instructions on both platforms. However, implementations 2 and 3 (near-identical emit) have the same number of instructions on aarch64 as implementation 4, so if the instructions being swapped have the same cost, the emit from either implementations 2/3 or implementation 4 would be acceptable on aarch64.
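
To back up the claim that implementations 1 and 4 compute the same result, here is a small scalar sketch that models a single byte lane of each and compares them over every pair of inputs. The names `lane_v1` and `lane_v4` are hypothetical helpers written for this check, not part of simdjzon. It works because `x -| 96` is at most 159, so its top bit is set exactly when `x >= 224`; likewise `x -| 112` is at most 143, so its top bit is set exactly when `x >= 240`.

```zig
const std = @import("std");

// One lane of implementation 1 (hypothetical scalar model): any nonzero
// saturating difference is turned into 0x80.
fn lane_v1(prev2: u8, prev3: u8) u8 {
    const is_third_byte: u8 = prev2 -| (0b11100000 - 1);
    const is_fourth_byte: u8 = prev3 -| (0b11110000 - 1);
    return if ((is_third_byte | is_fourth_byte) > 0) 0x80 else 0;
}

// One lane of implementation 4 (hypothetical scalar model): the subtrahend
// is lowered by 0x80 so the answer lands directly in the top bit.
fn lane_v4(prev2: u8, prev3: u8) u8 {
    const is_third_byte: u8 = prev2 -| (0b11100000 - 0x80);
    const is_fourth_byte: u8 = prev3 -| (0b11110000 - 0x80);
    return (is_third_byte | is_fourth_byte) & 0x80;
}

test "implementations 1 and 4 agree on every pair of byte values" {
    for (0..256) |a| {
        for (0..256) |b| {
            const x: u8 = @intCast(a);
            const y: u8 = @intCast(b);
            try std.testing.expectEqual(lane_v1(x, y), lane_v4(x, y));
        }
    }
}
```

Running this with `zig test` exercises all 65536 input pairs. Because every lane is independent, agreement on scalars carries over to the vector versions.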

</pre>
