<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href="https://github.com/llvm/llvm-project/issues/78888">78888</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
vector compare not equal and compare not not equal should be handled better
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
Validark
</td>
</tr>
</table>
<pre>
So I wrote some code that takes 8 bits and expands each bit into 8 bits (all 1's or all 0's), like so:
```
┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐
│ 0 │ 0 │ 1 │ 0 │ 0 │ 1 │ 1 │ 0 │
├──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┤
│00000000│00000000│11111111│00000000│00000000│11111111│11111111│00000000│
└────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┘
```
Now, I did write a version that works via `pdep(m, 0x0101010101010101) * 255`, and also a SWAR implementation.
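(For reference, here is a minimal sketch of how a SWAR version of this expansion might look; this is purely illustrative, assumes the output uses the same little-endian byte order as the vector version below, and `fooSwar` is a made-up name rather than the code referenced above.)
```zig
fn fooSwar(a: u8) u64 {
    // Broadcast the input byte into all 8 byte lanes of a u64.
    const splatted: u64 = @as(u64, a) * 0x0101010101010101;
    // Isolate one distinct bit per byte; 0x80 sits in the lowest byte to match the
    // little-endian layout of the vector version's { 0x80, 0x40, ..., 0x01 }.
    const masked = splatted & 0x0102040810204080;
    // Adding 0x7F to every byte sets that byte's high bit iff the byte was nonzero;
    // no carries cross byte boundaries because each byte holds at most 0x80.
    const nonzero = (masked + 0x7f7f7f7f7f7f7f7f) & 0x8080808080808080;
    // Each byte is now 0x00 or 0x80; shift the flag down and scale it to 0x00 or 0xFF.
    return (nonzero >> 7) * 0xff;
}
```
But I also wanted to try a version that works via vectors. Here is what I wrote for that: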
```zig
fn foo(a: u8) u64 {
    const x = a;
    const unique_bytes: @Vector(8, u8) = [_]u8{ 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01 };
    const splatted = @as(@Vector(8, u8), @splat(x));
    return @as(u64, @bitCast(@select(
        u8,
        (splatted & unique_bytes) == unique_bytes,
        @as(@Vector(8, u8), @splat(255)),
        @as(@Vector(8, u8), @splat(0)),
    )));
}
```
For those who can't read Zig that well, here is a diagram depicting what I am doing. I broadcast 8 bits across a vector of 8 bytes, in this case represented as `abcdefgh`, and isolate a single bit in each byte. Then, for each byte where the isolated bit is present, I want that byte to turn into all ones, otherwise all zeros. To do that, I do an element-wise equality comparison against the vector I just used to isolate each bit. In Zig, that gives me a vector of booleans, so I do a `@select`, which should give me all ones for a true and all zeros for a false. That should map to what the hardware already does anyway, so the `@select` should be a no-op.
```
┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐
│abcdefgh│abcdefgh│abcdefgh│abcdefgh│abcdefgh│abcdefgh│abcdefgh│abcdefgh│
├───&────┼───&────┼───&────┼───&────┼───&────┼───&────┼───&────┼───&────┤
│10000000│01000000│00100000│00010000│00001000│00000100│00000010│00000001│
├──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┤
│a0000000│0b000000│00c00000│000d0000│0000e000│00000f00│000000g0│0000000h│
├───≡────┼───≡────┼───≡────┼───≡────┼───≡────┼───≡────┼───≡────┼───≡────┤
│10000000│01000000│00100000│00010000│00001000│00000100│00000010│00000001│
├──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┤
│aaaaaaaa│bbbbbbbb│cccccccc│dddddddd│eeeeeeee│ffffffff│gggggggg│hhhhhhhh│
└────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┘
```
Currently, that maps to this assembly for x86-64 Zen 3:
```asm
.LCPI0_0:
        .byte   128
        .byte   64
        .byte   32
        .byte   16
        .byte   8
        .byte   4
        .byte   2
        .byte   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
foo:
        vmovd           xmm0, edi
        vpxor           xmm1, xmm1, xmm1
        vpbroadcastb    xmm0, xmm0
        vpand           xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
        vpcmpeqb        xmm0, xmm0, xmm1
        vpcmpeqd        xmm1, xmm1, xmm1
        vpxor           xmm0, xmm0, xmm1
        vmovq           rax, xmm0
        ret
```
You can see the compiler wanted to avoid using that `unique_bytes` vector twice, and changed `(splatted & unique_bytes) == unique_bytes` to `(splatted & unique_bytes) != @splat(0)`. However, there is no `vpcmpneqb`/vector-not-equal instruction on Zen 3 hardware. (We do have one on AVX-512-enabled ISAs, but we avoid it because it's slower for the hardware.) To emulate it, the emitted instructions instead compute `(splatted & unique_bytes) == @splat(0)` and then xor the result with all ones to flip all the bits (the `vpcmpeqd`+`vpxor` pair in the assembly).
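In source terms, that rewrite is roughly equivalent to having written the masked test against zero directly, something like the sketch below (my paraphrase of the transformation described above, not code from the original function):
```zig
fn fooCanonicalized(a: u8) u64 {
    const unique_bytes: @Vector(8, u8) = [_]u8{ 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01 };
    const splatted = @as(@Vector(8, u8), @splat(a));
    return @as(u64, @bitCast(@select(
        u8,
        // Test the isolated bit against zero instead of re-using `unique_bytes`,
        // i.e. the not-equal form the optimizer canonicalized to.
        (splatted & unique_bytes) != @as(@Vector(8, u8), @splat(0)),
        @as(@Vector(8, u8), @splat(255)),
        @as(@Vector(8, u8), @splat(0)),
    )));
}
```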
This assembly does not seem optimal to me. I think we could instead load `.LCPI0_0` into `xmm1` and eliminate the need to read it from memory in the `vpand` instruction, or we could give `vpcmpeqb` the `xmmword ptr [rip + .LCPI0_0]` operand too, if the hardware likes that and if `vpcmpeqb` can take operands in memory. Alternatively, we could do a `not edi` at the beginning or a `not rax` at the end. I tried these last two ideas in my Zig code. When you change `const x = a;` in the above function to `const x = ~a;`, it gives this assembly:
```asm
foo:
        not             dil
        vpxor           xmm1, xmm1, xmm1
        vmovd           xmm0, edi
        vpbroadcastb    xmm0, xmm0
        vpand           xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
        vpcmpeqb        xmm0, xmm0, xmm1
        vpcmpeqd        xmm1, xmm1, xmm1
        vpxor           xmm0, xmm0, xmm1
        vmovq           rax, xmm0
        ret
```
Obviously, this is a double negative, although I don't know what the internal representation looks like in the LLVM passes, so maybe it's difficult for the compiler to see this. But this should give the optimal emit below.
When you make any of the following changes (invert the bitstring at the end, invert the condition, or swap 255 with 0), you do get the optimal emit:
```diff
-    return @as(u64, @bitCast(@select(
+    return ~@as(u64, @bitCast(@select(
         u8,
-        (splatted & unique_bytes) == unique_bytes,
+        (splatted & unique_bytes) != unique_bytes,
-        @as(@Vector(8, u8), @splat(255)),
+        @as(@Vector(8, u8), @splat(0)),
-        @as(@Vector(8, u8), @splat(0)),
+        @as(@Vector(8, u8), @splat(255)),
     )));
```
optimal emit:
```asm
foo:
        vmovd           xmm0, edi
        vpxor           xmm1, xmm1, xmm1
        vpbroadcastb    xmm0, xmm0
        vpand           xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
        vpcmpeqb        xmm0, xmm0, xmm1
        vmovq           rax, xmm0
        ret
```
However, this will give me the opposite of what I want. So there could be a `not` at the beginning or end. We also should be able to fold that `not` into whatever comes next. E.g., imagine a scenario like:
```diff
-export fn foo(a: u8) u64 {
+export fn foo(a: u8, b: u64) u64 {
     const x = a;
     const unique_bytes: @Vector(8, u8) = [_]u8{ 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01 };
     const splatted = @as(@Vector(8, u8), @splat(x));
-    return @as(u64, @bitCast(@select(
+    return b & @as(u64, @bitCast(@select(
         u8,
         (splatted & unique_bytes) == unique_bytes,
         @as(@Vector(8, u8), @splat(255)),
         @as(@Vector(8, u8), @splat(0)),
     )));
 }
```
In this case, we could do an `andn` with `b` at the end instead of inverting the vector or the `rax` register, or using the other ideas I already mentioned. I'm not here to tell you which of those options is best for the hardware, but I have given a bunch of options for eliminating instructions.
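To make the `andn` idea concrete, the caller-side pattern is just `b & ~inverted`, which is exactly what a single `andn` computes on BMI1-capable targets. A sketch (illustrative only; `maskedBits` is a made-up name, and the vector part is the inverted variant from the diffs above):
```zig
export fn maskedBits(a: u8, b: u64) u64 {
    const unique_bytes: @Vector(8, u8) = [_]u8{ 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01 };
    const splatted = @as(@Vector(8, u8), @splat(a));
    // The "opposite" result described above: all ones where the bit is clear.
    const inverted = @as(u64, @bitCast(@select(
        u8,
        (splatted & unique_bytes) != unique_bytes,
        @as(@Vector(8, u8), @splat(255)),
        @as(@Vector(8, u8), @splat(0)),
    )));
    // `b & ~inverted` equals `b & mask`; on BMI1 hardware the complement-and-and
    // can fold into a single `andn` instead of a separate vector or scalar `not`.
    return b & ~inverted;
}
```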
[Godbolt link](https://zig.godbolt.org/z/6rhYfoe59)
Thank you to all LLVM contributors!
‒ Validark
</pre>