<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/72088>72088</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [x86_64] Isolating bits can be optimized to a pext instruction
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          Validark
      </td>
    </tr>
</table>

<pre>
    Sometimes, we have a bitstring where we want to extract certain bits. E.g. to extract the `X` bits from `XXXX....XXXX`, we can do:

`(i & 0xF) | ((i >> 4) & 0xF0)`

This has 4 operations and will take ~3 cycles, assuming the `(i & 0xF)` can run in parallel to `i >> 4` or the `& 0xF0` (and not counting the `mov`).

```asm
 mov     eax, edi
        shr     edi, 4
        and     eax, 15
 and     edi, 240
        or      eax, edi
```

Alternatively, we can condense this into a single pext instruction on x86_64 machines with bmi2.

```asm
        mov     eax, 3855
        pext    eax, edi, eax
```

Even fancier would be if we could detect more advanced bit-concentration patterns. Here is some real Zig code which concentrates the highest bit of each byte:

```zig
fn swarMovMask(v: anytype) @TypeOf(v) {
    comptime assert(@divExact(@bitSizeOf(@TypeOf(v)), 8) <= 8);
    const ones: @TypeOf(v) = @bitCast(@as(@Vector(@sizeOf(@TypeOf(v)), u8), @splat(1)));
    const msb_mask = 0x80 * ones;

    // We are exploiting a multiplication as a shifter and adder, and the derivation of this number is
    // shown here as a comptime loop.
    // This trick is often generalizable to other problems too: https://stackoverflow.com/a/14547307
    comptime var mult: @TypeOf(v) = 0;
    comptime for (0..@sizeOf(@TypeOf(v))) |i| {
        mult |= @as(@TypeOf(v), 1) << (7 * i);
    };

    // Example with 32 bit integers:
    // We want to concentrate the upper bits of each byte into a single nibble.
    // Doing the gradeschool multiplication algorithm, we can see that each 1 bit
    // in the bottom multiplicand shifts the upper multiplicand, and then we add all these
    // shifted bitstrings together. (Note `.` represents a 0)
    // a.......b.......c.......d.......
    // * ..........1......1......1......1
    // -------------------------------------------------------------------------
 //   a.......b.......c.......d.......
    // .b.......c.......d..............
    // ..c.......d.....................
    // + ...d............................
    // -------------------------------------------------------------------------
 //   abcd....bcd.....cd......d.......

    // Then we simply shift to the right by `32 - 4` (bitstring size minus the number of relevant bits) to isolate the desired `abcd` bits in the least significant byte!

 // Inspired by: http://0x80.pl/articles/scalar-sse-movmask.html
    return (mult *% (v & msb_mask)) >> (@bitSizeOf(@TypeOf(v)) - @sizeOf(@TypeOf(v)));
}
```

For 32 bit integers, this becomes:

`(0x204081 *% (v & 0x80808080)) >> 28`

In assembly:

```asm
        and     edi, 0x80808080
        imul eax, edi, 0x204081
        shr     eax, 28
```

Which we can reduce to:

```asm
        mov     eax, 0x80808080
        pext eax, edi, eax
```

Note: for AMD CPUs, this optimization should only be applied to Zen 3 or newer. Before that, the pext instruction was implemented via microcode and was very slow.
</pre>
<img width="1px" height="1px" alt="" src="http://email.email.llvm.org/o/eJy0WM1u4zgSfhr5UoggS7ItH3xI4gTbh95dYGZnGnMZUFTJ4oYiBZKS7T7ssy-KlGPZjjvTwK7aaNoiWVWsn68-hlkrdgpxEy2eosV2xnrXaLP5jUlRMfM2K3V13PyiW3SiRRulz7BHaNiAwKAUzjoj1A72DRqkmT1TDpwGPDjDuAOOxjGh_NIYXuJdPJ11DUK0TL5Fy8SvgNro1r_59u1bHMcxjdEyGdVypqDSUfYYJdsoOf1P04WAKF1CcniN0jVEq2eI0iK8zl6i7AVy_z4sSaJ0TbsmQn5thIWGWchBd2iYE1pZYKqCvZASHHtD-E8G_MhlcAKztm_p5OMRrkygA5G1plcgFHTMMClR0tmjZTKxapmANmcho33LhOwn9Uo74LpXbqKr1YNfvI6v_eA_zLbhDbR6AHqQHchmrMQ4MT62MWG-EjSfX86S-snu-WKcfn8fdqV5crlPB6E3Wt8NnFr9KB0axZwYUB4nYeZaVagsgqPICOU0MLBC7SRChwcHQllnek6BAq3gUCz_XObQMt4IhRb2wjVQtiL9zEfjc-WqrFgsLhd4pZenooEdfnC4lwEV1ExxgQb2upcVlAii9of0Pyt0yB202iCwamCKY0WV8MC14qhcyETomCM32Rj-RnUmLFjdIhhkEv4QO-C6Qtg3gjdw3ojWJ0wjdg1aR1JB14CMN1AeHX5QRv7zXezCm1qB3TPzVQ9fmX2L0mKIMor-0R079OWUJ78eO_xH7ed83T2dncZ12xFmUKmgcVSPeVKJ4eXA-PirFO4X8T0IuBbmP89QeLnZc5Rt_fcou1ChrAOt0JJlt-ZkWwhanpkdVTIbxt-QO23Cd_uJDX0xfqG1nWQkan6a_sii1pZ_tsy-eQuSQ0HV_Dja-TR1Om2J0tcofYXfEZhBwEMntfDVzqDtpROdFDxkAbNUA42oHRpfhayq0Hg0UpWPdYVGDGGxrkPpqL4t0YCwNxpto_cKPHB7ye8Bk1p38c1yD5HOCP5G-adrhwp2qNAwKb6zUiKBm3YNGuiMLiW2FpwmtIbGuY5CFCRZx_ibHtDUUu9jrtsofWVR-jrPF_kqS1YfpNDAjHfG3TAnV1EY99XaEJAmcfxpmH3XEL5zTLPYY0MvHc2O-XRKoSsZzzB_z1XfflY-6uI6RaLV9n4WvBxY20kM8JWlvmiFcrhDY98L9jJrTh13Uvg-F_quQxO66rTsr7BUibKUeBvtrT41nJ1hFVreaC1vElLutBGuaSfAbZHUMxc0zsmAG-FCecmldk63E6GqCultJweYzk5SXZE-VlXAqD03aPGD_KZKqc4shfJxh5ShMcXn79r5dhpTuzXYGbSoHFWCJwjX4lgcnnIc-ThW43izgaIfvz_zD4ebTQ__q2eUPIqFnzf_7sq7G-6tvO-gJ7i7-t6m_5-DSu4tGYd4HK4c9AEshly0ou3kMSQdlSNlsBG7xkF5pCzLUngIfC9KizNxJlCCVqg-5PyI1roGgxIHKm1aS8DiNAir5am8K7TCYEWiyfJ3Bj3WlkRmHRC1FzXVjgs9P51fHGM8wxdlOy-sPJ7g-h2tqX3FnSSMNk4E-vtqOZPMPFiLD60eqNXFjWvl2TkGXW8UHTWAZ_oYpQv6OXiSfGqQJ-QNbPgvkQJ4gM-x_Aywq-0P-NmrNjcgmz6Hvlki1y3aDy8bySFN8qSY35yLnBX-XZ4sLa40f1GeFrWlPN7jYTcU9Yp3T3RdLBNtL69Y6sneO-w_rE2LH3jqd88tR4g3WPWc2v1fNv2KXd8z3XPsnyHYBOGUsdTnH79u4fmf_zpHUBMHEN9Dq7KN59taySNxcNZ1UmBFRfUHKsjo0qJwT53hCWti49TDgqgPLhx7ZoHqHVtU1GIGwaAV3GhPxP21kVkY0BzBEsmZVZusWmdrNsPNfLleF8tkvspnzSbHJE1YlWVrVhVFzuerFSvzxbwu6zVfFNVMbNIkzebzeZas0nSRx3WdVyWvF8V8xZMc0yhPsGVCxlIObazNbias7XGzSpOimElWorT-bp-mCvfgJ6M0pau-2dCeh7Lf2ShPpLDOnqU44aT_o0C4WUWLLXzx8EOo5ZGGUqHEk5uDM9mNr2a9kZtLArgTrunLkfmRwnF46Iz-N9Ll4NWbSUjjj_HfAAAA___JpKvD">