<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/72088>72088</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[x86_64] Isolating bits can be optimized to a pext instruction
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
Validark
</td>
</tr>
</table>
<pre>
Sometimes, we have a bitstring where we want to extract certain bits. E.g. to extract the `X` bits from `XXXX....XXXX`, we can do:
`(i & 0xF) | ((i >> 4) & 0xF0)`
This has 4 operations and will take ~3 cycles, assuming the `(i & 0xF)` can run in parallel to `i >> 4` or the `& 0xF0` (and not counting the `mov`).
```asm
mov eax, edi
shr edi, 4
and eax, 15
and edi, 240
or eax, edi
```
Alternatively, we can condense this into a single pext instruction on x86_64 machines with bmi2.
```asm
mov eax, 3855
pext eax, edi, eax
```
Even fancier would be if we could detect more advanced bit-concentration patterns. Here is some real Zig code which concentrates the highest bit of each byte:
```zig
fn swarMovMask(v: anytype) @TypeOf(v) {
comptime assert(@divExact(@bitSizeOf(@TypeOf(v)), 8) <= 8);
const ones: @TypeOf(v) = @bitCast(@as(@Vector(@sizeOf(@TypeOf(v)), u8), @splat(1)));
const msb_mask = 0x80 * ones;
// We are exploiting a multiplication as a shifter and adder, and the derivation of this number is
// shown here as a comptime loop.
// This trick is often generalizable to other problems too: https://stackoverflow.com/a/14547307
comptime var mult: @TypeOf(v) = 0;
comptime for (0..@sizeOf(@TypeOf(v))) |i| {
mult |= @as(@TypeOf(v), 1) << (7 * i);
};
// Example with 32 bit integers:
// We want to concentrate the upper bits of each byte into a single nibble.
// Doing the gradeschool multiplication algorithm, we can see that each 1 bit
// in the bottom multiplicand shifts the upper multiplicand, and then we add all these
// shifted bitstrings together. (Note `.` represents a 0)
// a.......b.......c.......d.......
// * ..........1......1......1......1
// -------------------------------------------------------------------------
// a.......b.......c.......d.......
// .b.......c.......d..............
// ..c.......d.....................
// + ...d............................
// -------------------------------------------------------------------------
// abcd....bcd.....cd......d.......
// Then we simply shift to the right by `32 - 4` (bitstring size minus the number of relevant bits) to isolate the desired `abcd` bits in the least significant byte!
// Inspired by: http://0x80.pl/articles/scalar-sse-movmask.html
return (mult *% (v & msb_mask)) >> (@bitSizeOf(@TypeOf(v)) - @sizeOf(@TypeOf(v)));
}
```
For 32 bit integers, this becomes:
`(0x204081 *% (v & 0x80808080)) >> 28`
In assembly:
```asm
and edi, 0x80808080
imul eax, edi, 0x204081
shr eax, 28
```
Which we can reduce to:
```asm
mov eax, 0x80808080
pext eax, edi, eax
```
Note: for AMD CPUs, this optimization should only be applied to Zen 3 or newer. Before that, the pext instruction was implemented via microcode and was very slow.
</pre>
<img width="1px" height="1px" alt="" src="http://email.email.llvm.org/o/eJy0WM1u4zgSfhr5UoggS7ItH3xI4gTbh95dYGZnGnMZUFTJ4oYiBZKS7T7ssy-KlGPZjjvTwK7aaNoiWVWsn68-hlkrdgpxEy2eosV2xnrXaLP5jUlRMfM2K3V13PyiW3SiRRulz7BHaNiAwKAUzjoj1A72DRqkmT1TDpwGPDjDuAOOxjGh_NIYXuJdPJ11DUK0TL5Fy8SvgNro1r_59u1bHMcxjdEyGdVypqDSUfYYJdsoOf1P04WAKF1CcniN0jVEq2eI0iK8zl6i7AVy_z4sSaJ0TbsmQn5thIWGWchBd2iYE1pZYKqCvZASHHtD-E8G_MhlcAKztm_p5OMRrkygA5G1plcgFHTMMClR0tmjZTKxapmANmcho33LhOwn9Uo74LpXbqKr1YNfvI6v_eA_zLbhDbR6AHqQHchmrMQ4MT62MWG-EjSfX86S-snu-WKcfn8fdqV5crlPB6E3Wt8NnFr9KB0axZwYUB4nYeZaVagsgqPICOU0MLBC7SRChwcHQllnek6BAq3gUCz_XObQMt4IhRb2wjVQtiL9zEfjc-WqrFgsLhd4pZenooEdfnC4lwEV1ExxgQb2upcVlAii9of0Pyt0yB202iCwamCKY0WV8MC14qhcyETomCM32Rj-RnUmLFjdIhhkEv4QO-C6Qtg3gjdw3ojWJ0wjdg1aR1JB14CMN1AeHX5QRv7zXezCm1qB3TPzVQ9fmX2L0mKIMor-0R079OWUJ78eO_xH7ed83T2dncZ12xFmUKmgcVSPeVKJ4eXA-PirFO4X8T0IuBbmP89QeLnZc5Rt_fcou1ChrAOt0JJlt-ZkWwhanpkdVTIbxt-QO23Cd_uJDX0xfqG1nWQkan6a_sii1pZ_tsy-eQuSQ0HV_Dja-TR1Om2J0tcofYXfEZhBwEMntfDVzqDtpROdFDxkAbNUA42oHRpfhayq0Hg0UpWPdYVGDGGxrkPpqL4t0YCwNxpto_cKPHB7ye8Bk1p38c1yD5HOCP5G-adrhwp2qNAwKb6zUiKBm3YNGuiMLiW2FpwmtIbGuY5CFCRZx_ibHtDUUu9jrtsofWVR-jrPF_kqS1YfpNDAjHfG3TAnV1EY99XaEJAmcfxpmH3XEL5zTLPYY0MvHc2O-XRKoSsZzzB_z1XfflY-6uI6RaLV9n4WvBxY20kM8JWlvmiFcrhDY98L9jJrTh13Uvg-F_quQxO66rTsr7BUibKUeBvtrT41nJ1hFVreaC1vElLutBGuaSfAbZHUMxc0zsmAG-FCecmldk63E6GqCultJweYzk5SXZE-VlXAqD03aPGD_KZKqc4shfJxh5ShMcXn79r5dhpTuzXYGbSoHFWCJwjX4lgcnnIc-ThW43izgaIfvz_zD4ebTQ__q2eUPIqFnzf_7sq7G-6tvO-gJ7i7-t6m_5-DSu4tGYd4HK4c9AEshly0ou3kMSQdlSNlsBG7xkF5pCzLUngIfC9KizNxJlCCVqg-5PyI1roGgxIHKm1aS8DiNAir5am8K7TCYEWiyfJ3Bj3WlkRmHRC1FzXVjgs9P51fHGM8wxdlOy-sPJ7g-h2tqX3FnSSMNk4E-vtqOZPMPFiLD60eqNXFjWvl2TkGXW8UHTWAZ_oYpQv6OXiSfGqQJ-QNbPgvkQJ4gM-x_Aywq-0P-NmrNjcgmz6Hvlki1y3aDy8bySFN8qSY35yLnBX-XZ4sLa40f1GeFrWlPN7jYTcU9Yp3T3RdLBNtL69Y6sneO-w_rE2LH3jqd88tR4g3WPWc2v1fNv2KXd8z3XPsnyHYBOGUsdTnH79u4fmf_zpHUBMHEN9Dq7KN59taySNxcNZ1UmBFRfUHKsjo0qJwT53hCWti49TDgqgPLhx7ZoHqHVtU1GIGwaAV3GhPxP21kVkY0BzBEsmZVZusWmdrNsPNfLleF8tkvspnzSbHJE1YlWVrVhVFzuerFSvzxbwu6zVfFNVMbNIkzebzeZas0nSRx3WdVyWvF8V8xZMc0yhPsGVCxlIObazNbias7XGzSpOimElWorT-bp-mCvfgJ6M0pau-2dCeh7Lf2ShPpLDOnqU44aT_o0C4WUWLLXzx8EOo5ZGGUqHEk5uDM9mNr2a9kZtLArgTrunLkfmRwnF46Iz-N9Ll4NWbSUjjj_HfAAAA___JpKvD">