<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href="https://github.com/llvm/llvm-project/issues/116796">116796</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
Suboptimal code generation around BSR with redundant instructions on x86 and x86-64
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
zheland
</td>
</tr>
</table>
<pre>
## Summary
LLVM's transformation of expressions involving the `BSR` (Bit Scan Reverse) instruction leads to redundant `XOR`s and `TEST`, and prevents `LEA` arithmetic, on x86 and x86-64.
## Details
When mathematically possible, LLVM appears to expand expressions like `constfn(31 - clz(val))` into `constfn(31) - constfn(clz(val))`, which results in code generation like:
```assembly
bsr %[out], edi
xor %[out], 0b11111
shl %[out], 2
xor %[out], 0b1111100
```
Expected:
```assembly
bsr %[out], edi
shl %[out], 2
```
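For reference, the two sequences above compute the same value; here is a minimal Rust check (illustrative only, not part of the reproducers below) showing that the two `XOR`s cancel around the shift:
```rust
// Sanity check (illustration only): for any BSR result in 0..=31,
// ((bsr ^ 0b11111) << 2) ^ 0b1111100 == bsr << 2, i.e. the `xor 0b11111`
// before the shift and the `xor 0b1111100` after it cancel each other out.
fn main() {
    for bsr in 0u32..=31 {
        assert_eq!(((bsr ^ 0b11111) << 2) ^ 0b1111100, bsr << 2);
    }
    println!("the XOR pair is redundant for all BSR results");
}
```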
---
LLVM also handles expressions like `constfn(val > 0 ? 31 - clz(val) : def)` well, compiling them to asm code like:
```assembly
bsr %[out], %[val]
cmove %[out], %[def]
... constfn transformations
```
But when `def` is statically known, LLVM tries to precompute the value returned for the zero-`val` case,
and transforms `constfn(val > 0 ? 31 - clz(val) : def)` into `val > 0 ? constfn(31 - clz(val)) : constfn(def)`, which results in code generation like:
```assembly
bsr %[tmp], %[val]
... constfn transformations
test %[val], %[val]
mov %[out], CONST
cmovne %[out], %[tmp]
```
Expected:
```assembly
mov %[tmp], CONST
bsr %[out], %[val]
cmove %[out], %[tmp]
... constfn transformations
```
---
Together, these two issues lead to quite suboptimal code generation in some cases, for example:
```assembly
bsr %[tmp], %[val]
xor %[tmp], 0b11111
shl %[tmp], 2
xor %[tmp], 0b1111100
add %[tmp], 123
test %[val], %[val]
mov %[out], 1
cmovne %[out], %[tmp]
```
Whereas the same can be rewritten as:
```assembly
mov %[tmp], CONST
bsr %[out], %[val]
cmove %[out], %[tmp]
lea %[out], [4*rax + 123]
```
The optimal implementation can be achieved by using an inline-assembly `BSR`+`CMOVE` implementation (see the examples below).
The same issues can be reproduced for:
- `7 - llvm.ctlz.i8(x)`,
- `15 - llvm.ctlz.i16(x)`,
- `31 - llvm.ctlz.i32(x)`,
- `63 - llvm.ctlz.i64(x)`.
## Expected improvements
- Refined conditions for when the transformation of `constfn(BITS - clz(val))` into `constfn(BITS) - constfn(clz(val))` is beneficial.
On x86-64, `clz` is still usually lowered to `BSR` (rather than `LZCNT`) for various reasons, so this early transformation leads to unnecessary subsequent `XOR`s.
- Refined conditions for when the transformation of `constfn(val > 0 ? BITS - clz(val) : def)` into `val > 0 ? constfn(BITS - clz(val)) : constfn(def)` is beneficial.
On x86-64, `CMOV` can use the flags set by `BSR`, so this transformation is sometimes suboptimal.
A similar issue exists for shift instructions: https://github.com/llvm/llvm-project/issues/42942
In general:
- `constfn(val > 0 ? 31 - llvm.ctlz.i32(val) : def)` and similarly `constfn(val > 0 ? 63 - llvm.ctlz.i64(val) : def)` are usually better compiled on x86 and x86-64 to something like:
```assembly
mov %[tmp], CONST
bsr %[out], %[val]
cmove %[out], %[tmp]
... constfn transformations
```
Similarly, `constfn(val > 0 ? 15 - llvm.ctlz.i16(val) : def)` and `constfn(val > 0 ? 7 - llvm.ctlz.i8(val) : def)` should be compiled to analogous assembly, accounting for the missing 8-bit `BSR` and for more optimal zero-extension operations.
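As an illustration of the 16-bit case (a hypothetical helper, not part of the reproducers below), the same pattern in Rust:
```rust
// Hypothetical 16-bit analogue of `bsr_or_default` (illustration only):
// `u16::leading_zeros` counts within 16 bits, so for a nonzero value
// `15 - value.leading_zeros()` is the index of the highest set bit.
pub fn bsr16_or_default(value: u16, value_if_zero: u32) -> u32 {
    if value > 0 {
        15 - value.leading_zeros()
    } else {
        value_if_zero
    }
}
```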
## Other cases
- Not applicable when `llvm.ctlz` is lowered directly to `LZCNT` (encoded as `REP BSR`); the issues above only affect the `BSR`-based lowering.
## Benchmarks
For example, on an i9-11900H using the Rust Criterion benchmarking library (a minimal harness sketch is shown after this list):
- `byte_sr_or_one_asm` or `byte_sr_if_nonzero` with a blackboxed const is about 33% faster than `byte_sr_or_one`.
- `bsr_or_const_and_transform_asm` or `bsr_if_nonzero_and_transform_asm` with a blackboxed const is about 50% faster than `bsr_or_const_and_transform`.
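A minimal Criterion harness along these lines (a sketch; it assumes the `byte_sr_or_one*` functions from the Rust example below are in scope and is not the original benchmark code):
```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Assumes `byte_sr_or_one` and `byte_sr_or_one_asm` from the Rust example below.
fn bench_byte_sr(c: &mut Criterion) {
    c.bench_function("byte_sr_or_one", |b| {
        b.iter(|| byte_sr_or_one(black_box(0x1234_5678)))
    });
    c.bench_function("byte_sr_or_one_asm", |b| {
        b.iter(|| byte_sr_or_one_asm(black_box(0x1234_5678)))
    });
}

criterion_group!(benches, bench_byte_sr);
criterion_main!(benches);
```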
## Example in C
https://godbolt.org/z/71oY3h1e6
```c
#include <stdint.h>
uint32_t bsr(uint32_t value) {
return 31 - __builtin_clz(value);
}
uint32_t bsr_or_default(uint32_t value, uint32_t value_if_zero) {
return value > 0 ? bsr(value) : value_if_zero;
}
// Alternative `bsr_or_default` that provides more optimal code generation.
uint32_t bsr_or_default_asm(uint32_t value, uint32_t value_if_zero) {
uint32_t output;
__asm__(
"bsrl %[value], %[output] \n"
"cmovel %[value_if_zero], %[output] \n"
: [output] "=&r" (output)
// "rm" input constraint can not be used, because it worsens codegen: https://github.com/llvm/llvm-project/issues/20571
: [value] "r" (value), [value_if_zero] "r" (value_if_zero)
: "cc"
);
return output;
}
// Suboptimal code generation.
uint32_t byte_sr_nonzero(uint32_t value) {
return bsr(value) >> 3;
}
// ========
// Realistic example from real code.
// ========
// Good code generation for statically unknown `value_if_zero`.
uint32_t byte_sr_if_nonzero(uint32_t value, uint32_t value_if_zero) {
return bsr_or_default(value, value_if_zero) >> 3;
}
// Same using inline assembly `BSR` implementation.
uint32_t byte_sr_if_nonzero_asm(uint32_t value, uint32_t value_if_zero) {
return bsr_or_default_asm(value, value_if_zero) >> 3;
}
// Suboptimal code generation for statically known `value_if_zero`.
uint32_t byte_sr_or_one(uint32_t value) {
return byte_sr_if_nonzero(value, 0x08);
}
// Improved code generation using assembly `BSR` implementation.
uint32_t byte_sr_or_one_asm(uint32_t value) {
return byte_sr_if_nonzero_asm(value, 0x08);
}
// ========
// Artificial more complex example.
// ========
// Good code generation for statically unknown `value_if_zero`.
uint32_t bsr_if_nonzero_and_transform(uint32_t value, uint32_t value_if_zero) {
return (bsr_or_default(value, value_if_zero) << 2) + 123;
}
// Same using inline assembly `BSR` implementation.
uint32_t bsr_if_nonzero_and_transform_asm(uint32_t value, uint32_t value_if_zero) {
return (bsr_or_default_asm(value, value_if_zero) << 2) + 123;
}
// Suboptimal code generation for statically known `value_if_zero`.
uint32_t bsr_or_const_and_transform(uint32_t value) {
return bsr_if_nonzero_and_transform(value, 321);
}
// Improved code generation using assembly `BSR` implementation.
uint32_t bsr_or_const_and_transform_asm(uint32_t value) {
return bsr_if_nonzero_and_transform_asm(value, 321);
}
```
## Similar example in Rust
https://godbolt.org/z/e3K116oqx
It results in the same assembly, but slightly different LLVM IR.
<details>
<summary>Click to expand</summary>
```rust
use std::num::NonZero;
#[no_mangle]
pub fn bsr(value: NonZero<u32>) -> u32 {
31 - value.leading_zeros()
}
#[no_mangle]
pub fn bsr_or_default(value: u32, value_if_zero: u32) -> u32 {
NonZero::new(value).map_or(value_if_zero, bsr)
}
// Alternative `bsr_or_default` that provides more optimal code generation.
#[no_mangle]
pub fn bsr_or_default_asm(value: u32, value_if_zero: u32) -> u32 {
let output;
unsafe {
core::arch::asm!(
"bsr {output:e}, {value:e}",
"cmove {output:e}, {value_if_zero:e}",
value = in(reg) value,
value_if_zero = in(reg) value_if_zero,
output = out(reg) output,
options(pure, nomem, nostack)
);
};
output
}
// Suboptimal code generation.
#[no_mangle]
pub fn byte_sr_nonzero(value: NonZero<u32>) -> u32 {
bsr(value) >> 3
}
// ========
// Realistic example from real code.
// ========
// Good code generation for statically unknown `value_if_zero`.
#[no_mangle]
pub fn byte_sr_if_nonzero(value: u32, value_if_zero: u32) -> u32 {
bsr_or_default(value, value_if_zero) >> 3
}
// Same using inline assembly `BSR` implementation.
#[no_mangle]
pub fn byte_sr_if_nonzero_asm(value: u32, value_if_zero: u32) -> u32 {
bsr_or_default_asm(value, value_if_zero) >> 3
}
// Suboptimal code generation for statically known `value_if_zero`.
#[no_mangle]
pub fn byte_sr_or_one(value: u32) -> u32 {
byte_sr_if_nonzero(value, 0x08)
}
// Improved code generation using assembly `BSR` implementation.
#[no_mangle]
pub fn byte_sr_or_one_asm(value: u32) -> u32 {
byte_sr_if_nonzero_asm(value, 0x08)
}
// ========
// Artificial more complex example.
// ========
// Good code generation for statically unknown `value_if_zero`.
#[no_mangle]
pub fn bsr_if_nonzero_and_transform(value: u32, value_if_zero: u32) -> u32 {
(bsr_or_default(value, value_if_zero) << 2) + 123
}
// Same using inline assembly `BSR` implementation.
#[no_mangle]
pub fn bsr_if_nonzero_and_transform_asm(value: u32, value_if_zero: u32) -> u32 {
(bsr_or_default_asm(value, value_if_zero) << 2) + 123
}
// Suboptimal code generation for statically known `value_if_zero`.
#[no_mangle]
pub fn bsr_or_const_and_transform(value: u32) -> u32 {
bsr_if_nonzero_and_transform(value, 321)
}
// Improved code generation using assembly `BSR` implementation.
#[no_mangle]
pub fn bsr_or_const_and_transform_asm(value: u32) -> u32 {
bsr_if_nonzero_and_transform_asm(value, 321)
}
```
</details>
## Produced assembly
`x86-64 clang (trunk)` with `-O3` or `rust +nightly` with `-C opt-level=3`
```assembly
bsr:
bsr eax, edi
ret
bsr_or_default:
bsr eax, edi
cmove eax, esi
ret
; Alternative `bsr_or_default` that provides more optimal code generation.
bsr_or_default_asm:
bsr eax, edi
cmove eax, esi
ret
; Suboptimal code generation:
; Redundant `XOR`s.
byte_sr_nonzero:
bsr eax, edi
xor eax, 24
shr eax, 3
xor eax, 3
ret
; ========
; Realistic example from real code.
; ========
; Good code generation for statically unknown `value_if_zero`.
byte_sr_if_nonzero:
bsr eax, edi
cmove eax, esi
shr eax, 3
ret
; Same using inline assembly `BSR` implementation.
byte_sr_if_nonzero_asm:
bsr eax, edi
cmove eax, esi
shr eax, 3
ret
; Suboptimal code generation for statically known `value_if_zero`.
; Redundant `XOR`s and `TEST` because of reordering.
byte_sr_or_one:
bsr ecx, edi
xor ecx, 24
shr ecx, 3
xor ecx, 3
test edi, edi
mov eax, 1
cmovne eax, ecx
ret
; Improved code generation using assembly `BSR` implementation.
byte_sr_or_one_asm:
mov ecx, 8
bsr eax, edi
cmove eax, ecx
shr eax, 3
ret
; ========
; Artificial more complex example.
; ========
; Good code generation for statically unknown `value_if_zero`.
bsr_if_nonzero_and_transform:
bsr eax, edi
cmove eax, esi
lea eax, [4*rax + 123]
ret
; Same using inline assembly `BSR` implementation.
bsr_if_nonzero_and_transform_asm:
bsr eax, edi
cmove eax, esi
lea eax, [4*rax + 123]
ret
; Suboptimal code generation for statically known `value_if_zero`.
; `XOR`s, `TEST`, `SHL`, and `ADD` can be reordered and replaced with a single `LEA`.
bsr_or_const_and_transform:
bsr ecx, edi
xor ecx, 31
shl ecx, 2
xor ecx, 124
add ecx, 123
test edi, edi
mov eax, 1407
cmovne eax, ecx
ret
; Improved code generation using assembly `BSR` implementation.
bsr_or_const_and_transform_asm:
mov eax, 321
bsr ecx, edi
cmove ecx, eax
lea eax, [4*rcx + 123]
ret
```
</pre>
<img width="1px" height="1px" alt="" src="http://email.email.llvm.org/o/eJzcW1tz4rjy_zTKiwrKyIHAQx6AJP-d-s_ubCVTe87ZF0q2G6wTWWIlOSHz6U9JvuArt2R2ppZKzQBqSX39dbcsqNZsIwBu0XiBxndXNDWxVLffYuBURFeBjN5uEfER8fFTmiRUvSHvDnnz7N_Pn__4FRtFhV5LlVDDpNBYrjHstgq0dh9fmYkxmniLp0c08TAi0wUz-CmkAj_CCygNiMwwE9qoNLQrYA400thIrCBKRUSFsfP__cXO15iKyH78ev_01a5nP24VvIAw2n7_-X7uvlbMxAkYFmosBd5NJ45yN50MJtfDqgy5dHdgKOO6OvJpjRNqYrCChZTzN7yVWrOAA2YGcymfNebsGbBTA-y2VES6JrsdRf7c8hVKoc1aIDL1R3iAQ_4NkekL5YjM7N_EswI36KxiBnj_TdekmBqsQKfcaMwEpjiUEeANCFA002bGQy7YxMv-qNaQBDy3ZqAVti9Exmi8kKlB4ztElhgilhHsZDeBF4zsKyPSMW8RkJPme16DvaoZ7ndbCA1E7xbC8tdFQA5sPhgMqh_nXMszzf1COUb-PfYw8h9wy_jYTohgndvzFTi3nkB1klnyPebLPtt9xncZZZjIF-iltGwUlMPhsPC8ZoR3qmuRGvwag8AmBqsAu9bEw0xjbcr4eRbyVdj9cuBg4OJ8qyCUyTY1gF8oTwErMKkSEGUe_Q2UCw0ryMTDIbWQscy2tUFdsqffo3lT7FGZczho3QJ7kspS3yEoTbLttepRWxnQZr9UPrVzqUS-dLrS8stvT1_3PiSgx4VyNt8fzXVG9tJXGPkot6_yfK7bd6LEQprYxoEGzLROQWPNkpQbKkCmmr_tM9xWgTFvWKeB3BqWUN7yFBtAMgHn9fqD_aaOynvKFqp3EXUie2uNAtlpFHWSjYj_US46qrnnpf75rxgUUO1gTFOneIEDC0mvihkDp8XuD_JeDrR7zfHiGpG5ojuMyMIpvUcJX2PAhSuyZMshAWEyT8wVQcOYwYvF5jVONRMbzARnAnChgaLYI4vlr1_-uHdZoL4SIlMNgGFH7fcaB8DlKyKzWl32tTBAHkKlHbZKRmkIEV5LlYef3fIGDzDnL8kwNPzbkE0Rme4yPB5WqEbjOtlosqcrk4qjdIhfofRJH-XEr1NOrjsoa8VmAYJWMUq-ONXk-IJxsfIjrJnNgaEUEctK6zLB1qGpnvYWn74-nVZhWsoTakymsYAQtKbqbVhw-UXkxXSltnfJnnGOU526fJ9ql8O1ARrZvgBNvOXnPy3pWir8QhWTqcY23iy-kiXWEgNV_K3VVFh6h4MZsUNQK08qStYsimr4K4Vav1DzqcuVWq8LOlR8VjXRbaLeeuK4AWykZbWRsErHa043GmswOHirGMhVSzHTTVHtNKtdwxIXjpWEVO73xBLGqcrCMbNHzNam2rlpK0BszNa-QeQBkYcNM3EaDEOZIPJgYyT_b7BV8r8QGkQesvhG5OGazK5JrQETeSbklWg7WN8147XTNlYLOpMmQ6u-JTsDu3tJBaXTB2AMKOsBAWBb1zIOUbsJtQRO57HF0J84qZxbEs1PVW8nFvda7MBCHdDfuYyOZcqjmlmMzDJ97txlDnOnFgq0xWmLXAnTLtdNBwEzePH06HhKpNqnS9umDGBnQESWUm7zEk53HTd8MTGorKhrINRv0mC63XIW0oBDDiQFwFZOSV5jFsYWGiKmIDRVzMzkQhOvVEqhxFJu6nqlx_vfcbZ0F4-_F4m27obZvw9SFQk8y9_Ww9lsMBrNPO-XvDJ4TLXBoWIGlGU5ABHGCVXPmccHiqq30uldcAdvBlZaraRaSQErqhPLuVTVMbZeCSlWVt2uY7aWCjgNnwO5y5BdGxeQNJCpwb6PyBivqXZBGVPR3qdRIwTZgFtoRUW0KhXbYKjOTDfpMfbGXhd7vRz0mOo-t4QUuFZvNMBYRoHkZijVBpGHb4g83Izkf_x4BJPamkU8h-UeTIQ8jQAjf6lNxIQZxsi_r05KmTA-WRkcaIXItPzoGnoXijeLIpXgvMHPMHu1ClLGDROrMh-6GcjPJ6Cbu76drJYiWNOUm45Nl7j-jbWWc5sedrLDhz2uZKLsJfDnjXV6OMy0jefcgBLUsBfrYQ12s0MCW_2xCDSmdSxpdIHDg5I7V3uH9CWZTM02NaVUdmxlF1-tEJnm3yFCAq34PqOkUMsc-RrjO4zGS4EI2a-VJSbiMk9tgZKzcxby57hORQjy7xCZKESIbTHyITJrcuCMgwhRiaVkYpuaLCwVZcK4YkhIY9OErV8tRwGE1JZVzOBXqTQI7Sy0AfG-iod445tRp1yFYh2fuUBlYGQdXVN1LdKKwTv2ICQM91qtxlslIhoe0e3nT72nFy2_zWFXSJExdiJOtALx3gapf4Qv6w6df1WiR6CcacPCMputlUxsm5FJMzxnySrp_0kZtU5zXNm8Pw9NhTsRzVuFisVKlG-pLks4Pdo7F_BaCFqu055-ms6fbMd-8FigdRxwXNL3AlyntPmi75e4_-yuYe2zbZ1XJydHSZeLlPJ5O296IKvm0nzKDiTanpvZ9GJjVgq6y8Vpmuw0kU6CgbkybM1CRnmWiG2hzGFXgMJPgwINfdRKw0sjJNc3ItOz8GCJ_CUm7n1-pvj9oeGA-B8EEi0tnIATZ2viO0BGf8dwON4q2HjAsUrpfTL6ASByuCE7o4o45j3H5ex6ZlBckMgPEWDfjNkG-Jx-DPz_H40m8q9dddKje5KYxc3-uUTZlNv6NDWY4oAZHLH1GhQI4x6xDj491rtFfxnl9xzK7s1f6vxih3-_5Cx8xkbmT7etZ5OH_XBnk6hKEW2FrI17sufPRZpkb36T4s9ar1SqbLwQcpVQseFQHjZt0wCv6wWfP8flGsvUJ5YRMsMDm5xTn9RN7TpKN2_IgUZMbFzEaNvAFGVwy2mPsdKFif7cbt4GhOL7TgZLOZyG4LVS1A4Tul1J1a7dl5ky-njPEtLE-_Bm8yy11MLnItVgjDl0tqCp0HQNFdJQKshUSFUY5-_s_qN9m1p5ZR2rnV8sPgerRttF3SwKlrOvSPm8puxUD06syNZaoPIqDhbuMBOITBVsrBYKsOmhL9bunldxkHx-xqSjlrbvLciLPrhjH-sB7snLdJsqh3tCJpBkb7Sh4fO-d2x2iFbc6ud8m_f0icdcrtU9XgAQfa3kP7yPPFG1Ha3DpdF8SW_5XarHs0X_EDC7tNn8W8rGE1VS9p91VfSKfFL_-f3rxrOk6zT26RL2taT_7H70hNrgaDNxUWC9v0f9IRBzauPxMTq5qGP9KXDnUBN7YoSe18f-eDA61tqeIPYFfe0pTa1r_lrtYrXhPfiAGE28_KJDyKnYWDc1KhXPxUXn_GcBgy_-_omq7SWtRwq2iQ1_q9EtbbU64PACHPl3fpPb3ruQ5QPmougt7kgA3dWuh2MFtV69ATQnL5O_suahGNWN0cZeyF98lya
uAxgukwOfI0k_YOxPAfwFfuz4eUfBd6PU72K6j-HiRmo-Tq7rwzquDfvtyd0jXRY7nHediKd1BccXK-k-JoN31GonO8Yxjzis4S5_uTjx9hRkpSjvde7zRfm4XNkTIc0fQBXPpuUaK5AqAsXEpqGevJDvtXDYH0n5WG8UhX1R1D_qrnrbzbr2LC655Rofte0lYG-vcNeN3VZ5H5TAO9qFph5LnjNxpxc6XynMYee7CIhO7Cf-fhw6VK99dKoqrsfn4wduxX8wSh0r0H4yQT8Uw_bI5VgpgSv_9PTL5_xDjmvzu7viJrO77u8gzdaYIsIKtpzagtNVhRRbY3DY_-yzXvV09RGdhcQB-Kui2KgZnrw6XNykacwaNZGz-CVMOd6Fjna8CZBNZLz2bk4Ex_z1HTHycBfTi5c5sJHReXmpdP98nO7a7t9w_bDD9VtqqbdBV9GtH838Gb2C29GNTzxCrol_Fd9Oo-vrYDaK1tNxOPU9b-yNyWRGxzfrm3Uwibwrdks8cj0ajWYjO80fzibEm87C0PfpdBKEN-jag4QyPnQXdKXaXLkrYbej0eRmNrniNACu3U-yCRHwmt25R4Sg8d2VunW3yYJ0o9G1Z6tLvV_GMMPh9lAEUyVTEbkbzPkd56LEqF7kb19Xv0oVv738plsu2cst-V8AAAD__3KObDw">