<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/82487>82487</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
Count-trailing-ones emulation emit is suboptimal
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
Validark
</td>
</tr>
</table>
<pre>
For my RISC-V Sifive-u74, LLVM emits David Seal's algorithm for count-trailing-zeroes emulation, since it lacks nice bit manipulation instructions.
Unfortunately, when I write a count-trailing-ones function, it misses the obvious optimization. In Zig: [Godbolt link](https://zig.godbolt.org/z/Tac6YbT1q)
```zig
export fn ct1(x: u64) u8 {
return @ctz(~x);
}
```
```asm
.LCPI0_0:
.quad 151050438420815295
.LCPI0_1:
.ascii "\000\001\002\007\003\r\b\023\004\031\016\034\t\"\024(\005\021\032&\017.\0350\n\037#6\0252)9?\006\f\022\030\033!'\020%-/\036518>\013\027 $,47=\026+3<*;:"
ctz:
li a1, -1
beq a0, a1, .LBB0_2
lui a1, %hi(.LCPI0_0)
not a0, a0
neg a2, a0
and a0, a0, a2
ld a1, %lo(.LCPI0_0)(a1)
mul a0, a0, a1
lui a1, %hi(.LCPI0_1)
srli a0, a0, 58
addi a1, a1, %lo(.LCPI0_1)
add a0, a0, a1
lbu a0, 0(a0)
ret
.LBB0_2:
li a0, 64
ret
```
The same issue occurs in C as well: [Godbolt link](https://godbolt.org/z/oxKPWKnTs)
```c
uint64_t ctz(uint64_t x) {
if (x == 0) return 64;
return __builtin_ctzll(x);
}
uint64_t ct1(uint64_t x) {
return ctz(~x);
}
```
```asm
.LCPI1_0:
.quad 151050438420815295 # 0x218a392cd3d5dbf
.LCPI1_1:
.ascii "\000\001\002\007\003\r\b\023\004\031\016\034\t\"\024(\005\021\032&\017.\0350\n\037#6\0252)9?\006\f\022\030\033!'\020%-/\036518>\013\027 $,47=\026+3<*;:"
ct1(unsigned long): # @ct1(unsigned long)
li a1, -1
beq a0, a1, .LBB1_2
.Lpcrel_hi2:
auipc a1, %pcrel_hi(.LCPI1_0)
not a0, a0
neg a2, a0
and a0, a0, a2
ld a1, %pcrel_lo(.Lpcrel_hi2)(a1)
mul a0, a0, a1
.Lpcrel_hi3:
auipc a1, %pcrel_hi(.LCPI1_1)
srli a0, a0, 58
addi a1, a1, %pcrel_lo(.Lpcrel_hi3)
add a0, a0, a1
lbu a0, 0(a0)
ret
.LBB1_2:
li a0, 64
ret
```
David Seal's algorithm starts by checking our input, `x`, against 0. If `x` is 0, we return 64 (for 64-bit count-trailing zero). Then, we compute `y = x & -x`, where `y` contains just the lowest `1` bit of `x`. Thus, `@ctz(y) ≡ @ctz(x)`, but crucially, there are only 64 possible values of `y` (64-bit bitstrings with a popcount of 1). The rest of the algorithm is a perfect hash into a lookup table.
If we trim down the algorithm, LLVM finds the optimization. [Godbolt link](https://zig.godbolt.org/z/v61Efe4Gh)
```zig
export fn ctzStart(x: u64) u64 {
if (x == 0) return 64;
return x & (~x +% 1);
}
export fn ct1Start(x: u64) u64 {
return ctzStart(~x);
}
```
```asm
ctzStart:
neg a2, a0
li a1, 64
beqz a0, .LBB0_2
and a1, a2, a0
.LBB0_2:
mv a0, a1
ret
ct1Start:
addi a1, a0, 1
not a2, a0
li a0, 64
beqz a1, .LBB1_2
and a0, a1, a2
.LBB1_2:
ret
```
The optimization is when we insert `~x` in for `x` in `x & -x`, getting `~x & -~x`. It happens that `-~x` is the same as `x + 1`. It also happens that checking `~x == 0` can be performed by checking `x + 1 == 0` or, ya know, `~x == 0`. In the count-trailing-ones function, LLVM instead loads a `-1` constant and compares against that, which can't be used for anything else. It also takes the bitwise inverse of `x` (`n = ~x`), and feeds that into `n & -n`. This is worse than simply doing `~x & (x + 1)` because `x + 1` and `~x` may be computed in parallel, whereas with `~x & -~x`, the hardware probably isn't going to fold `-~x` into an increment.
This is the emit I get with the optimization written out manually:
```asm
.LCPI0_0:
.quad 151050438420815295
ct1:
li a1, 64
addi a2, a0, 1
beqz a2, .LBB0_2
lui a1, %hi(.LCPI0_0)
not a0, a0
and a0, a0, a2
ld a1, %lo(.LCPI0_0)(a1)
mul a0, a0, a1
lui a1, %hi(.L__unnamed_1)
srli a0, a0, 58
addi a1, a1, %lo(.L__unnamed_1)
add a0, a0, a1
lbu a1, 0(a0)
.LBB0_2:
mv a0, a1
ret
.L__unnamed_1:
.ascii "\000\001\002\007\003\r\b\023\004\031\016\034\t\"\024(\005\021\032&\017.\0350\n\037#6\0252)9?\006\f\022\030\033!'\020%-/\036518>\013\027 $,47=\026+3<*;:"
```
<!--
The count-trailing-zeroes of the previous expression is the same as the count-trailing-zeroes of the input, i.e. `@ctz(x & -x) == @ctz(x)`, for `x != 0`, but we now reduced the problem so that there are only 64 possible bitstrings. Then we multiply by a magic number and grab the upper 6 bits, as a perfect hash function mapping the 64 possible bitstrings with a popcount of 1 into a 64-entry lookup table with the count-trailing-zeroes precomputed.
-->
</pre>
<img width="1px" height="1px" alt="" src="http://email.email.llvm.org/o/eJzsWc1u4zgSfhrmUrAhUZYtH3JInM4gmF5gsN07i91LQEllixOKVJNUHOfQz74oSrItx0lP92QHcxgjUCz-1B_J76uihXNyoxEvWXrN0psL0frK2MtfhZKlsA8XuSl3l7fGQr2Df959Wk1-hU9yLR9x0i5mjK_g48df_wFYS-_gRjzKEj6hUIwvHAi1MVb6qoa1sVCYVvuJt0IqqTeTZ7QGHWDdKuGl0STKSV0gSA9KFA8OtCwQcumhFlo2_TiQ2nnbFvTdTVl0w6Kr7vkvvTbWt1p4VDsSt61Qwx1srfQI4tQAo9HButXFoJ0USefQga8QTP4oTevANF7W8jnonsKdhv_KDUuugKXXP5kyN8qDkvqBpTeMZ5X3jWPJFeO3jN8-y810042ZGruhFsZvP4ti_p_8c_yF8eWx-WwedX_PctO14FNjrIe1hsLHjGdPpLedzxhfQpsBW1x34wAALPrWamCzqPDPjGdfn0h80o9gi5sTJWc1C1d3LdOPq1_uovuIXNmroM_0SytKAIjTOEqjWZLNeJTFKV-mo4nxy4nCFVICMM5ZuoqiKDzj8OxaFuGZsHRlWbrK6Y0noW1GzySMjefhO7V4lq56aXzGeBaGpuE1DE0442F0vJiG95R06vB1wXgS-njKGV8uWXIbplPbOrQHuUmwMkkYjxkP9vGI8XRCy0sd8zTOWPIhKAmm8gUwMmY1W7DkJrTMGb9OWLJi_IqWg_YG70JDC3UaJiW7_yKmHTmJx705ful6I-rtxkw_Xl9H9_xETCuPxDCeVpLxbL-qw8Ybhmvjj-VGJ7246Xr5uV6hy_Fcep6aU8LYHGVOzOEZ9Z2YVbfqpej4Ozx9KdLZLsLHItPsxKOylAeRZ21-KViUZ8JwamveHo2JyOmXa2HRD0epW9hXt0iQMp-9Mv3sSf9cIThRI0jnWgRTFK11IDWsQDjYolK_E9pewpp5-vmXf_-sP7vXYK3o3lup_Xx276HDqf0r4dUY0uQaCPSATlJyQ_FaDjA3n-2h7Qj87u_zViov9X3hn5UKiHkWBE8Nid82pBf_bsAafxewwujDeALRE48zkSx5USZlWubrkey_sfd3YW9Ycx1ynxKU0ZuwrFfwjQ_FP5Ds2fnvgOXxgOXTj01hUd1X8iUGiFY2xTHqDUMHiIr_CijfGdXj5sGZH4H7w_zkR4LxrkRw1q3kT2KE-J0Z4dWU3XlhvYN8B0WFxYPUGzCtBamb1odAzKMnkkVebQQl5RBN4W49dIB0EHq3eEBtAnSqBeazCeX144wcqCRgfDmFzxXqfmph6qb1SFJ3xATwBIzPYTLo3lZou17SWRjtyRb4rXU-pPHKbNF5GhDTANJq9jaSptb13uxT510ggA-cZUt2FR9S6gD8nda89VDYtpBCdaWGD2YIi2C02pGnjXFO5grhUagWXa81WMl41gcgl955K_XGwVb6CgQ0pglRofHxEAyw5INZB48OSyQdTUC7xsJDJVwFUnsDApQxD20DXuQKRyXS3Zpi6q2soTRbPZa3L-bWUpd9ETQqfn645Hmcxx_WOPup-q6S5_kTbcHTuoc20R_KEroNFJgcGL9mPA2Bfi1NGBVhv8uiQ7owDP9jOcNe0Om5fxuyx_Rzigo5fnk-YMbZImKP-XGP9kdKXktO68djKDqFuwMUDSR83rET4A2y4lfI7G3fzyLi3veXpPvC9yN2FvxtGP526n18nOj0hguKLYLUDm0Aqa8ddOpwY7JHUh2-joFvg94TaIZJXdfXDtPuCAyaBjWdYRHETr4OkOyHCkC4Qeg1xMM8oZwZT96Df69nOGUEtkJDjgGAjK2xHHHFQfZojrFk-07AgzbbHnjHYsMdC1n5reuaAFXEOygoARMloSH5GvdM4LzQPqwkkYiw6PZMRa517CGLihxhfOHJmdZhGWIv9M5X5Agqh4fYePHQ3w_l0m-lo7V7ROvwiFUIW9g80oGuvnbrtQwbSJewRiz70Aaw7gbS6umekKQLW8OQUF8JDU7WjdpBacbL3aFeWL3AS5BjIVqH42UNSvcbqxY78rIn1ZJ2ViOsUArVnkxFz0WnG6unOaiELbfEdI01ucjVDqTr4rcJFnoDa6PK420XWEmD1IXFGrWfjg9G5zHJxlp6uKO93dlwykHhLs-jBtOGW8E28O9wEN_9Nouy_LdvaE6hZY9c_FXk2qMP_3Oub_6iFzT3963Wosby_3JL85b070rM4zOJ-buQ39jGv0v2b5bs57MlGhlPJnAg2fO_M_TJc2Oxu9bHp8aicz0RH7PiGe4Zy9iXQHKK01HlsKdoKiA6SjtXPQzcDhSonvaGqmKLoM0WLJZtgWVvs8kV1kD0Q7zxRq1xqCe6IorE1a3ykggk34GAWmxkAbqtc7QBGjZW5EFN2zRoYR5khP37oroY2Bdq0TQB6it8Rf3ZcmYoT-azCWpvd6M65QD454PfWBxoq6ePyYQlHy7Ky6RcJktxgZfxIsrSOFvy-KK6LJYRzpa4iIvlYhbFs2idZrNiWSZY5ssiSy_kJY_4LOI8jnmSxstpEmG5nkfzLMri5TyN2SzCWkg1VeqxpmrmItyaXmZ8li0ulMhRufCrGecat92VajhhNxf2kuZM8nbj2CxS0nl3kOKlV3i5OpPe7H8M64hQOnBtHvhPqIvWqsuTS1jpqzafFqZm_JbE9_8mjTW_YeEZvw1GOcZvg9H_CwAA__8EG2Iw">