[PATCH] D117912: [IR] document and update ctlz/cttz intrinsics to optionally return poison rather than undef

Craig Topper via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Jan 21 15:55:17 PST 2022


craig.topper added a comment.

In D117912#3262669 <https://reviews.llvm.org/D117912#3262669>, @zielaj wrote:

> I'd like to provide more context for https://github.com/llvm/llvm-project/issues/53330. This issue arose while working on performance-critical code in primesieve <https://github.com/kimwalisch/primesieve>, which uses the output of `__builtin_ctzll` to do an array lookup <https://github.com/kimwalisch/primesieve/blob/master/include/primesieve/Erat.hpp#L85>. It turns out that, because of poorly predictable branches, it's faster for us to compute `__builtin_ctzll(0)`, do the lookup, and then ignore it later, than to call `__builtin_ctzll(x)` conditionally on `x != 0`. Note that we don't need `__builtin_ctzll(0)` to be equal to 64 or to any specific value, we just need to be able to bound it somehow so that the array lookup doesn't segfault. This is where the idea of `& 0x7f` came from. The hope was that when compiled on processors that support `TZCNT` this would be optimized out to nothing, yet still work on processors that don't, at a cost of one additional instruction (AND).
>
> The fastest workaround we have is computing `__builtin_ctzll(x | (1ull << 63))` instead. This adds one more 1-cycle instruction (OR), is always bounded, and is equivalent to `__builtin_ctzll(x)` for non-zero x. Still, since the calls are in a hot loop, this additional instruction results in a measurable slowdown. This is why primesieve currently uses inline assembly to force `TZCNT` when possible: we couldn't find a C++-only way to accomplish this.
>
> This comment is not a disagreement with this proposal, it just provides more context. I accept that our need may be too obscure to be addressed by clang, and wanted to say thank you for responding to the issue I raised so quickly and for clarifying how undefined `__builtin_ctzll(0)` is.
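
For reference, the OR workaround described in the quoted comment can be sketched roughly as follows. This is only an illustration, not primesieve's actual code; the helper name `bounded_ctz` is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: trailing-zero count that always lies in [0, 63].
// Setting bit 63 guarantees a non-zero argument, so the builtin's
// zero-input case is never reached. For non-zero x the lowest set bit
// is unchanged, so the result equals __builtin_ctzll(x).
inline unsigned bounded_ctz(std::uint64_t x) {
    return static_cast<unsigned>(__builtin_ctzll(x | (1ull << 63)));
}
```

With this form a zero input yields 63, which is enough to keep a subsequent array lookup in bounds.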

Does `x == 0 ? 64 : __builtin_ctzll(x)` not produce tzcnt when it's available? I would hope you wouldn't need to resort to inline assembly.
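
A minimal sketch of the guarded form, with zero mapped to 64 (the value TZCNT itself produces for a zero 64-bit operand). The helper name `ctz_or_64` is hypothetical, and whether the compiler folds the branch into a single tzcnt depends on the target flags in use (for example, BMI being enabled); that is the hope, not something verified here.

```cpp
#include <cstdint>

// Hypothetical helper: defined for every input, returning 64 for zero.
// The builtin is only evaluated when x is non-zero, so its
// zero-input case is never reached.
inline unsigned ctz_or_64(std::uint64_t x) {
    return x == 0 ? 64u : static_cast<unsigned>(__builtin_ctzll(x));
}
```

Note that a result of 64 would still need a lookup table with at least 65 entries if it is used as an index.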


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D117912/new/

https://reviews.llvm.org/D117912


