[libcxx-commits] [libcxx] libcxx: Optimizations for uniform_int_distribution (PR #140161)
via libcxx-commits
libcxx-commits at lists.llvm.org
Wed May 21 19:55:41 PDT 2025
LRFLEW wrote:
> Do you think we should use a different algorithm that has improved performance?
Short Answer: Maybe, but what to replace it *with* is a non-trivial problem. Also, this PR addresses a lot of the concerns I have with the current method.
For context, I'm going to reference the algorithms as described in [this blog post on the PCG website](https://www.pcg-random.org/posts/bounded-rands.html), because it's one of the best write-ups of the various resampling methods I've found. Note, however, that the code snippets (and, to a lesser extent, the algorithms themselves) assume a much narrower contract than `std::uniform_int_distribution`. The algorithms on that website are all narrowing (i.e. the output range is smaller than or equal to the PRNG's range), and the PRNG is assumed to be a 32-bit PRNG (i.e. its output range is the full range of unsigned 32-bit integers). In contrast, `std::uniform_int_distribution` needs to support any possible input or output range.
The algorithm used here is basically the "Bitmask" method described there. This method comes with a critical trade-off: it avoids a more expensive division (or multiplication), but it generally performs significantly more sample rejections. Whether it performs better or worse depends on multiple factors, such as how fast the PRNG algorithm is, how fast division is on the device, whether the CPU has a CLZ instruction (or has to emulate it in software), and what kinds of ranges are actually sampled.
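For reference, here's a minimal sketch of the "Bitmask" method as the blog post presents it, assuming a full-range 32-bit PRNG (the function name is mine; this is not libc++'s actual code):

```c++
#include <bit>
#include <cstdint>

// Bitmask with rejection: mask random bits down to the smallest
// power-of-two range covering `range`, then reject out-of-range draws.
// `range` is the number of distinct values to produce (max - min + 1).
template <class URNG>
uint32_t bounded_rand_bitmask(URNG& rng, uint32_t range) {
    uint32_t mask = ~uint32_t(0);
    --range;
    // Shrink the mask to cover exactly the bits of range; `| 1u` avoids
    // countl_zero(0) when range was 1. C++20; compiles to CLZ on most targets.
    mask >>= std::countl_zero(range | 1u);
    uint32_t x;
    do {
        x = uint32_t(rng()) & mask;
    } while (x > range);
    return x;
}
```

The appeal is that the loop body is just a mask and a compare, with no division at all; the cost is that for an unlucky `range` just above a power of two, nearly half of the samples get rejected.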
There is one more thing to keep in mind here, though: the "Bitmask" method assumes the PRNG is a power-of-two PRNG. Since `std::uniform_int_distribution` can't make that assumption, it has to coerce the result of the PRNG into a binary range first. This means that non-power-of-two PRNGs, including minstd, have to go through two rejection samples, which further impacts performance. On top of that, the method used for this coercion doesn't seem to be particularly optimized for runtime performance. This PR primarily addresses the runtime performance of that step, which goes a long way toward closing the performance gap where it occurs.
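To illustrate why the double rejection hurts, here's a deliberately simplified sketch of coercing minstd into a power-of-two range by rejection alone (this is not how libc++'s actual coercion machinery works, which packs bits far less wastefully; it just shows where the second rejection loop comes from):

```c++
#include <cstdint>
#include <random>

// minstd produces values in [1, 2^31 - 2], a span of 2^31 - 2 values,
// which is not a power of two. Keep only the first 2^30 outcomes so the
// result is uniform over exactly 30 bits.
uint32_t coerce_to_pow2(std::minstd_rand& rng) {
    const uint32_t limit = uint32_t(1) << 30;
    uint32_t x;
    do {
        x = uint32_t(rng() - std::minstd_rand::min());  // shift to start at 0
    } while (x >= limit);  // rejects roughly half of all draws
    return x;  // uniform over [0, 2^30 - 1]
}
```

The output of this loop would then feed the bitmask loop above, so a single variate can pay for two rounds of rejection sampling.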
There are a lot of alternative options if we did want to change the algorithm. From what I've seen, libstdc++ uses a combination of the "Debiased Integer Multiplication — Lemire's Method" and "Division with Rejection (Unbiased)" methods, depending on the PRNG and the available instruction set. The various modulus-based methods might also be worth considering. It's also worth keeping in mind that the current method provides the widening case (where the output range is larger than the PRNG's range) basically for free, so switching methods would require deciding on an approach for that as well.
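For comparison, here's a sketch of Lemire's debiased-multiplication method under the same full-range 32-bit assumptions as above (again, the function name is mine, and this is the blog post's lazy-threshold variant, not any particular library's implementation):

```c++
#include <cstdint>

// Lemire's method: map x in [0, 2^32) to [0, range) with a widening
// multiply, rejecting the few low-word values that would introduce bias.
template <class URNG>
uint32_t bounded_rand_lemire(URNG& rng, uint32_t range) {
    uint32_t x = uint32_t(rng());
    uint64_t m = uint64_t(x) * uint64_t(range);
    uint32_t l = uint32_t(m);  // low 32 bits of the product
    if (l < range) {
        // Rejection threshold: 2^32 mod range, computed as (2^32 - range) %
        // range. Deferred so the division only runs on this rare slow path.
        uint32_t t = (0u - range) % range;
        while (l < t) {
            x = uint32_t(rng());
            m = uint64_t(x) * uint64_t(range);
            l = uint32_t(m);
        }
    }
    return uint32_t(m >> 32);  // high 32 bits are the bounded result
}
```

The common case costs one widening multiply and no division; the rejection rate is at most `range / 2^32`, which is why it tends to win on hardware with fast 64-bit multiplies.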
If there's interest in possibly changing the algorithm, I'd be open to having (and participating in) a larger discussion about it. That discussion should probably include performance testing on a wider range of hardware than I personally have access to, as we shouldn't make assumptions about all hardware based on modern high-end machines. In the meantime, I think this PR is worth considering on its own, as it addresses a lot of the concerns I have with the current method.
https://github.com/llvm/llvm-project/pull/140161