[llvm] [SelectionDAG] Use Magic Algorithm for Splitting UDIV/UREM by Constant (PR #154968)

Fri Sep 26 01:47:05 PDT 2025

mskamp wrote:

I've finally done some benchmarks for AArch64 on a Cortex A76. Here are my results:

[Latency](https://gist.github.com/mskamp/4fc756e52c01b69e9739add4d6e27d69)
[Throughput](https://gist.github.com/mskamp/af08af2b9731ed988deaa72284830d44)

The throughput numbers look a bit strange to me. I'm not sure I've done everything correctly, though, so take them with a grain of salt.

With respect to latency, the chunk-based algorithm yields better results in cases where the magic algorithm was superior in the earlier Icelake benchmark. I think this is because the Cortex A76 does not support a single operation that gives both the upper and lower word of an unsigned multiplication (such as `MUL(X)` on Icelake).

If we wanted to optimize for latency, we could use the following checklist, evaluated in order, to determine the best algorithm based on the available measurements:
- If the chunk-based algorithm does not support the given divisor: Use the magic algorithm.
- If the operation to compute is `UREM`: Use the chunk-based algorithm.
- If `IsAdd = false` in the result of `UnsignedDivisionByConstantInfo::get()` for the given divisor: Use the magic algorithm.
- If the target does not have a legal `UMUL_LOHI` instruction for the required bit width: Use the chunk-based algorithm.
- Otherwise: Use the magic algorithm.
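
The checklist above can be sketched as a small standalone function. This is only an illustration of the decision order, not code from the patch: `SplitAlgorithm`, `DivisorTraits`, its fields, and `chooseForLatency` are hypothetical names, and in the actual lowering the four inputs would presumably come from the chunk-based applicability check, the opcode, `UnsignedDivisionByConstantInfo::get()`, and the target's legality query for `UMUL_LOHI`.

```cpp
// Hypothetical names for illustration only; none of these are LLVM APIs.
enum class SplitAlgorithm { Magic, ChunkBased };

struct DivisorTraits {
  bool ChunkBasedSupported; // chunk-based algorithm can handle this divisor
  bool IsURem;              // operation is UREM (otherwise UDIV)
  bool MagicIsAdd;          // IsAdd as reported for this divisor
  bool HasLegalUMulLoHi;    // UMUL_LOHI is legal at the required bit width
};

// Applies the checklist in order; the first matching rule decides.
SplitAlgorithm chooseForLatency(const DivisorTraits &T) {
  if (!T.ChunkBasedSupported)
    return SplitAlgorithm::Magic;      // no alternative available
  if (T.IsURem)
    return SplitAlgorithm::ChunkBased; // UREM favored chunks in the measurements
  if (!T.MagicIsAdd)
    return SplitAlgorithm::Magic;      // cheaper magic sequence without the add fixup
  if (!T.HasLegalUMulLoHi)
    return SplitAlgorithm::ChunkBased; // e.g. the Cortex A76 case
  return SplitAlgorithm::Magic;
}
```

For example, a `UDIV` with `IsAdd = true` on a target without a legal `UMUL_LOHI` would select the chunk-based algorithm, while the same divisor on a target with a legal `UMUL_LOHI` would select the magic algorithm.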

Of course, this only fits (and likely overfits) the given measurements. The previously suggested check (essentially the first two lines and the last line of the checklist) would probably still yield good results in most cases; in the benchmark, it is suboptimal for only 4 data points. Hence, it is probably still a good starting point if we want to avoid overfitting.

https://github.com/llvm/llvm-project/pull/154968