[llvm] [SelectionDAG] Use Magic Algorithm for Splitting UDIV/UREM by Constant (PR #154968)

Sun Sep 7 06:34:03 PDT 2025

mskamp wrote:

@sharkautarch Thank you very much for your time and work. These are some valuable and helpful numbers.

I've started with a comparison of the approaches based on the output of `llvm-mca`. The rationale is that this way I can get a rough picture of the relative performance of both approaches for a wider array of hardware than is currently available to me. I haven't finished this and the results are only preliminary, but it still appears that the chunk-based approach is better for `urem` whereas the magic algorithm yields better code for `udiv` in most cases (i.e., when `magics.IsAdd = false`).
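For readers unfamiliar with the `IsAdd` distinction, here is a minimal sketch of how a magic multiplier for unsigned division can be derived and applied. It is a simplified textbook-style search, not LLVM's implementation: the names `magic_udiv` and `udiv_by_const` are hypothetical, and LLVM's own computation has further refinements (for example, pre-shifting even divisors) that are left out here. The `is_add` flag in the sketch plays the role of `magics.IsAdd`: it is set when the multiplier needs one bit more than the operand width, which forces the longer add-and-shift fixup sequence.

```python
def magic_udiv(d, bits=32):
    """Return (multiplier, shift, is_add) for unsigned division by d.

    Simplified search (an assumption, not LLVM's algorithm): take the
    smallest shift whose rounded-up multiplier is exact for all
    `bits`-wide dividends.
    """
    if not 2 <= d < (1 << bits):
        raise ValueError("divisor must satisfy 2 <= d < 2**bits")
    for shift in range(bits + 1):
        total = bits + shift
        m = ((1 << total) + d - 1) // d      # ceil(2**total / d)
        err = m * d - (1 << total)           # rounding error of m
        if err <= (1 << shift):              # exact for every n < 2**bits
            return m, shift, m >= (1 << bits)
    raise AssertionError("unreachable for valid d")


def udiv_by_const(n, d, bits=32):
    """Compute n // d using only multiply-high, add, sub and shifts."""
    if not 0 <= n < (1 << bits):
        raise ValueError("dividend out of range")
    m, shift, is_add = magic_udiv(d, bits)
    if not is_add:
        # Multiplier fits in `bits` bits: one multiply-high and one shift.
        return (n * m) >> (bits + shift)
    # Multiplier needs bits + 1 bits: use its low `bits` bits and fix up.
    q = (n * (m - (1 << bits))) >> bits      # multiply-high
    t = ((n - q) >> 1) + q                   # floor((n + q) / 2), no overflow
    return t >> (shift - 1)
```

With 32-bit operands, dividing by 3 yields the multiplier `0xAAAAAAAB` without the fixup, whereas dividing by 7 needs a 33-bit multiplier and therefore the extra subtract, shift and add.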

Based on your work, I've run some benchmarks for all cases listed in issue #137514 on my Intel Ice Lake CPU. As input for the experiments, I've used the following files:

- [Assembly output for Magic algorithm](https://gist.github.com/mskamp/55a9b44da49290f29c7f379cb8e10aba)
- [Assembly output for Chunk-based algorithm](https://gist.github.com/mskamp/49fe14e87e6d2bb7313ee305c647bbf1)

I've turned all the usual performance features of the CPU off (e.g., turbo boost, hyperthreading) and performed 1000 runs for each algorithm and constant. The results are summarized here:

- [Latency](https://gist.github.com/mskamp/084ca92abeeeeaddb95ef9c5dd21aa28)
- [Throughput](https://gist.github.com/mskamp/f28e64a2c37b8d90ad14fd610c55171c)

The results suggest a reasonable heuristic: use the chunk-based approach for `urem`, and the magic algorithm for `udiv` and for the cases that the chunk-based approach does not handle (3 of the 11 constants in the test). There are some cases, though, where the throughput of the magic algorithm is a bit higher than that of the chunk-based approach even for `urem` (e.g., the constant 19). Still, this heuristic seems good enough as a starting point for deciding which approach to use.
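Spelled out as code, the heuristic would look roughly like the sketch below. The function name `pick_expansion` and the `chunk_supported` flag are hypothetical; whether the chunk-based approach can handle a given constant is passed in rather than computed, since that condition is not spelled out here.

```python
def pick_expansion(opcode, chunk_supported):
    """Choose an expansion strategy for a udiv/urem by constant.

    `chunk_supported` is an assumed input: True if the chunk-based
    approach can handle the constant at all.
    """
    if opcode not in ("udiv", "urem"):
        raise ValueError(f"unexpected opcode: {opcode}")
    # Chunk-based expansion only for remainders it can handle.
    if opcode == "urem" and chunk_supported:
        return "chunk"
    # Everything else, including all divisions, uses the magic algorithm.
    return "magic"
```

A constant such as 19, where the magic algorithm looked slightly better even for `urem`, would still go to the chunk-based path under this rule; handling such exceptions would need a per-constant cost comparison.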

I plan to perform similar tests on AArch64 next since the initial investigation with `llvm-mca` suggests that the results of both approaches might be closer to each other and less clear-cut there.

https://github.com/llvm/llvm-project/pull/154968