[llvm] [SelectionDAG] Use Magic Algorithm for Splitting UDIV/UREM by Constant (PR #154968)
via llvm-commits
llvm-commits at lists.llvm.org
Fri Sep 5 10:24:45 PDT 2025
sharkautarch wrote:
@mskamp
> But an integration of both approaches would most likely require measurements and benchmarks to determine when to select which approach to get the best results.
I ran an x86-64 benchmark for just the `@umod128(i128 %x)` function in `llvm/test/CodeGen/X86/divmod128.ll`, comparing the two PRs' latency and inverse throughput on my Alder Lake Intel CPU using llvm-exegesis.
I'll refer to this PR as the "magic test" and the other PR as the "chunk test".
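For context, a minimal Python sketch of the general "magic number" technique the PR title refers to: division/remainder by a constant `d` is replaced by a widening multiply with a precomputed constant plus a shift. This uses the round-up variant of the constant; the function names are mine, and this is only an illustration of the idea, not the PR's actual SelectionDAG lowering (which splits the wide multiply into legal-width operations).

```python
def magic_constant(d: int, bits: int) -> tuple[int, int]:
    """Round-up magic constant for unsigned division of a `bits`-wide x by d."""
    assert d >= 1
    l = (d - 1).bit_length()           # l = ceil(log2(d))
    shift = bits + l
    m = ((1 << shift) + d - 1) // d    # m = ceil(2^(bits+l) / d)
    return m, shift

def magic_udivrem(x: int, d: int, bits: int = 128) -> tuple[int, int]:
    """Compute (x // d, x % d) using only a widening multiply, shift, and mul-sub."""
    m, shift = magic_constant(d, bits)
    q = (x * m) >> shift               # multiply-high + shift replaces the udiv
    return q, x - q * d                # urem recovered from the quotient

# Exhaustive check at a small width, to show the constant is correct for all inputs.
for d in range(1, 64):
    for x in range(1 << 8):
        assert magic_udivrem(x, d, bits=8) == (x // d, x % d)
```

The round-up constant satisfies `2^(bits+l) <= m*d <= 2^(bits+l) + 2^l`, which is enough to make the truncated product exact for every `bits`-wide input.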
magic test throughput asm file: https://gist.github.com/sharkautarch/99adfce423c3952bad11b8db0a8ba150#file-128bit_urem_magic_test-s
magic test latency asm file: https://gist.github.com/sharkautarch/ad17b4045ad0cba89ae80126fb0f5f3c#file-128bit_urem_magic_test_latency-s
chunk test throughput asm file: https://gist.github.com/sharkautarch/6385ddf31e241156e6e49d901141b694#file-128bit_urem_chunk_test-s
chunk test latency asm file: https://gist.github.com/sharkautarch/783a8ffb970c7db4d2ed6d38c629e113#file-128bit_urem_chunk_test_latency-s
Disclaimer: I didn't disable hyperthreading before running the benchmarks, and the machine I ran them on has both P and E cores. However, I *did* pin the tests to a single P core, and pinning the CPU seems to significantly reduce the noise from hyperthreading and the P/E core split.
---
magic throughput test: `llvm-exegesis --mode=inverse_throughput --snippets-file=128bit_urem_magic_test.s --repetition-mode=middle-half-loop --execution-mode=subprocess --benchmark-process-cpu=1`
full output (for one run): https://gist.github.com/sharkautarch/99a3aa70b6b04ba8dbd7b95c8b0b4a6f
`per_snippet_value` is about 9.0-9.2 cycles
chunk throughput test: `llvm-exegesis --mode=inverse_throughput --snippets-file=128bit_urem_chunk_test.s --repetition-mode=middle-half-loop --execution-mode=subprocess --benchmark-process-cpu=1`
full output (for one run): https://gist.github.com/sharkautarch/0931a87e622bd1dc811b0b75bd1dab13
`per_snippet_value` is about 6.52-6.58 cycles
---
magic latency test: `llvm-exegesis --mode=latency --snippets-file=128bit_urem_magic_test_latency.s --repetition-mode=middle-half-loop --execution-mode=subprocess --benchmark-process-cpu=1`
full output (for one run): https://gist.github.com/sharkautarch/76a33c941d095854065aa821a7fa9090
`per_snippet_value` is about 26-27.7 cycles
chunk latency test: `llvm-exegesis --mode=latency --snippets-file=128bit_urem_chunk_test_latency.s --repetition-mode=middle-half-loop --execution-mode=subprocess --benchmark-process-cpu=1`
full output (for one run): https://gist.github.com/sharkautarch/4126ca66c3bc1414fbeec3eff1b28548
`per_snippet_value` is about 19.60-19.71 cycles
This only compares the codegen for a single test function; I don't have time to benchmark other functions.
https://github.com/llvm/llvm-project/pull/154968