[llvm] [SelectionDAG] Use Magic Algorithm for Splitting UDIV/UREM by Constant (PR #154968)
via llvm-commits
llvm-commits at lists.llvm.org
Fri Sep 5 10:24:45 PDT 2025
sharkautarch wrote:
@mskamp
> But an integration of both approaches would most likely require measurements and benchmarks to determine when to select which approach to get the best results.
I ran an x86-64 benchmark for just the `@umod128(i128 %x)` function in `llvm/test/CodeGen/X86/divmod128.ll`, comparing the two PRs' latency and inverse throughput on my Alder Lake Intel CPU using llvm-exegesis.
I'll refer to this PR as the "magic test" and the other PR as the "chunk test".
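For context, a minimal Python sketch of the general "magic number" technique the PR title refers to: division/remainder by a constant `d` is replaced by a widening multiply with a precomputed constant plus a shift. This uses the round-up variant of the constant; the function names are mine, and this is only an illustration of the idea, not the PR's actual SelectionDAG lowering (which splits the wide multiply into legal-width operations).

```python
def magic_constant(d: int, bits: int) -> tuple[int, int]:
    """Round-up magic constant for unsigned division of a `bits`-wide x by d."""
    assert d >= 1
    l = (d - 1).bit_length()           # l = ceil(log2(d))
    shift = bits + l
    m = ((1 << shift) + d - 1) // d    # m = ceil(2^(bits+l) / d)
    return m, shift

def magic_udivrem(x: int, d: int, bits: int = 128) -> tuple[int, int]:
    """Compute (x // d, x % d) using only a widening multiply, shift, and mul-sub."""
    m, shift = magic_constant(d, bits)
    q = (x * m) >> shift               # multiply-high + shift replaces the udiv
    return q, x - q * d                # urem recovered from the quotient

# Exhaustive check at a small width, to show the constant is correct for all inputs.
for d in range(1, 64):
    for x in range(1 << 8):
        assert magic_udivrem(x, d, bits=8) == (x // d, x % d)
```

The round-up constant satisfies `2^(bits+l) <= m*d <= 2^(bits+l) + 2^l`, which is enough to make the truncated product exact for every `bits`-wide input.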
magic test throughput asm file: https://gist.github.com/sharkautarch/99adfce423c3952bad11b8db0a8ba150#file-128bit_urem_magic_test-s
magic test latency asm file: https://gist.github.com/sharkautarch/ad17b4045ad0cba89ae80126fb0f5f3c#file-128bit_urem_magic_test_latency-s
chunk test throughput asm file: https://gist.github.com/sharkautarch/6385ddf31e241156e6e49d901141b694#file-128bit_urem_chunk_test-s
chunk test latency asm file: https://gist.github.com/sharkautarch/783a8ffb970c7db4d2ed6d38c629e113#file-128bit_urem_chunk_test_latency-s
Disclaimer: I didn't disable hyperthreading before running the benchmarks, and the machine I ran them on has both P and E cores. However, I *did* pin the tests to a single P core, and pinning the CPU seems to significantly reduce the noise from hyperthreading and the P/E core split.
---
magic throughput test: `llvm-exegesis --mode=inverse_throughput --snippets-file=128bit_urem_magic_test.s --repetition-mode=middle-half-loop --execution-mode=subprocess --benchmark-process-cpu=1`
full output (for one run): https://gist.github.com/sharkautarch/99a3aa70b6b04ba8dbd7b95c8b0b4a6f
`per_snippet_value` is about 9.0-9.2 cycles
chunk throughput test: `llvm-exegesis --mode=inverse_throughput --snippets-file=128bit_urem_chunk_test.s --repetition-mode=middle-half-loop --execution-mode=subprocess --benchmark-process-cpu=1`
full output (for one run): https://gist.github.com/sharkautarch/0931a87e622bd1dc811b0b75bd1dab13
`per_snippet_value` is about 6.52-6.58 cycles
---
magic latency test: `llvm-exegesis --mode=latency --snippets-file=128bit_urem_magic_test_latency.s --repetition-mode=middle-half-loop --execution-mode=subprocess --benchmark-process-cpu=1`
full output (for one run): https://gist.github.com/sharkautarch/76a33c941d095854065aa821a7fa9090
`per_snippet_value` is about 26-27.7 cycles
chunk latency test: `llvm-exegesis --mode=latency --snippets-file=128bit_urem_chunk_test_latency.s --repetition-mode=middle-half-loop --execution-mode=subprocess --benchmark-process-cpu=1`
full output (for one run): https://gist.github.com/sharkautarch/4126ca66c3bc1414fbeec3eff1b28548
`per_snippet_value` is about 19.60-19.71 cycles
This only compares the codegen for a single test function; I don't have time to benchmark other functions.
https://github.com/llvm/llvm-project/pull/154968