[llvm] [CGP]: Optimize mul.overflow. (PR #148343)

Mon Jul 14 05:46:37 PDT 2025

https://github.com/davemgreen commented:

I ran an experiment about 6 months ago to try to measure what the average cost of a divide should be, given the distributions found in real code. It used DynamoRIO to interrupt the program whenever a divide was found and print the numerator and denominator. Across all of the llvm-test-suite + 3 x spec + some other benchmarks, about 22% of all divides were 1 divided by 1.

So I can fully imagine that i128 smulo operands are often biased towards low values, and that it is quite a bit more efficient to check for the upper halves being zero and jump straight to the "no overflow" path. GCC apparently considered it useful enough to implement, and we see cases where it performs better from using the expanded form. It sounds like we should maybe make it opt-in by the target for each type, as it might be better or worse depending on the relative performance (like whether there is a mulo instruction that sets flags, or whether it requires a libcall anyway). The asymmetric case that GCC runs can also probably be removed to cut down on the code size, and it looks like the signed case is more beneficial than the unsigned case. It might be worth focussing there to begin with.
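
For reference, here is a rough model of the fast path being discussed, written in Python so the arithmetic is easy to check. It is only a sketch of the idea, not the code in this PR or what GCC emits: the names `fits_in_i64` and `smulo_i128` are made up for illustration, and "upper half is zero" is read here, for the signed case, as "the upper half is just the sign extension of the lower half".

```python
# Hypothetical model of an i128 signed multiply-with-overflow that takes a
# cheap early exit when both operands are small. Names are illustrative only.

INT128_MIN = -(1 << 127)
INT128_MAX = (1 << 127) - 1


def fits_in_i64(x):
    # True when the upper 64 bits of the i128 value are only the sign
    # extension of the lower 64 bits, i.e. the value is representable as i64.
    return -(1 << 63) <= x <= (1 << 63) - 1


def smulo_i128(a, b):
    """Return (wrapped_product, overflowed) for two values in i128 range."""
    if fits_in_i64(a) and fits_in_i64(b):
        # Fast path: |a * b| <= 2**126, which always fits in i128, so a
        # single 64x64->128 multiply is enough and no overflow is possible.
        return a * b, False

    # Slow path: stands in for the full-width multiply plus overflow check.
    full = a * b
    overflowed = not (INT128_MIN <= full <= INT128_MAX)
    # Wrap to two's-complement i128, as the intrinsic's first result would.
    wrapped = ((full - INT128_MIN) % (1 << 128)) + INT128_MIN
    return wrapped, overflowed
```

With operands biased towards small values, as in the divide data above, most calls would return from the first branch. Whether that wins in practice still depends on how cheap the target's existing overflow check is.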

https://github.com/llvm/llvm-project/pull/148343