[llvm] [CGP]: Optimize mul.overflow. (PR #148343)

Alexis Engelke via llvm-commits llvm-commits at lists.llvm.org
Sun Jul 13 11:09:12 PDT 2025


aengelke wrote:

I don't think this is a good idea as currently implemented, and I'd like to see some benchmark results.

In our database system, we make heavy use of 128-bit mul-with-overflow, and we have spent some time optimizing this specific case:

- We branch for `(a.hi|b.hi)==0` to a fast inline path (as a single branch, on x86, `mov tmp, ahi; or tmp, bhi; jnz` is typically just one uop). For us, this is often beneficial, because most numbers are small. Generally, however, this branch might be hard to predict. I'd be very careful with inserting data-dependent branches without a thorough analysis with different workloads on different micro-architectures.
- We don't specialize for the case where just one upper half is zero. This happens too rarely to be worth the extra code size, compile time, and possible branch misses.
- For the complex case (one of the high parts is non-zero), we make an out-of-line call and do not use LLVM's inline expansion. This saves on code size and compile time; optimization-wise, the possible gains are very low: there's (almost) nothing that can be folded or hoisted, and the code needs a lot of registers anyway.
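To make the fast-path/slow-path split above concrete, here is a minimal C sketch. The names (`u128`, `mul_overflow`, `mul_overflow_slow`) are illustrative, not our actual code, and the "slow path" here is a portable stand-in for the out-of-line assembly/compiler-rt routine:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical two-halves representation of an unsigned 128-bit integer. */
typedef struct { uint64_t lo, hi; } u128;

/* Stand-in for the out-of-line routine: generic 128x128 multiply with
 * overflow detection, using 64x64->128 partial products. */
static bool mul_overflow_slow(u128 a, u128 b, u128 *res) {
    /* a*b = a.lo*b.lo + 2^64*(a.lo*b.hi + a.hi*b.lo) + 2^128*a.hi*b.hi */
    bool ovf = a.hi != 0 && b.hi != 0;              /* 2^128 term non-zero */
    unsigned __int128 c1 = (unsigned __int128)a.lo * b.hi;
    unsigned __int128 c2 = (unsigned __int128)a.hi * b.lo;
    ovf |= (uint64_t)(c1 >> 64) != 0;               /* cross term >= 2^64 */
    ovf |= (uint64_t)(c2 >> 64) != 0;
    unsigned __int128 low = (unsigned __int128)a.lo * b.lo;
    uint64_t cross = (uint64_t)c1 + (uint64_t)c2;
    ovf |= cross < (uint64_t)c1;                    /* cross-sum carry */
    uint64_t hi = (uint64_t)(low >> 64) + cross;
    ovf |= hi < cross;                              /* carry into bit 128 */
    res->lo = (uint64_t)low;
    res->hi = hi;
    return ovf;
}

static bool mul_overflow(u128 a, u128 b, u128 *res) {
    if ((a.hi | b.hi) == 0) {
        /* Fast path: a single 64x64->128 multiply can never overflow
         * 128 bits, so no overflow check is needed at all. */
        unsigned __int128 p = (unsigned __int128)a.lo * b.lo;
        res->lo = (uint64_t)p;
        res->hi = (uint64_t)(p >> 64);
        return false;
    }
    return mul_overflow_slow(a, b, res);            /* out-of-line call */
}
```

The single `(a.hi | b.hi) == 0` test is what maps to the `mov; or; jnz` sequence mentioned above.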

For x86-64, the full out-of-line function is hand-optimized assembly, which performs slightly better than LLVM's expansion: it uses one less register, exploits some x86-specific flags tricks, and has scheduling tuned for some recent microarchitectures. (Also note: GCC's expansion is (was?) horrible, with lots of data-dependent branches.) For AArch64, we use the compiler-rt function; we haven't felt the need to look into this more closely so far (most of our benchmarks target x86-64 :-) ).
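For reference, the expansion being discussed is what Clang/GCC emit for `__builtin_mul_overflow` on 128-bit operands; the unsigned case is expanded inline, while signed variants may lower to compiler-rt's `__muloti4` on some targets. A minimal reproducer (`mul_u128` is just an illustrative wrapper name):

```c
#include <stdbool.h>

/* Unsigned 128-bit multiply with overflow check; Clang/GCC expand this
 * inline, which is the codegen the PR and this comment are about. */
bool mul_u128(unsigned __int128 a, unsigned __int128 b,
              unsigned __int128 *res) {
    return __builtin_mul_overflow(a, b, res);
}
```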

https://github.com/llvm/llvm-project/pull/148343
