[llvm] [CGP]: Optimize mul.overflow. (PR #148343)

Fri Aug 1 23:12:50 PDT 2025

================
@@ -395,7 +446,21 @@ define { i128, i8 } @i128_checked_mul(i128 %x, i128 %y) {
 
 define { i128, i8 } @i128_overflowing_mul(i128 %x, i128 %y) {
 ; CHECK-LABEL: i128_overflowing_mul:
-; CHECK:       // %bb.0:
+; CHECK:       // %bb.0: // %overflow.entry
+; CHECK-NEXT:    cmp x1, x0, asr #63
+; CHECK-NEXT:    b.ne .LBB22_3
+; CHECK-NEXT:  // %bb.1: // %overflow.entry
+; CHECK-NEXT:    asr x8, x2, #63
+; CHECK-NEXT:    cmp x3, x8
+; CHECK-NEXT:    b.ne .LBB22_3
----------------
aengelke wrote:

1. Consider the case where the first operand is sometimes small and sometimes large but the second operand is always large. With this implementation, the first branch is likely to be mispredicted, even though it's always the slow path that ends up being executed.
2. Many out-of-order CPUs (e.g., recent Apple CPUs) can't decode across branches in the same cycle. Having just two instructions before a conditional branch reduces the throughput of the front-end.
3. I'm generally worried about introducing data-dependent branches that were not present in the original code; these can hurt performance when they are mispredicted frequently. We already make it hard for users to get branchless code if they want it (e.g., if they know that the condition is unpredictable), and we shouldn't add branches unless there's an extremely good reason.
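
To make the shape of the concern concrete, here is a rough source-level sketch of the two code shapes being compared. It is not the code the patch generates: the names `mul_overflow_plain`, `mul_overflow_fast_path` and `fits_in_i64` are made up for illustration, and it assumes the GCC/Clang extensions `__int128` and `__builtin_mul_overflow`. The two `if` conditions correspond to the `cmp`/`b.ne` pairs in the CHECK lines above, which test whether the high half of each operand is just the sign extension of its low half.

```cpp
#include <cstdint>

// Assumption: GCC/Clang extension type; not standard C++.
using i128 = __int128;

struct MulResult {
  i128 value;
  bool overflow;
};

// Baseline shape: always run the full overflow-checked multiply.
// No data-dependent branch is introduced here.
inline MulResult mul_overflow_plain(i128 x, i128 y) {
  i128 r;
  bool o = __builtin_mul_overflow(x, y, &r);
  return {r, o};
}

// True if v equals the sign extension of its low 64 bits, i.e. the
// high half carries no information (hypothetical helper).
inline bool fits_in_i64(i128 v) {
  return v == static_cast<i128>(static_cast<int64_t>(v));
}

// Fast-path shape: two data-dependent branches ahead of the multiply.
// If the first operand varies between small and large while the second
// is always large, the first branch is hard to predict even though the
// slow path runs every time.
inline MulResult mul_overflow_fast_path(i128 x, i128 y) {
  if (fits_in_i64(x)) {
    if (fits_in_i64(y)) {
      // The product of two 64-bit signed values always fits in 128 bits,
      // so no overflow check is needed on this path.
      i128 r = static_cast<i128>(static_cast<int64_t>(x)) *
               static_cast<int64_t>(y);
      return {r, false};
    }
  }
  return mul_overflow_plain(x, y);
}
```

Both functions return the same results for all inputs; the difference is only in the control flow the CPU has to predict, which is what the three points above are about.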

https://github.com/llvm/llvm-project/pull/148343