[llvm] [AArch64] Enable CmpBcc fusion for Neoverse-v2 (PR #90608)

David Green via llvm-commits llvm-commits at lists.llvm.org
Fri May 3 00:25:21 PDT 2024


davemgreen wrote:

I believe that changes like this would be perfectly fine, and generally be an improvement:
```
--- a/llvm/test/CodeGen/AArch64/complex-deinterleaving-reductions-scalable.ll                               
+++ b/llvm/test/CodeGen/AArch64/complex-deinterleaving-reductions-scalable.ll                               
@@ -340,10 +340,10 @@ define dso_local %"class.std::complex" @reduction_mix(ptr %a, ptr %b, ptr noalia      
 ; CHECK-NEXT:    add x0, x0, x11                                                                           
 ; CHECK-NEXT:    ld1w { z5.d }, p0/z, [x3, x8, lsl #2]                                                     
 ; CHECK-NEXT:    add x8, x8, x9                                                                            
-; CHECK-NEXT:    cmp x10, x8                                                                               
 ; CHECK-NEXT:    fadd z0.d, z4.d, z0.d                                                                     
 ; CHECK-NEXT:    fadd z1.d, z3.d, z1.d                                                                     
 ; CHECK-NEXT:    add z2.d, z5.d, z2.d                                                                      
+; CHECK-NEXT:    cmp x10, x8                                                                               
 ; CHECK-NEXT:    b.ne .LBB3_1                                                                              
 ; CHECK-NEXT:  // %bb.2: // %middle.block                                                                  
 ; CHECK-NEXT:    uaddv d2, p0, z2.d                                                                        
```
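
For reference, "CmpBcc fusion" just tells the scheduler to keep a flag-setting compare back-to-back with the conditional branch that consumes it, which is what moves the `cmp` down next to the `b.ne` above. A minimal standalone sketch of that pairing rule, as a toy model rather than the actual AArch64MacroFusion.cpp predicate (the opcode strings mirror real AArch64 MI opcode names, but the helper functions are made up for illustration):
```
#include <cassert>
#include <string_view>

// Toy opcode check: is this a flag-setting compare? (cmp is an alias of subs.)
static bool isFlagSettingCmp(std::string_view Opc) {
  return Opc == "SUBSWrr" || Opc == "SUBSXrr" ||   // cmp w/x, w/x
         Opc == "SUBSWri" || Opc == "SUBSXri";     // cmp w/x, #imm
}

// Toy fusion predicate: only glue a compare directly onto the conditional
// branch that reads the flags it just set.
static bool shouldFuseCmpBcc(std::string_view First, std::string_view Second) {
  return isFlagSettingCmp(First) && Second == "Bcc";
}

int main() {
  // cmp x10, x8 ; b.ne  -> a fusion candidate, so the scheduler tries to keep
  // the pair adjacent (as in the diff above).
  assert(shouldFuseCmpBcc("SUBSXrr", "Bcc"));
  // An unrelated predecessor is left to the normal scheduling heuristics.
  assert(!shouldFuseCmpBcc("FADDv2f64", "Bcc"));
  return 0;
}
```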

It is the cases where we end up spilling extra registers, or simply get extra instructions, that worry me:
```
--- a/llvm/test/CodeGen/AArch64/setcc_knownbits.ll                                                    
+++ b/llvm/test/CodeGen/AArch64/setcc_knownbits.ll                                                    
@@ -27,9 +27,10 @@ define noundef i1 @logger(i32 noundef %logLevel, ptr %ea, ptr %pll) {              
 ;                                                                                                    
 ; CHECK-GI-LABEL: logger:                                                                            
 ; CHECK-GI:       // %bb.0: // %entry                                                                
-; CHECK-GI-NEXT:    ldr w8, [x2]                                                                     
-; CHECK-GI-NEXT:    cmp w8, w0                                                                       
+; CHECK-GI-NEXT:    ldr w9, [x2]                                                                     
+; CHECK-GI-NEXT:    mov w8, w0                                                                       
 ; CHECK-GI-NEXT:    mov w0, wzr                                                                      
+; CHECK-GI-NEXT:    cmp w9, w8                                                                       
 ; CHECK-GI-NEXT:    b.hi .LBB1_2                                                                     
 ; CHECK-GI-NEXT:  // %bb.1: // %land.rhs                                                             
 ; CHECK-GI-NEXT:    ldr x8, [x1]                                                                     
```

I'm not against this change if you have run plenty of benchmarks and the performance is improving overall. It is certainly a sensible thing to do on paper. Maybe the example above isn't great (it's just a mov), but hopefully you get the idea about it potentially causing extra register pressure and extra spills. When I compile the llvm-test-suite plus three versions of SPEC, the _instruction_ count, notably, increases by 0.1% overall. Everything changes, so there might be some noise going on, but ideally a scheduling change wouldn't increase the instruction count. (The code size also increases slightly.)

I was wondering whether we could come up with a slightly less aggressive version of macro-fusion that gets the benefits without the increase in register pressure. It looks like the scheduler already tries to treat it that way at the moment, with the fused pair not being the top priority inside GenericScheduler::tryCandidate. Fusing to the ExitSU might be what is causing difficulties.
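
Concretely, the pre-RA ordering means a fused (clustered) pair only wins a tie once the register-pressure checks have passed. A rough standalone sketch of that priority order, paraphrasing GenericScheduler::tryCandidate rather than quoting it (the struct fields and function name below are invented, several real heuristics such as physreg biasing, stall and latency checks are omitted; only the relative order of the pressure and cluster checks is taken from the real code):
```
#include <cstdio>

// Toy stand-in for a scheduling candidate, with only the fields needed to
// illustrate the ordering of heuristics.
struct Candidate {
  int PressureExcess;    // would this pick push a register class over its limit?
  int CriticalPressure;  // does it raise the max pressure in the region?
  bool IsFusionPartner;  // is it the other half of a fused (clustered) pair?
  int SourceOrder;       // fallback: original instruction order
};

// Register-pressure checks are consulted before the "keep clustered/fused
// nodes together" heuristic, so a fusion edge does not override a pick that
// avoids extra pressure or spills.
static bool preferTryOverCand(const Candidate &Try, const Candidate &Cand) {
  if (Try.PressureExcess != Cand.PressureExcess)
    return Try.PressureExcess < Cand.PressureExcess;      // RegExcess
  if (Try.CriticalPressure != Cand.CriticalPressure)
    return Try.CriticalPressure < Cand.CriticalPressure;  // RegCritical
  if (Try.IsFusionPartner != Cand.IsFusionPartner)
    return Try.IsFusionPartner;                           // Cluster
  return Try.SourceOrder < Cand.SourceOrder;              // NodeOrder
}

int main() {
  // The branch's fusion partner loses to a candidate that avoids exceeding
  // the register limit: the "less aggressive" behaviour already present
  // pre-RA.
  Candidate FusedCmp  = {/*Excess*/1, /*Critical*/0, /*Fusion*/true,  /*Order*/1};
  Candidate OtherInst = {/*Excess*/0, /*Critical*/0, /*Fusion*/false, /*Order*/2};
  std::printf("prefer other over fused cmp: %d\n",
              preferTryOverCand(OtherInst, FusedCmp));
  return 0;
}
```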

So I think this change is OK, so long as you have run plenty of benchmarks. The results I have are not better, but not a lot worse either, and that could be chalked up to being within the noise margin given how many things change. It might be worth investigating whether there is a weaker form we could invent that becomes more aggressive again post-RA.

https://github.com/llvm/llvm-project/pull/90608

