[PATCH] D38128: Handle COPYs of physregs better (regalloc hints)

Eli Friedman via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Oct 16 14:09:24 PDT 2017


efriedma added inline comments.


================
Comment at: test/CodeGen/ARM/swifterror.ll:350
+; CHECK-APPLE: mov r0, r8
+; CHECK-APPLE: cmp r0, #0
 ; Access part of the error object and save it to error_ref
----------------
jonpa wrote:
> efriedma wrote:
> > This is... not really great.  I mean, it's the same number of instructions, but you're increasing the latency by making the cmp refer to r0 rather than r8.  Do you know why this is happening?
> master                                                         patched
> 
> Register allocation input:
> 
> 
> ```
> ********** MACHINEINSTRS **********                             ********** MACHINEINSTRS **********
> ...
> 224B            BL_pred <ga:@foo_vararg>, pred:14, pred:        224B            BL_pred <ga:@foo_vararg>, pred:14, pred:
> 240B            ADJCALLSTACKUP 0, 0, pred:14, pred:%nore        240B            ADJCALLSTACKUP 0, 0, pred:14, pred:%nore
> 256B            %vreg0<def> = COPY %R8<kill>; GPR:%vreg0        256B            %vreg0<def> = COPY %R8<kill>; GPR:%vreg0
> 304B            CMPri %vreg0, 0, pred:14, pred:%noreg, %        304B            CMPri %vreg0, 0, pred:14, pred:%noreg, %
> 320B            Bcc <BB#2>, pred:1, pred:%CPSR<kill>            320B            Bcc <BB#2>, pred:1, pred:%CPSR<kill>
> 336B            B <BB#1>                                        336B            B <BB#1>
>             Successors according to CFG: BB#2(0x50000000                    Successors according to CFG: BB#2(0x50000000
> 
> 352B    BB#1: derived from LLVM BB %cont                        352B    BB#1: derived from LLVM BB %cont
>             Predecessors according to CFG: BB#0                             Predecessors according to CFG: BB#0
> 368B            %vreg10<def> = LDRBi12 %vreg0, 8, pred:1        368B            %vreg10<def> = LDRBi12 %vreg0, 8, pred:1
> 384B            STRBi12 %vreg10, %vreg1, 0, pred:14, pre        384B            STRBi12 %vreg10, %vreg1, 0, pred:14, pre
>             Successors according to CFG: BB#2(?%)                           Successors according to CFG: BB#2(?%)
> 
> 400B    BB#2: derived from LLVM BB %handler                     400B    BB#2: derived from LLVM BB %handler
>             Predecessors according to CFG: BB#0 BB#1                        Predecessors according to CFG: BB#0 BB#1
> 416B            ADJCALLSTACKDOWN 0, 0, pred:14, pred:%no        416B            ADJCALLSTACKDOWN 0, 0, pred:14, pred:%no
> 432B            %R0<def> = COPY %vreg0; GPR:%vreg0              432B            %R0<def> = COPY %vreg0; GPR:%vreg0
> 448B            BL <ga:@free>, <regmask %LR %D8 %D9 %D10        448B            BL <ga:@free>, <regmask %LR %D8 %D9 %D10
> ...
> ```
> 
> 
> ```
> selectOrSplit GPR:%vreg0 [256r,432r:0)  0 at 256r w=5.92840        selectOrSplit GPR:%vreg0 [256r,432r:0)  0 at 256r w=5.92840
> hints: %R8                                                 |    hints: %R0 %R8
> assigning %vreg0 to %R8: R8 [256r,432r:0)  0 at 256r          |    assigning %vreg0 to %R0: R0 [256r,432r:0)  0 at 256r
> 
> ```
> %vreg0 now has two COPY hints, and I am guessing that they have the same weight, but for no apparent reason %R0 is now hinted before %R8, while only %R8 is hinted on master.
> 
> 
> ```
> ********** REWRITE VIRTUAL REGISTERS **********                 ********** REWRITE VIRTUAL REGISTERS **********
> ********** Function: caller4                                    ********** Function: caller4
> ********** REGISTER MAP **********                              ********** REGISTER MAP **********
> [%vreg0 -> %R8] GPR                                        |    [%vreg0 -> %R0] GPR
> ...
> 
> ```
> Not sure whether it is clear that coalescing with %R8 is generally better than coalescing with %R0.
> 
> 
> ```
> # After Thumb2 instruction size reduction pass:                 # After Thumb2 instruction size reduction pass:
> 
> BB#0: derived from LLVM BB %entry                               BB#0: derived from LLVM BB %entry
>     Live Ins: %R0 %R8 %R4 %LR                                       Live Ins: %R0 %R8 %R4 %LR
> ...
>         %R2<def> = MOVi 12, pred:14, pred:%noreg, opt:%n                %R2<def> = MOVi 12, pred:14, pred:%noreg, opt:%n
>         BL_pred <ga:@foo_vararg>, pred:14, pred:%noreg,    |            BL_pred <ga:@foo_vararg>, pred:14, pred:%noreg, 
>         CMPri %R8, 0, pred:14, pred:%noreg, %CPSR<imp-de   |            %R0<def> = MOVr %R8<kill>, pred:14, pred:%noreg,
>                                                            >            CMPri %R0, 0, pred:14, pred:%noreg, %CPSR<imp-de
>         Bcc <BB#2>, pred:1, pred:%CPSR<kill>                            Bcc <BB#2>, pred:1, pred:%CPSR<kill>
>     Successors according to CFG: BB#2(0x50000000 / 0x800            Successors according to CFG: BB#2(0x50000000 / 0x800
> 
> BB#1: derived from LLVM BB %cont                                BB#1: derived from LLVM BB %cont
>     Live Ins: %R4 %R8                                      |        Live Ins: %R0 %R4
>     Predecessors according to CFG: BB#0                             Predecessors according to CFG: BB#0
>         %R0<def> = LDRBi12 %R8, 8, pred:14, pred:%noreg;   |            %R1<def> = LDRBi12 %R0, 8, pred:14, pred:%noreg;
>         STRBi12 %R0<kill>, %R4<kill>, 0, pred:14, pred:%   |            STRBi12 %R1<kill>, %R4<kill>, 0, pred:14, pred:%
>     Successors according to CFG: BB#2(?%)                           Successors according to CFG: BB#2(?%)
> 
> BB#2: derived from LLVM BB %handler                             BB#2: derived from LLVM BB %handler
>     Live Ins: %R8                                          |        Live Ins: %R0
>     Predecessors according to CFG: BB#0 BB#1                        Predecessors according to CFG: BB#0 BB#1
>         %R0<def> = MOVr %R8<kill>, pred:14, pred:%noreg,   <
>         BL <ga:@free>, <regmask %LR %D8 %D9 %D10 %D11 %D                BL <ga:@free>, <regmask %LR %D8 %D9 %D10 %D11 %D
> ...
> 
> # After If Converter:                                           # After If Converter:
> 
> BB#0: derived from LLVM BB %entry                               BB#0: derived from LLVM BB %entry
> ...
>         %R2<def> = MOVi 12, pred:14, pred:%noreg, opt:%n                %R2<def> = MOVi 12, pred:14, pred:%noreg, opt:%n
>         BL_pred <ga:@foo_vararg>, pred:14, pred:%noreg,                 BL_pred <ga:@foo_vararg>, pred:14, pred:%noreg, 
>         CMPri %R8, 0, pred:14, pred:%noreg, %CPSR<imp-de   <
>         %R0<def> = LDRBi12 %R8, 8, pred:0, pred:%CPSR; m   <
>         STRBi12 %R0<kill>, %R4<kill>, 0, pred:0, pred:%C   <
>         %R0<def> = MOVr %R8<kill>, pred:14, pred:%noreg,                %R0<def> = MOVr %R8<kill>, pred:14, pred:%noreg,
>                                                            >            CMPri %R0, 0, pred:14, pred:%noreg, %CPSR<imp-de
>                                                            >            %R1<def> = LDRBi12 %R0, 8, pred:0, pred:%CPSR; m
>                                                            >            STRBi12 %R1<kill>, %R4<kill>, 0, pred:0, pred:%C
>         BL <ga:@free>, <regmask %LR %D8 %D9 %D10 %D11 %D                BL <ga:@free>, <regmask %LR %D8 %D9 %D10 %D11 %D
>         %R0<def> = MOVi 1065353216, pred:14, pred:%noreg                %R0<def> = MOVi 1065353216, pred:14, pred:%noreg
>         %SP<def> = ADDri %SP<kill>, 16, pred:14, pred:%n                %SP<def> = ADDri %SP<kill>, 16, pred:14, pred:%n
>         %SP<def,tied1> = LDMIA_RET %SP<tied0>, pred:14,                 %SP<def,tied1> = LDMIA_RET %SP<tied0>, pred
> 
> _caller4:                               _caller4:
> @ BB#0:                                 @ BB#0:                              
>         push    {r4, r8, lr}                    push    {r4, r8, lr}
>         sub     sp, sp, #16                     sub     sp, sp, #16
>         mov     r4, r0                          mov     r4, r0
>         mov     r0, #11                         mov     r0, #11
>         str     r0, [sp, #4]                    str     r0, [sp, #4]
>         mov     r0, #10                         mov     r0, #10
>         str     r0, [sp, #8]                    str     r0, [sp, #8]
>         mov     r0, #12                         mov     r0, #12
>         str     r0, [sp]                        str     r0, [sp]
>         mov     r8, #0                          mov     r8, #0
>         mov     r0, #10                         mov     r0, #10
>         mov     r1, #11                         mov     r1, #11
>         mov     r2, #12                         mov     r2, #12
>         bl      _foo_vararg                     bl      _foo_vararg
>         cmp     r8, #0                <
>         ldrbeq  r0, [r8, #8]          <
>         strbeq  r0, [r4]              <
>         mov     r0, r8                          mov     r0, r8
>                                       >         cmp     r0, #0
>                                       >         ldrbeq  r1, [r0, #8]
>                                       >         strbeq  r1, [r4]
>         bl      _free                           bl      _free
>         mov     r0, #1065353216                 mov     r0, #1065353216
>         add     sp, sp, #16                     add     sp, sp, #16
>         pop     {r4, r8, pc}                    pop     {r4, r8, pc}
> ```
> 
mov+cmp generally has worse latency than cmp+mov on superscalar CPUs, because the cmp now has to wait for the mov's result instead of issuing alongside it; the exception is something like very recent x86 CPUs, which have tricks to hide the cost of the mov.
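
To make that concrete, here is a toy dependence-chain model (my own sketch, nothing to do with the actual scheduler or any LLVM code) that assumes unit latency, unlimited issue width, and only read-after-write register dependences; it shows the chain around the compare going from one cycle to two once the cmp is rewritten to use r0:

```
// Toy model: unit latency, unlimited issue width, only true (RAW) register
// dependences. The register names mirror the ARM sequences quoted above.
#include <algorithm>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

struct Inst {
  std::string Def;               // register written ("" for cmp, which only sets flags)
  std::vector<std::string> Uses; // registers read
};

// Cycle in which the last instruction of the sequence completes.
static int finishCycle(const std::vector<Inst> &Seq) {
  std::map<std::string, int> Ready; // cycle at which each register value is ready
  int Done = 0;
  for (const Inst &I : Seq) {
    int Start = 0;
    for (const std::string &U : I.Uses)
      if (Ready.count(U))
        Start = std::max(Start, Ready[U]);
    Done = Start + 1; // unit latency
    if (!I.Def.empty())
      Ready[I.Def] = Done;
  }
  return Done;
}

int main() {
  // master:  cmp r8, #0 ; mov r0, r8  -- independent, can issue together
  std::vector<Inst> CmpThenMov = {{"", {"r8"}}, {"r0", {"r8"}}};
  // patched: mov r0, r8 ; cmp r0, #0  -- the cmp has to wait for the mov
  std::vector<Inst> MovThenCmp = {{"r0", {"r8"}}, {"", {"r0"}}};
  std::printf("cmp r8 then mov r0,r8 (master):  done at cycle %d\n",
              finishCycle(CmpThenMov)); // 1
  std::printf("mov r0,r8 then cmp r0 (patched): done at cycle %d\n",
              finishCycle(MovThenCmp)); // 2
}
```

Real cores obviously differ; the point is just that the compare (and the predicated load/store behind it) now sits behind the copy in the dependence graph.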

I don't know enough about the register allocator to say if there's some existing code that's supposed to handle this sort of thing.
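
For what it's worth, here is a minimal standalone sketch of the tie-break problem jonpa's dumps show. It is my own simplified model, not LLVM's actual hint data structures or weight calculation: if both COPY hints end up with the same weight and the ordering is stable on weight alone, whichever physreg was recorded first gets tried first, so %R0 can win over %R8.

```
// Hypothetical, simplified model of copy-hint ordering; the real allocator's
// hint handling is more involved, this only illustrates the equal-weight tie.
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

struct Hint {
  std::string PhysReg; // e.g. "R0" or "R8"
  float Weight;        // copy weight; identical for both hints in caller4
};

int main() {
  // With the patch, the COPY into %R0 before the call to @free also records a
  // hint, so %vreg0 ends up with two hints of the same weight (value made up
  // to loosely mirror the w=5.92840 in the dump above).
  std::vector<Hint> Hints = {{"R0", 5.9f}, {"R8", 5.9f}};

  // A stable sort by weight alone keeps the insertion order for ties, so the
  // allocator tries R0 first even though assigning %vreg0 to R8 (its def) is
  // what the old single-hint behaviour produced.
  std::stable_sort(Hints.begin(), Hints.end(),
                   [](const Hint &A, const Hint &B) { return A.Weight > B.Weight; });

  for (const Hint &H : Hints)
    std::printf("hint: %%%s (weight %.1f)\n", H.PhysReg.c_str(), H.Weight);
}
```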


https://reviews.llvm.org/D38128




