[PATCH] D124743: [DAGCombine] Add node in the worklist in topological order in CombineTo

Amaury SECHET via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri May 6 17:16:15 PDT 2022


deadalnix added inline comments.


================
Comment at: llvm/test/CodeGen/AArch64/arm64-abi-varargs.ll:23
+; CHECK-NEXT:    str w9, [sp, #16]
+; CHECK-NEXT:    ldr w9, [x8], #8
 ; CHECK-NEXT:    str x8, [sp, #24]
----------------
Pre DAG:
```
SelectionDAG has 55 nodes:
  t0: ch = EntryToken
  t19: i32,ch = load<(load (s32) from %fixed-stack.1, align 16)> t0, FrameIndex:i64<-1>, undef:i64
    t2: i64,ch = CopyFromReg t0, Register:i64 %0
  t51: ch = store<(store (s32) into %ir.a1)> t19:1, t19, t2, undef:i64
  t98: i32,ch = load<(load (s32) from %fixed-stack.0, align 8)> t51, FrameIndex:i64<-2>, undef:i64
    t84: i64 = add FrameIndex:i64<-2>, Constant:i64<8>
  t109: i32,ch = load<(load (s32) from %fixed-stack.0 + 8, align 8)> t51, t84, undef:i64
    t108: i64 = add FrameIndex:i64<-2>, Constant:i64<16>
  t121: i32,ch = load<(load (s32) from %fixed-stack.0 + 16, align 8)> t51, t108, undef:i64
      t128: ch = store<(store (s32) into %ir.a12)> t0, t121, FrameIndex:i64<12>, undef:i64
        t120: i64 = add FrameIndex:i64<-2>, Constant:i64<24>
      t116: ch = store<(store (s64) into %ir.args)> t0, t120, FrameIndex:i64<9>, undef:i64
      t124: ch = store<(store (s32) into %ir.a11)> t0, t109, FrameIndex:i64<11>, undef:i64
      t131: ch = store<(store (s32) into %ir.a10)> t0, t98, FrameIndex:i64<10>, undef:i64
        t16: i32,ch = CopyFromReg t0, Register:i32 %7
      t53: ch = store<(store (s32) into %ir.8)> t0, t16, FrameIndex:i64<7>, undef:i64
        t14: i32,ch = CopyFromReg t0, Register:i32 %6
      t56: ch = store<(store (s32) into %ir.7)> t0, t14, FrameIndex:i64<6>, undef:i64
        t12: i32,ch = CopyFromReg t0, Register:i32 %5
      t59: ch = store<(store (s32) into %ir.6)> t0, t12, FrameIndex:i64<5>, undef:i64
        t10: i32,ch = CopyFromReg t0, Register:i32 %4
      t62: ch = store<(store (s32) into %ir.5)> t0, t10, FrameIndex:i64<4>, undef:i64
        t8: i32,ch = CopyFromReg t0, Register:i32 %3
      t65: ch = store<(store (s32) into %ir.4)> t0, t8, FrameIndex:i64<3>, undef:i64
        t6: i32,ch = CopyFromReg t0, Register:i32 %2
      t68: ch = store<(store (s32) into %ir.3)> t0, t6, FrameIndex:i64<2>, undef:i64
        t4: i32,ch = CopyFromReg t0, Register:i32 %1
      t71: ch = store<(store (s32) into %ir.2)> t0, t4, FrameIndex:i64<1>, undef:i64
    t134: ch = TokenFactor t128, t121:1, t116, t124, t109:1, t131, t98:1, t53, t56, t59, t62, t65, t68, t71
  t50: ch = AArch64ISD::RET_FLAG t134
```

Post DAG:
```
SelectionDAG has 51 nodes:
  t0: ch = EntryToken
  t19: i32,ch = load<(load (s32) from %fixed-stack.1, align 16)> t0, FrameIndex:i64<-1>, undef:i64
    t2: i64,ch = CopyFromReg t0, Register:i64 %0
  t51: ch = store<(store (s32) into %ir.a1)> t19:1, t19, t2, undef:i64
  t98: i32,ch = load<(load (s32) from %fixed-stack.0, align 8)> t51, FrameIndex:i64<-2>, undef:i64
    t84: i64 = add FrameIndex:i64<-2>, Constant:i64<8>
  t109: i32,i64,ch = load<(load (s32) from %fixed-stack.0 + 8, align 8), <post-inc>> t51, t84, Constant:i64<8>
  t120: i32,i64,ch = load<(load (s32)), <post-inc>> t51, t109:1, Constant:i64<8>
      t126: ch = store<(store (s32) into %ir.a12)> t0, t120, FrameIndex:i64<12>, undef:i64
      t115: ch = store<(store (s64) into %ir.args)> t0, t120:1, FrameIndex:i64<9>, undef:i64
      t122: ch = store<(store (s32) into %ir.a11)> t0, t109, FrameIndex:i64<11>, undef:i64
      t129: ch = store<(store (s32) into %ir.a10)> t0, t98, FrameIndex:i64<10>, undef:i64
        t16: i32,ch = CopyFromReg t0, Register:i32 %7
      t53: ch = store<(store (s32) into %ir.8)> t0, t16, FrameIndex:i64<7>, undef:i64
        t14: i32,ch = CopyFromReg t0, Register:i32 %6
      t56: ch = store<(store (s32) into %ir.7)> t0, t14, FrameIndex:i64<6>, undef:i64
        t12: i32,ch = CopyFromReg t0, Register:i32 %5
      t59: ch = store<(store (s32) into %ir.6)> t0, t12, FrameIndex:i64<5>, undef:i64
        t10: i32,ch = CopyFromReg t0, Register:i32 %4
      t62: ch = store<(store (s32) into %ir.5)> t0, t10, FrameIndex:i64<4>, undef:i64
        t8: i32,ch = CopyFromReg t0, Register:i32 %3
      t65: ch = store<(store (s32) into %ir.4)> t0, t8, FrameIndex:i64<3>, undef:i64
        t6: i32,ch = CopyFromReg t0, Register:i32 %2
      t68: ch = store<(store (s32) into %ir.3)> t0, t6, FrameIndex:i64<2>, undef:i64
        t4: i32,ch = CopyFromReg t0, Register:i32 %1
      t71: ch = store<(store (s32) into %ir.2)> t0, t4, FrameIndex:i64<1>, undef:i64
    t132: ch = TokenFactor t126, t120:2, t115, t122, t109:2, t129, t98:1, t53, t56, t59, t62, t65, t68, t71
  t50: ch = AArch64ISD::RET_FLAG t132
```

It is hard to argue with this one that the DAG after this change to the combiner is worse, it definitively seems better to me.


================
Comment at: llvm/test/CodeGen/AArch64/swifterror.ll:946
+; CHECK-APPLE-AARCH64-NEXT:    str w8, [sp, #16]
+; CHECK-APPLE-AARCH64-NEXT:    ldr w8, [x9], #8
 ; CHECK-APPLE-AARCH64-NEXT:    fmov s0, #1.00000000
----------------
Pre DAG:
```
SelectionDAG has 41 nodes:
      t0: ch = EntryToken
    t6: ch,glue = callseq_start t0, TargetConstant:i64<0>, TargetConstant:i64<0>
  t10: ch,glue = CopyToReg t6, Register:i64 $x0, Constant:i64<16>
  t13: ch,glue = AArch64ISD::CALL t10, TargetGlobalAddress:i64<i8* (i64)* @malloc> 0, Register:i64 $x0, RegisterMask:Untyped, t10:1
  t14: ch,glue = callseq_end t13, TargetConstant:i64<0>, TargetConstant:i64<0>, t13:1
  t15: i64,ch,glue = CopyFromReg t14, Register:i64 $x0, t14:1
  t17: ch = CopyToReg t15:1, Register:i64 %1, t15
    t19: i64 = add nuw t15, Constant:i64<8>
  t45: ch = store<(store (s8) into %ir.tmp), trunc to i8> t17, Constant:i32<1>, t19, undef:i64
      t100: ch = store<(store (s32) into %ir.a12)> t17, t97, FrameIndex:i64<3>, undef:i64
        t95: i64 = add FrameIndex:i64<-1>, Constant:i64<24>
      t90: ch = store<(store (s64) into %ir.args)> t17, t95, FrameIndex:i64<0>, undef:i64
      t103: ch = store<(store (s32) into %ir.a11)> t17, t83, FrameIndex:i64<2>, undef:i64
      t107: ch = store<(store (s32) into %ir.a10)> t17, t71, FrameIndex:i64<1>, undef:i64
    t110: ch = TokenFactor t100, t97:1, t90, t103, t83:1, t107, t71:1
  t40: ch,glue = CopyToReg t110, Register:f32 $s0, ConstantFP:f32<1.000000e+00>
  t42: ch,glue = CopyToReg t40, Register:i64 $x21, Register:i64 %1, t40:1
  t71: i32,ch = load<(load (s32) from %fixed-stack.0, align 16)> t45, FrameIndex:i64<-1>, undef:i64
    t69: i64 = or FrameIndex:i64<-1>, Constant:i64<8>
  t83: i32,ch = load<(load (s32), align 8)> t45, t69, undef:i64
    t81: i64 = add FrameIndex:i64<-1>, Constant:i64<16>
  t97: i32,ch = load<(load (s32) from %fixed-stack.0 + 16, align 16)> t45, t81, undef:i64
  t43: ch = AArch64ISD::RET_FLAG t42, Register:f32 $s0, Register:i64 $x21, t42:1
```

Post DAG:
```
SelectionDAG has 38 nodes:
      t0: ch = EntryToken
    t6: ch,glue = callseq_start t0, TargetConstant:i64<0>, TargetConstant:i64<0>
  t10: ch,glue = CopyToReg t6, Register:i64 $x0, Constant:i64<16>
  t13: ch,glue = AArch64ISD::CALL t10, TargetGlobalAddress:i64<i8* (i64)* @malloc> 0, Register:i64 $x0, RegisterMask:Untyped, t10:1
  t14: ch,glue = callseq_end t13, TargetConstant:i64<0>, TargetConstant:i64<0>, t13:1
  t15: i64,ch,glue = CopyFromReg t14, Register:i64 $x0, t14:1
  t17: ch = CopyToReg t15:1, Register:i64 %1, t15
    t19: i64 = add nuw t15, Constant:i64<8>
  t45: ch = store<(store (s8) into %ir.tmp), trunc to i8> t17, Constant:i32<1>, t19, undef:i64
      t99: ch = store<(store (s32) into %ir.a12)> t17, t97, FrameIndex:i64<3>, undef:i64
      t90: ch = store<(store (s64) into %ir.args)> t17, t97:1, FrameIndex:i64<0>, undef:i64
      t102: ch = store<(store (s32) into %ir.a11)> t17, t84, FrameIndex:i64<2>, undef:i64
      t106: ch = store<(store (s32) into %ir.a10)> t17, t70, FrameIndex:i64<1>, undef:i64
    t109: ch = TokenFactor t99, t97:2, t90, t102, t84:2, t106, t70:1
  t40: ch,glue = CopyToReg t109, Register:f32 $s0, ConstantFP:f32<1.000000e+00>
  t42: ch,glue = CopyToReg t40, Register:i64 $x21, Register:i64 %1, t40:1
  t70: i32,ch = load<(load (s32) from %fixed-stack.0, align 16)> t45, FrameIndex:i64<-1>, undef:i64
    t73: i64 = or FrameIndex:i64<-1>, Constant:i64<8>
  t84: i32,i64,ch = load<(load (s32), align 8), <post-inc>> t45, t73, Constant:i64<8>
  t97: i32,i64,ch = load<(load (s32)), <post-inc>> t45, t84:1, Constant:i64<8>
  t43: ch = AArch64ISD::RET_FLAG t42, Register:f32 $s0, Register:i64 $x21, t42:1
```

Once again, this looks like it is improving things at the DAGCombine level, but trips up later stages in the backend.


================
Comment at: llvm/test/CodeGen/ARM/swifterror.ll:687
+; CHECK-APPLE-NEXT:    add r0, r0, #4
+; CHECK-APPLE-NEXT:    ldr r2, [r7, #8]
+; CHECK-APPLE-NEXT:    ldr r1, [r0], #4
----------------
efriedma wrote:
> deadalnix wrote:
> > deadalnix wrote:
> > > RKSimon wrote:
> > > > This looks like another lost multiple load/store ?
> > > Yes, this looks similar to what is happening to arm64-abi-varargs.ll .
> > Investigating this a bit more, it looks like we get some crazy combine happening when CombineTo work in topological order that are missed otherwise.
> > 
> > The dag looks like this with topological CombineTo:
> > ```
> > SelectionDAG has 58 nodes:
> >   t0: ch = EntryToken
> >   t5: i32,ch = CopyFromReg t0, Register:i32 %1
> >   t9: i32 = add FrameIndex:i32<-1>, Constant:i32<4>
> >   t11: i32,ch = CopyFromReg t0, Register:i32 %2
> >   t15: i32,ch = CopyFromReg t0, Register:i32 %3
> >   t19: i32,ch = CopyFromReg t0, Register:i32 %4
> >           t73: i32 = add FrameIndex:i32<-1>, Constant:i32<12>
> >         t68: ch = store<(store (s32) into unknown-address + 12)> t0, t19, t73, undef:i32
> >           t77: i32 = add FrameIndex:i32<-1>, Constant:i32<8>
> >         t74: ch = store<(store (s32) into unknown-address + 8)> t0, t15, t77, undef:i32
> >         t78: ch = store<(store (s32) into %fixed-stack.0 + 4)> t0, t11, t9, undef:i32
> >         t81: ch = store<(store (s32) into %fixed-stack.0)> t0, t5, FrameIndex:i32<-1>, undef:i32
> >       t83: ch = TokenFactor t19:1, t68, t15:1, t74, t11:1, t78, t5:1, t81
> >     t29: ch,glue = callseq_start t83, TargetConstant:i32<0>, TargetConstant:i32<0>
> >   t33: ch,glue = CopyToReg t29, Register:i32 $r0, Constant:i32<16>
> >   t35: ch,glue = CopyToReg t33, Register:i32 $r1, Constant:i32<0>, t33:1
> >   t38: ch,glue = ARMISD::CALL t35, TargetGlobalAddress:i32<i8* (i64)* @malloc> 0, Register:i32 $r0, Register:i32 $r1, RegisterMask:Untyped, t35:1
> >   t40: ch,glue = callseq_end t38, TargetConstant:i32<0>, TargetConstant:i32<-1>, t38:1
> >   t41: i32,ch,glue = CopyFromReg t40, Register:i32 $r0, t40:1
> >   t43: ch = CopyToReg t41:1, Register:i32 %5, t41
> >     t45: i32 = add nuw t41, Constant:i32<8>
> >   t85: ch = store<(store (s8) into %ir.tmp), trunc to i8> t43, Constant:i32<1>, t45, undef:i32
> >       t137: ch = store<(store (s32) into %ir.a12)> t43, t135, FrameIndex:i32<3>, undef:i32
> >       t128: ch = store<(store (s32) into %ir.args, align 8)> t43, t135:1, FrameIndex:i32<0>, undef:i32
> >       t140: ch = store<(store (s32) into %ir.a11)> t43, t122, FrameIndex:i32<2>, undef:i32
> >       t144: ch = store<(store (s32) into %ir.a10)> t43, t109, FrameIndex:i32<1>, undef:i32
> >     t147: ch = TokenFactor t137, t135:2, t128, t140, t122:2, t144, t109:1
> >   t64: ch,glue = CopyToReg t147, Register:i32 $r0, Constant:i32<1065353216>
> >   t66: ch,glue = CopyToReg t64, Register:i32 $r8, Register:i32 %5, t64:1
> >   t109: i32,ch = load<(load (s32))> t85, FrameIndex:i32<-1>, undef:i32
> >   t122: i32,i32,ch = load<(load (s32) from %fixed-stack.0 + 8), <post-inc>> t85, t9, Constant:i32<4>
> >   t135: i32,i32,ch = load<(load (s32)), <post-inc>> t85, t122:1, Constant:i32<4>
> >   t67: ch = ARMISD::RET_FLAG t66, Register:i32 $r0, Register:i32 $r8, t66:1
> > ```
> > 
> > And without:
> > ```
> > SelectionDAG has 58 nodes:
> >   t0: ch = EntryToken
> >   t5: i32,ch = CopyFromReg t0, Register:i32 %1
> >   t9: i32 = add FrameIndex:i32<-1>, Constant:i32<4>
> >   t11: i32,ch = CopyFromReg t0, Register:i32 %2
> >   t15: i32,ch = CopyFromReg t0, Register:i32 %3
> >   t19: i32,ch = CopyFromReg t0, Register:i32 %4
> >   t77: i32 = add FrameIndex:i32<-1>, Constant:i32<8>
> >   t73: i32 = add FrameIndex:i32<-1>, Constant:i32<12>
> >         t68: ch = store<(store (s32) into unknown-address + 12)> t0, t19, t73, undef:i32
> >         t74: ch = store<(store (s32) into unknown-address + 8)> t0, t15, t77, undef:i32
> >         t78: ch = store<(store (s32) into %fixed-stack.0 + 4)> t0, t11, t9, undef:i32
> >         t81: ch = store<(store (s32) into %fixed-stack.0)> t0, t5, FrameIndex:i32<-1>, undef:i32
> >       t83: ch = TokenFactor t19:1, t68, t15:1, t74, t11:1, t78, t5:1, t81
> >     t29: ch,glue = callseq_start t83, TargetConstant:i32<0>, TargetConstant:i32<0>
> >   t33: ch,glue = CopyToReg t29, Register:i32 $r0, Constant:i32<16>
> >   t35: ch,glue = CopyToReg t33, Register:i32 $r1, Constant:i32<0>, t33:1
> >   t38: ch,glue = ARMISD::CALL t35, TargetGlobalAddress:i32<i8* (i64)* @malloc> 0, Register:i32 $r0, Register:i32 $r1, RegisterMask:Untyped, t35:1
> >   t40: ch,glue = callseq_end t38, TargetConstant:i32<0>, TargetConstant:i32<-1>, t38:1
> >   t41: i32,ch,glue = CopyFromReg t40, Register:i32 $r0, t40:1
> >   t43: ch = CopyToReg t41:1, Register:i32 %5, t41
> >     t45: i32 = add nuw t41, Constant:i32<8>
> >   t85: ch = store<(store (s8) into %ir.tmp), trunc to i8> t43, Constant:i32<1>, t45, undef:i32
> >       t135: ch = store<(store (s32) into %ir.a12)> t43, t132, FrameIndex:i32<3>, undef:i32
> >       t127: ch = store<(store (s32) into %ir.args, align 8)> t43, t73, FrameIndex:i32<0>, undef:i32
> >       t138: ch = store<(store (s32) into %ir.a11)> t43, t120, FrameIndex:i32<2>, undef:i32
> >       t142: ch = store<(store (s32) into %ir.a10)> t43, t109, FrameIndex:i32<1>, undef:i32
> >     t145: ch = TokenFactor t135, t132:1, t127, t138, t120:1, t142, t109:1
> >   t64: ch,glue = CopyToReg t145, Register:i32 $r0, Constant:i32<1065353216>
> >   t66: ch,glue = CopyToReg t64, Register:i32 $r8, Register:i32 %5, t64:1
> >   t109: i32,ch = load<(load (s32))> t85, FrameIndex:i32<-1>, undef:i32
> >   t120: i32,ch = load<(load (s32))> t85, t9, undef:i32
> >   t132: i32,ch = load<(load (s32))> t85, t77, undef:i32
> >   t67: ch = ARMISD::RET_FLAG t66, Register:i32 $r0, Register:i32 $r8, t66:1
> > ```
> > 
> > The meaningful change seems to be these crazy loads:
> > ```
> >   t122: i32,i32,ch = load<(load (s32) from %fixed-stack.0 + 8), <post-inc>> t85, t9, Constant:i32<4>
> >   t135: i32,i32,ch = load<(load (s32)), <post-inc>> t85, t122:1, Constant:i32<4>
> > ```
> > 
> > It'd be really helpful to have some ARM expert here to help us knowing what the right path forward is.
> We fail to form a post-inc load, which changes the register allocation, which blocks use of "ldm".
> 
> Post-inc load formation failing to trigger is a bit concerning; makes me think we somehow aren't running the relevant combine late enough.  DAGCombiner::CombineToPreIndexedLoadStore is a target-independent combine with a bit of target-specific code; see getPreIndexedAddressParts().
> 
> "ldm" formation is unreliable on ARM because it runs after register allocation; I wouldn't worry about that part.
I have to apologize because I now notice a major error in my previous message. I swapped the two DAGs. The post-inc loads are in this diff's version of the DAG, and they do not exists on `main`. This has typically be the pattern with these changes: they allow DAGCombine to do more, but sometime, later stages in the pipeline gets confused. Nevertheless, if ldm is combine post register allocation, then it is indeed going to be fragile.

When you say that you wouldn't worry about this, do you mean that, because ldm is generally unreliable (as it happen post register allocation), then we should proceed regardless of this specific change, assuming there is no other outstanding open question?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D124743/new/

https://reviews.llvm.org/D124743



More information about the llvm-commits mailing list