[PATCH] D124743: [DAGCombine] Add node in the worklist in topological order in CombineTo
Amaury SECHET via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Fri May 6 17:16:15 PDT 2022
deadalnix added inline comments.
================
Comment at: llvm/test/CodeGen/AArch64/arm64-abi-varargs.ll:23
+; CHECK-NEXT: str w9, [sp, #16]
+; CHECK-NEXT: ldr w9, [x8], #8
; CHECK-NEXT: str x8, [sp, #24]
----------------
Pre DAG:
```
SelectionDAG has 55 nodes:
t0: ch = EntryToken
t19: i32,ch = load<(load (s32) from %fixed-stack.1, align 16)> t0, FrameIndex:i64<-1>, undef:i64
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t51: ch = store<(store (s32) into %ir.a1)> t19:1, t19, t2, undef:i64
t98: i32,ch = load<(load (s32) from %fixed-stack.0, align 8)> t51, FrameIndex:i64<-2>, undef:i64
t84: i64 = add FrameIndex:i64<-2>, Constant:i64<8>
t109: i32,ch = load<(load (s32) from %fixed-stack.0 + 8, align 8)> t51, t84, undef:i64
t108: i64 = add FrameIndex:i64<-2>, Constant:i64<16>
t121: i32,ch = load<(load (s32) from %fixed-stack.0 + 16, align 8)> t51, t108, undef:i64
t128: ch = store<(store (s32) into %ir.a12)> t0, t121, FrameIndex:i64<12>, undef:i64
t120: i64 = add FrameIndex:i64<-2>, Constant:i64<24>
t116: ch = store<(store (s64) into %ir.args)> t0, t120, FrameIndex:i64<9>, undef:i64
t124: ch = store<(store (s32) into %ir.a11)> t0, t109, FrameIndex:i64<11>, undef:i64
t131: ch = store<(store (s32) into %ir.a10)> t0, t98, FrameIndex:i64<10>, undef:i64
t16: i32,ch = CopyFromReg t0, Register:i32 %7
t53: ch = store<(store (s32) into %ir.8)> t0, t16, FrameIndex:i64<7>, undef:i64
t14: i32,ch = CopyFromReg t0, Register:i32 %6
t56: ch = store<(store (s32) into %ir.7)> t0, t14, FrameIndex:i64<6>, undef:i64
t12: i32,ch = CopyFromReg t0, Register:i32 %5
t59: ch = store<(store (s32) into %ir.6)> t0, t12, FrameIndex:i64<5>, undef:i64
t10: i32,ch = CopyFromReg t0, Register:i32 %4
t62: ch = store<(store (s32) into %ir.5)> t0, t10, FrameIndex:i64<4>, undef:i64
t8: i32,ch = CopyFromReg t0, Register:i32 %3
t65: ch = store<(store (s32) into %ir.4)> t0, t8, FrameIndex:i64<3>, undef:i64
t6: i32,ch = CopyFromReg t0, Register:i32 %2
t68: ch = store<(store (s32) into %ir.3)> t0, t6, FrameIndex:i64<2>, undef:i64
t4: i32,ch = CopyFromReg t0, Register:i32 %1
t71: ch = store<(store (s32) into %ir.2)> t0, t4, FrameIndex:i64<1>, undef:i64
t134: ch = TokenFactor t128, t121:1, t116, t124, t109:1, t131, t98:1, t53, t56, t59, t62, t65, t68, t71
t50: ch = AArch64ISD::RET_FLAG t134
```
Post DAG:
```
SelectionDAG has 51 nodes:
t0: ch = EntryToken
t19: i32,ch = load<(load (s32) from %fixed-stack.1, align 16)> t0, FrameIndex:i64<-1>, undef:i64
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t51: ch = store<(store (s32) into %ir.a1)> t19:1, t19, t2, undef:i64
t98: i32,ch = load<(load (s32) from %fixed-stack.0, align 8)> t51, FrameIndex:i64<-2>, undef:i64
t84: i64 = add FrameIndex:i64<-2>, Constant:i64<8>
t109: i32,i64,ch = load<(load (s32) from %fixed-stack.0 + 8, align 8), <post-inc>> t51, t84, Constant:i64<8>
t120: i32,i64,ch = load<(load (s32)), <post-inc>> t51, t109:1, Constant:i64<8>
t126: ch = store<(store (s32) into %ir.a12)> t0, t120, FrameIndex:i64<12>, undef:i64
t115: ch = store<(store (s64) into %ir.args)> t0, t120:1, FrameIndex:i64<9>, undef:i64
t122: ch = store<(store (s32) into %ir.a11)> t0, t109, FrameIndex:i64<11>, undef:i64
t129: ch = store<(store (s32) into %ir.a10)> t0, t98, FrameIndex:i64<10>, undef:i64
t16: i32,ch = CopyFromReg t0, Register:i32 %7
t53: ch = store<(store (s32) into %ir.8)> t0, t16, FrameIndex:i64<7>, undef:i64
t14: i32,ch = CopyFromReg t0, Register:i32 %6
t56: ch = store<(store (s32) into %ir.7)> t0, t14, FrameIndex:i64<6>, undef:i64
t12: i32,ch = CopyFromReg t0, Register:i32 %5
t59: ch = store<(store (s32) into %ir.6)> t0, t12, FrameIndex:i64<5>, undef:i64
t10: i32,ch = CopyFromReg t0, Register:i32 %4
t62: ch = store<(store (s32) into %ir.5)> t0, t10, FrameIndex:i64<4>, undef:i64
t8: i32,ch = CopyFromReg t0, Register:i32 %3
t65: ch = store<(store (s32) into %ir.4)> t0, t8, FrameIndex:i64<3>, undef:i64
t6: i32,ch = CopyFromReg t0, Register:i32 %2
t68: ch = store<(store (s32) into %ir.3)> t0, t6, FrameIndex:i64<2>, undef:i64
t4: i32,ch = CopyFromReg t0, Register:i32 %1
t71: ch = store<(store (s32) into %ir.2)> t0, t4, FrameIndex:i64<1>, undef:i64
t132: ch = TokenFactor t126, t120:2, t115, t122, t109:2, t129, t98:1, t53, t56, t59, t62, t65, t68, t71
t50: ch = AArch64ISD::RET_FLAG t132
```
It is hard to argue with this one that the DAG after this change to the combiner is worse, it definitively seems better to me.
================
Comment at: llvm/test/CodeGen/AArch64/swifterror.ll:946
+; CHECK-APPLE-AARCH64-NEXT: str w8, [sp, #16]
+; CHECK-APPLE-AARCH64-NEXT: ldr w8, [x9], #8
; CHECK-APPLE-AARCH64-NEXT: fmov s0, #1.00000000
----------------
Pre DAG:
```
SelectionDAG has 41 nodes:
t0: ch = EntryToken
t6: ch,glue = callseq_start t0, TargetConstant:i64<0>, TargetConstant:i64<0>
t10: ch,glue = CopyToReg t6, Register:i64 $x0, Constant:i64<16>
t13: ch,glue = AArch64ISD::CALL t10, TargetGlobalAddress:i64<i8* (i64)* @malloc> 0, Register:i64 $x0, RegisterMask:Untyped, t10:1
t14: ch,glue = callseq_end t13, TargetConstant:i64<0>, TargetConstant:i64<0>, t13:1
t15: i64,ch,glue = CopyFromReg t14, Register:i64 $x0, t14:1
t17: ch = CopyToReg t15:1, Register:i64 %1, t15
t19: i64 = add nuw t15, Constant:i64<8>
t45: ch = store<(store (s8) into %ir.tmp), trunc to i8> t17, Constant:i32<1>, t19, undef:i64
t100: ch = store<(store (s32) into %ir.a12)> t17, t97, FrameIndex:i64<3>, undef:i64
t95: i64 = add FrameIndex:i64<-1>, Constant:i64<24>
t90: ch = store<(store (s64) into %ir.args)> t17, t95, FrameIndex:i64<0>, undef:i64
t103: ch = store<(store (s32) into %ir.a11)> t17, t83, FrameIndex:i64<2>, undef:i64
t107: ch = store<(store (s32) into %ir.a10)> t17, t71, FrameIndex:i64<1>, undef:i64
t110: ch = TokenFactor t100, t97:1, t90, t103, t83:1, t107, t71:1
t40: ch,glue = CopyToReg t110, Register:f32 $s0, ConstantFP:f32<1.000000e+00>
t42: ch,glue = CopyToReg t40, Register:i64 $x21, Register:i64 %1, t40:1
t71: i32,ch = load<(load (s32) from %fixed-stack.0, align 16)> t45, FrameIndex:i64<-1>, undef:i64
t69: i64 = or FrameIndex:i64<-1>, Constant:i64<8>
t83: i32,ch = load<(load (s32), align 8)> t45, t69, undef:i64
t81: i64 = add FrameIndex:i64<-1>, Constant:i64<16>
t97: i32,ch = load<(load (s32) from %fixed-stack.0 + 16, align 16)> t45, t81, undef:i64
t43: ch = AArch64ISD::RET_FLAG t42, Register:f32 $s0, Register:i64 $x21, t42:1
```
Post DAG:
```
SelectionDAG has 38 nodes:
t0: ch = EntryToken
t6: ch,glue = callseq_start t0, TargetConstant:i64<0>, TargetConstant:i64<0>
t10: ch,glue = CopyToReg t6, Register:i64 $x0, Constant:i64<16>
t13: ch,glue = AArch64ISD::CALL t10, TargetGlobalAddress:i64<i8* (i64)* @malloc> 0, Register:i64 $x0, RegisterMask:Untyped, t10:1
t14: ch,glue = callseq_end t13, TargetConstant:i64<0>, TargetConstant:i64<0>, t13:1
t15: i64,ch,glue = CopyFromReg t14, Register:i64 $x0, t14:1
t17: ch = CopyToReg t15:1, Register:i64 %1, t15
t19: i64 = add nuw t15, Constant:i64<8>
t45: ch = store<(store (s8) into %ir.tmp), trunc to i8> t17, Constant:i32<1>, t19, undef:i64
t99: ch = store<(store (s32) into %ir.a12)> t17, t97, FrameIndex:i64<3>, undef:i64
t90: ch = store<(store (s64) into %ir.args)> t17, t97:1, FrameIndex:i64<0>, undef:i64
t102: ch = store<(store (s32) into %ir.a11)> t17, t84, FrameIndex:i64<2>, undef:i64
t106: ch = store<(store (s32) into %ir.a10)> t17, t70, FrameIndex:i64<1>, undef:i64
t109: ch = TokenFactor t99, t97:2, t90, t102, t84:2, t106, t70:1
t40: ch,glue = CopyToReg t109, Register:f32 $s0, ConstantFP:f32<1.000000e+00>
t42: ch,glue = CopyToReg t40, Register:i64 $x21, Register:i64 %1, t40:1
t70: i32,ch = load<(load (s32) from %fixed-stack.0, align 16)> t45, FrameIndex:i64<-1>, undef:i64
t73: i64 = or FrameIndex:i64<-1>, Constant:i64<8>
t84: i32,i64,ch = load<(load (s32), align 8), <post-inc>> t45, t73, Constant:i64<8>
t97: i32,i64,ch = load<(load (s32)), <post-inc>> t45, t84:1, Constant:i64<8>
t43: ch = AArch64ISD::RET_FLAG t42, Register:f32 $s0, Register:i64 $x21, t42:1
```
Once again, this looks like it is improving things at the DAGCombine level, but trips up later stages in the backend.
================
Comment at: llvm/test/CodeGen/ARM/swifterror.ll:687
+; CHECK-APPLE-NEXT: add r0, r0, #4
+; CHECK-APPLE-NEXT: ldr r2, [r7, #8]
+; CHECK-APPLE-NEXT: ldr r1, [r0], #4
----------------
efriedma wrote:
> deadalnix wrote:
> > deadalnix wrote:
> > > RKSimon wrote:
> > > > This looks like another lost multiple load/store ?
> > > Yes, this looks similar to what is happening to arm64-abi-varargs.ll .
> > Investigating this a bit more, it looks like we get some crazy combine happening when CombineTo work in topological order that are missed otherwise.
> >
> > The dag looks like this with topological CombineTo:
> > ```
> > SelectionDAG has 58 nodes:
> > t0: ch = EntryToken
> > t5: i32,ch = CopyFromReg t0, Register:i32 %1
> > t9: i32 = add FrameIndex:i32<-1>, Constant:i32<4>
> > t11: i32,ch = CopyFromReg t0, Register:i32 %2
> > t15: i32,ch = CopyFromReg t0, Register:i32 %3
> > t19: i32,ch = CopyFromReg t0, Register:i32 %4
> > t73: i32 = add FrameIndex:i32<-1>, Constant:i32<12>
> > t68: ch = store<(store (s32) into unknown-address + 12)> t0, t19, t73, undef:i32
> > t77: i32 = add FrameIndex:i32<-1>, Constant:i32<8>
> > t74: ch = store<(store (s32) into unknown-address + 8)> t0, t15, t77, undef:i32
> > t78: ch = store<(store (s32) into %fixed-stack.0 + 4)> t0, t11, t9, undef:i32
> > t81: ch = store<(store (s32) into %fixed-stack.0)> t0, t5, FrameIndex:i32<-1>, undef:i32
> > t83: ch = TokenFactor t19:1, t68, t15:1, t74, t11:1, t78, t5:1, t81
> > t29: ch,glue = callseq_start t83, TargetConstant:i32<0>, TargetConstant:i32<0>
> > t33: ch,glue = CopyToReg t29, Register:i32 $r0, Constant:i32<16>
> > t35: ch,glue = CopyToReg t33, Register:i32 $r1, Constant:i32<0>, t33:1
> > t38: ch,glue = ARMISD::CALL t35, TargetGlobalAddress:i32<i8* (i64)* @malloc> 0, Register:i32 $r0, Register:i32 $r1, RegisterMask:Untyped, t35:1
> > t40: ch,glue = callseq_end t38, TargetConstant:i32<0>, TargetConstant:i32<-1>, t38:1
> > t41: i32,ch,glue = CopyFromReg t40, Register:i32 $r0, t40:1
> > t43: ch = CopyToReg t41:1, Register:i32 %5, t41
> > t45: i32 = add nuw t41, Constant:i32<8>
> > t85: ch = store<(store (s8) into %ir.tmp), trunc to i8> t43, Constant:i32<1>, t45, undef:i32
> > t137: ch = store<(store (s32) into %ir.a12)> t43, t135, FrameIndex:i32<3>, undef:i32
> > t128: ch = store<(store (s32) into %ir.args, align 8)> t43, t135:1, FrameIndex:i32<0>, undef:i32
> > t140: ch = store<(store (s32) into %ir.a11)> t43, t122, FrameIndex:i32<2>, undef:i32
> > t144: ch = store<(store (s32) into %ir.a10)> t43, t109, FrameIndex:i32<1>, undef:i32
> > t147: ch = TokenFactor t137, t135:2, t128, t140, t122:2, t144, t109:1
> > t64: ch,glue = CopyToReg t147, Register:i32 $r0, Constant:i32<1065353216>
> > t66: ch,glue = CopyToReg t64, Register:i32 $r8, Register:i32 %5, t64:1
> > t109: i32,ch = load<(load (s32))> t85, FrameIndex:i32<-1>, undef:i32
> > t122: i32,i32,ch = load<(load (s32) from %fixed-stack.0 + 8), <post-inc>> t85, t9, Constant:i32<4>
> > t135: i32,i32,ch = load<(load (s32)), <post-inc>> t85, t122:1, Constant:i32<4>
> > t67: ch = ARMISD::RET_FLAG t66, Register:i32 $r0, Register:i32 $r8, t66:1
> > ```
> >
> > And without:
> > ```
> > SelectionDAG has 58 nodes:
> > t0: ch = EntryToken
> > t5: i32,ch = CopyFromReg t0, Register:i32 %1
> > t9: i32 = add FrameIndex:i32<-1>, Constant:i32<4>
> > t11: i32,ch = CopyFromReg t0, Register:i32 %2
> > t15: i32,ch = CopyFromReg t0, Register:i32 %3
> > t19: i32,ch = CopyFromReg t0, Register:i32 %4
> > t77: i32 = add FrameIndex:i32<-1>, Constant:i32<8>
> > t73: i32 = add FrameIndex:i32<-1>, Constant:i32<12>
> > t68: ch = store<(store (s32) into unknown-address + 12)> t0, t19, t73, undef:i32
> > t74: ch = store<(store (s32) into unknown-address + 8)> t0, t15, t77, undef:i32
> > t78: ch = store<(store (s32) into %fixed-stack.0 + 4)> t0, t11, t9, undef:i32
> > t81: ch = store<(store (s32) into %fixed-stack.0)> t0, t5, FrameIndex:i32<-1>, undef:i32
> > t83: ch = TokenFactor t19:1, t68, t15:1, t74, t11:1, t78, t5:1, t81
> > t29: ch,glue = callseq_start t83, TargetConstant:i32<0>, TargetConstant:i32<0>
> > t33: ch,glue = CopyToReg t29, Register:i32 $r0, Constant:i32<16>
> > t35: ch,glue = CopyToReg t33, Register:i32 $r1, Constant:i32<0>, t33:1
> > t38: ch,glue = ARMISD::CALL t35, TargetGlobalAddress:i32<i8* (i64)* @malloc> 0, Register:i32 $r0, Register:i32 $r1, RegisterMask:Untyped, t35:1
> > t40: ch,glue = callseq_end t38, TargetConstant:i32<0>, TargetConstant:i32<-1>, t38:1
> > t41: i32,ch,glue = CopyFromReg t40, Register:i32 $r0, t40:1
> > t43: ch = CopyToReg t41:1, Register:i32 %5, t41
> > t45: i32 = add nuw t41, Constant:i32<8>
> > t85: ch = store<(store (s8) into %ir.tmp), trunc to i8> t43, Constant:i32<1>, t45, undef:i32
> > t135: ch = store<(store (s32) into %ir.a12)> t43, t132, FrameIndex:i32<3>, undef:i32
> > t127: ch = store<(store (s32) into %ir.args, align 8)> t43, t73, FrameIndex:i32<0>, undef:i32
> > t138: ch = store<(store (s32) into %ir.a11)> t43, t120, FrameIndex:i32<2>, undef:i32
> > t142: ch = store<(store (s32) into %ir.a10)> t43, t109, FrameIndex:i32<1>, undef:i32
> > t145: ch = TokenFactor t135, t132:1, t127, t138, t120:1, t142, t109:1
> > t64: ch,glue = CopyToReg t145, Register:i32 $r0, Constant:i32<1065353216>
> > t66: ch,glue = CopyToReg t64, Register:i32 $r8, Register:i32 %5, t64:1
> > t109: i32,ch = load<(load (s32))> t85, FrameIndex:i32<-1>, undef:i32
> > t120: i32,ch = load<(load (s32))> t85, t9, undef:i32
> > t132: i32,ch = load<(load (s32))> t85, t77, undef:i32
> > t67: ch = ARMISD::RET_FLAG t66, Register:i32 $r0, Register:i32 $r8, t66:1
> > ```
> >
> > The meaningful change seems to be these crazy loads:
> > ```
> > t122: i32,i32,ch = load<(load (s32) from %fixed-stack.0 + 8), <post-inc>> t85, t9, Constant:i32<4>
> > t135: i32,i32,ch = load<(load (s32)), <post-inc>> t85, t122:1, Constant:i32<4>
> > ```
> >
> > It'd be really helpful to have some ARM expert here to help us knowing what the right path forward is.
> We fail to form a post-inc load, which changes the register allocation, which blocks use of "ldm".
>
> Post-inc load formation failing to trigger is a bit concerning; makes me think we somehow aren't running the relevant combine late enough. DAGCombiner::CombineToPreIndexedLoadStore is a target-independent combine with a bit of target-specific code; see getPreIndexedAddressParts().
>
> "ldm" formation is unreliable on ARM because it runs after register allocation; I wouldn't worry about that part.
I have to apologize because I now notice a major error in my previous message. I swapped the two DAGs. The post-inc loads are in this diff's version of the DAG, and they do not exists on `main`. This has typically be the pattern with these changes: they allow DAGCombine to do more, but sometime, later stages in the pipeline gets confused. Nevertheless, if ldm is combine post register allocation, then it is indeed going to be fragile.
When you say that you wouldn't worry about this, do you mean that, because ldm is generally unreliable (as it happen post register allocation), then we should proceed regardless of this specific change, assuming there is no other outstanding open question?
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D124743/new/
https://reviews.llvm.org/D124743
More information about the llvm-commits
mailing list