[PATCH] D124743: [DAGCombine] Add node in the worklist in topological order in CombineTo

Fri May 6 17:40:35 PDT 2022

deadalnix added inline comments.

================
Comment at: llvm/test/CodeGen/ARM/swifterror.ll:687
+; CHECK-APPLE-NEXT:    add r0, r0, #4
+; CHECK-APPLE-NEXT:    ldr r2, [r7, #8]
+; CHECK-APPLE-NEXT:    ldr r1, [r0], #4
----------------
efriedma wrote:
> deadalnix wrote:
> > efriedma wrote:
> > > deadalnix wrote:
> > > > deadalnix wrote:
> > > > > RKSimon wrote:
> > > > > > This looks like another lost multiple load/store ?
> > > > > Yes, this looks similar to what is happening to arm64-abi-varargs.ll .
> > > > Investigating this a bit more, it looks like we get some crazy combine happening when CombineTo work in topological order that are missed otherwise.
> > > > 
> > > > The dag looks like this with topological CombineTo:
> > > > ```
> > > > SelectionDAG has 58 nodes:
> > > >   t0: ch = EntryToken
> > > >   t5: i32,ch = CopyFromReg t0, Register:i32 %1
> > > >   t9: i32 = add FrameIndex:i32<-1>, Constant:i32<4>
> > > >   t11: i32,ch = CopyFromReg t0, Register:i32 %2
> > > >   t15: i32,ch = CopyFromReg t0, Register:i32 %3
> > > >   t19: i32,ch = CopyFromReg t0, Register:i32 %4
> > > >           t73: i32 = add FrameIndex:i32<-1>, Constant:i32<12>
> > > >         t68: ch = store<(store (s32) into unknown-address + 12)> t0, t19, t73, undef:i32
> > > >           t77: i32 = add FrameIndex:i32<-1>, Constant:i32<8>
> > > >         t74: ch = store<(store (s32) into unknown-address + 8)> t0, t15, t77, undef:i32
> > > >         t78: ch = store<(store (s32) into %fixed-stack.0 + 4)> t0, t11, t9, undef:i32
> > > >         t81: ch = store<(store (s32) into %fixed-stack.0)> t0, t5, FrameIndex:i32<-1>, undef:i32
> > > >       t83: ch = TokenFactor t19:1, t68, t15:1, t74, t11:1, t78, t5:1, t81
> > > >     t29: ch,glue = callseq_start t83, TargetConstant:i32<0>, TargetConstant:i32<0>
> > > >   t33: ch,glue = CopyToReg t29, Register:i32 $r0, Constant:i32<16>
> > > >   t35: ch,glue = CopyToReg t33, Register:i32 $r1, Constant:i32<0>, t33:1
> > > >   t38: ch,glue = ARMISD::CALL t35, TargetGlobalAddress:i32<i8* (i64)* @malloc> 0, Register:i32 $r0, Register:i32 $r1, RegisterMask:Untyped, t35:1
> > > >   t40: ch,glue = callseq_end t38, TargetConstant:i32<0>, TargetConstant:i32<-1>, t38:1
> > > >   t41: i32,ch,glue = CopyFromReg t40, Register:i32 $r0, t40:1
> > > >   t43: ch = CopyToReg t41:1, Register:i32 %5, t41
> > > >     t45: i32 = add nuw t41, Constant:i32<8>
> > > >   t85: ch = store<(store (s8) into %ir.tmp), trunc to i8> t43, Constant:i32<1>, t45, undef:i32
> > > >       t137: ch = store<(store (s32) into %ir.a12)> t43, t135, FrameIndex:i32<3>, undef:i32
> > > >       t128: ch = store<(store (s32) into %ir.args, align 8)> t43, t135:1, FrameIndex:i32<0>, undef:i32
> > > >       t140: ch = store<(store (s32) into %ir.a11)> t43, t122, FrameIndex:i32<2>, undef:i32
> > > >       t144: ch = store<(store (s32) into %ir.a10)> t43, t109, FrameIndex:i32<1>, undef:i32
> > > >     t147: ch = TokenFactor t137, t135:2, t128, t140, t122:2, t144, t109:1
> > > >   t64: ch,glue = CopyToReg t147, Register:i32 $r0, Constant:i32<1065353216>
> > > >   t66: ch,glue = CopyToReg t64, Register:i32 $r8, Register:i32 %5, t64:1
> > > >   t109: i32,ch = load<(load (s32))> t85, FrameIndex:i32<-1>, undef:i32
> > > >   t122: i32,i32,ch = load<(load (s32) from %fixed-stack.0 + 8), <post-inc>> t85, t9, Constant:i32<4>
> > > >   t135: i32,i32,ch = load<(load (s32)), <post-inc>> t85, t122:1, Constant:i32<4>
> > > >   t67: ch = ARMISD::RET_FLAG t66, Register:i32 $r0, Register:i32 $r8, t66:1
> > > > ```
> > > > 
> > > > And without:
> > > > ```
> > > > SelectionDAG has 58 nodes:
> > > >   t0: ch = EntryToken
> > > >   t5: i32,ch = CopyFromReg t0, Register:i32 %1
> > > >   t9: i32 = add FrameIndex:i32<-1>, Constant:i32<4>
> > > >   t11: i32,ch = CopyFromReg t0, Register:i32 %2
> > > >   t15: i32,ch = CopyFromReg t0, Register:i32 %3
> > > >   t19: i32,ch = CopyFromReg t0, Register:i32 %4
> > > >   t77: i32 = add FrameIndex:i32<-1>, Constant:i32<8>
> > > >   t73: i32 = add FrameIndex:i32<-1>, Constant:i32<12>
> > > >         t68: ch = store<(store (s32) into unknown-address + 12)> t0, t19, t73, undef:i32
> > > >         t74: ch = store<(store (s32) into unknown-address + 8)> t0, t15, t77, undef:i32
> > > >         t78: ch = store<(store (s32) into %fixed-stack.0 + 4)> t0, t11, t9, undef:i32
> > > >         t81: ch = store<(store (s32) into %fixed-stack.0)> t0, t5, FrameIndex:i32<-1>, undef:i32
> > > >       t83: ch = TokenFactor t19:1, t68, t15:1, t74, t11:1, t78, t5:1, t81
> > > >     t29: ch,glue = callseq_start t83, TargetConstant:i32<0>, TargetConstant:i32<0>
> > > >   t33: ch,glue = CopyToReg t29, Register:i32 $r0, Constant:i32<16>
> > > >   t35: ch,glue = CopyToReg t33, Register:i32 $r1, Constant:i32<0>, t33:1
> > > >   t38: ch,glue = ARMISD::CALL t35, TargetGlobalAddress:i32<i8* (i64)* @malloc> 0, Register:i32 $r0, Register:i32 $r1, RegisterMask:Untyped, t35:1
> > > >   t40: ch,glue = callseq_end t38, TargetConstant:i32<0>, TargetConstant:i32<-1>, t38:1
> > > >   t41: i32,ch,glue = CopyFromReg t40, Register:i32 $r0, t40:1
> > > >   t43: ch = CopyToReg t41:1, Register:i32 %5, t41
> > > >     t45: i32 = add nuw t41, Constant:i32<8>
> > > >   t85: ch = store<(store (s8) into %ir.tmp), trunc to i8> t43, Constant:i32<1>, t45, undef:i32
> > > >       t135: ch = store<(store (s32) into %ir.a12)> t43, t132, FrameIndex:i32<3>, undef:i32
> > > >       t127: ch = store<(store (s32) into %ir.args, align 8)> t43, t73, FrameIndex:i32<0>, undef:i32
> > > >       t138: ch = store<(store (s32) into %ir.a11)> t43, t120, FrameIndex:i32<2>, undef:i32
> > > >       t142: ch = store<(store (s32) into %ir.a10)> t43, t109, FrameIndex:i32<1>, undef:i32
> > > >     t145: ch = TokenFactor t135, t132:1, t127, t138, t120:1, t142, t109:1
> > > >   t64: ch,glue = CopyToReg t145, Register:i32 $r0, Constant:i32<1065353216>
> > > >   t66: ch,glue = CopyToReg t64, Register:i32 $r8, Register:i32 %5, t64:1
> > > >   t109: i32,ch = load<(load (s32))> t85, FrameIndex:i32<-1>, undef:i32
> > > >   t120: i32,ch = load<(load (s32))> t85, t9, undef:i32
> > > >   t132: i32,ch = load<(load (s32))> t85, t77, undef:i32
> > > >   t67: ch = ARMISD::RET_FLAG t66, Register:i32 $r0, Register:i32 $r8, t66:1
> > > > ```
> > > > 
> > > > The meaningful change seems to be these crazy loads:
> > > > ```
> > > >   t122: i32,i32,ch = load<(load (s32) from %fixed-stack.0 + 8), <post-inc>> t85, t9, Constant:i32<4>
> > > >   t135: i32,i32,ch = load<(load (s32)), <post-inc>> t85, t122:1, Constant:i32<4>
> > > > ```
> > > > 
> > > > It'd be really helpful to have some ARM expert here to help us knowing what the right path forward is.
> > > We fail to form a post-inc load, which changes the register allocation, which blocks use of "ldm".
> > > 
> > > Post-inc load formation failing to trigger is a bit concerning; makes me think we somehow aren't running the relevant combine late enough.  DAGCombiner::CombineToPreIndexedLoadStore is a target-independent combine with a bit of target-specific code; see getPreIndexedAddressParts().
> > > 
> > > "ldm" formation is unreliable on ARM because it runs after register allocation; I wouldn't worry about that part.
> > I have to apologize because I now notice a major error in my previous message. I swapped the two DAGs. The post-inc loads are in this diff's version of the DAG, and they do not exists on `main`. This has typically be the pattern with these changes: they allow DAGCombine to do more, but sometime, later stages in the pipeline gets confused. Nevertheless, if ldm is combine post register allocation, then it is indeed going to be fragile.
> > 
> > When you say that you wouldn't worry about this, do you mean that, because ldm is generally unreliable (as it happen post register allocation), then we should proceed regardless of this specific change, assuming there is no other outstanding open question?
> Oh, we're forming more post-inc loads?  Then sure, this test change is fine as-is.
> 
> If it causes issues in practice, we can mess with the post-inc formation heuristics.
Yes, we are forming more of them. In a way, this is the whole point of this patch: traverse the node in a way that exposes more combining opportunities, but in this case, it looks like we either don't want to do the combine as aggressively, or we want to handle it better in later stages. Regressions in the ARM and AArch64 backends are variation around that same theme.

Now the question is what do we do from there? Do we land this and try to mess with the heuristics?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D124743/new/

https://reviews.llvm.org/D124743