[PATCH] MachineSink: Fix and tweak critical-edge breaking heuristic.

Mon Oct 14 10:03:21 PDT 2013

r192608, thanks!!

~Will

On Mon, Oct 14, 2013 at 11:18 AM, Andrew Trick <atrick at apple.com> wrote:
>
> On Oct 12, 2013, at 8:04 PM, Will Dietz <w at wdtz.org> wrote:
>
> Thanks for the response and for testing!  Much appreciated :).
>
> I took at look at misched-balance.ll, and it seems
> the difference is due to on OSX extra registers r8-r11 are used
> for saving the results of GOTPCREL, while these
> registers aren't used on linux since it directly references "a(%rip)".
>
> Attached is an updated version of the patch that changes
> this test to explicitly use the linux triple so the test
> has the same result on all platforms.
>
> Look good to commit? :)
>
>
> Yep. LGTM. Thanks.
> -Andy
>
>
> Thank you for your time!
>
> ~Will
>
>
> On Sat, Oct 12, 2013 at 12:16 AM, Andrew Trick <atrick at apple.com> wrote:
>
>
> On Oct 11, 2013, at 5:49 PM, Will Dietz <w at wdtz.org> wrote:
>
> Updated patch attached.
>
> Fixed the select test (although still not sure what it's supposed to
> be testing exactly)
> and XFAIL'ing the hoist-common test since it's broken as-is regardless
> of this change.
>
> Also, added a functionality test where the new heuristic results in nicer
> code.
>
> FWIW it seems other changes in mainline have hidden the performance
> issue encountered in
> the original bug report[1] but pushing for this change anyway since it seems
> better for the reasons mentioned in this discussion and the commit message
> (and it's possible the underlying issue will crop up later if not
> addressed).
>
> Thank you for your time!
>
> ~Will
>
> [1] http://llvm.org/PR17245
>
>
> Thanks for pursuing this. Sorry I didn’t respond last time. This looks fine
> to me. I think you’ve done as much as can be expected to keep the unit tests
> functioning.
>
> With your patch, I see a failure in misched-balance.ll @encpc1 (strange that
> you didn’t see it). You are correctly not sinking as much code, but we have
> extra register pressure. I think you add a couple CHECK lines to allow for a
> single callee-save spill but no more:
>
> +; CHECK: push
> ; CHECK-NOT: push
> +; CHECK: pop
> ; CHECK-NOT: pop
>
> Thanks!
>
> -Andy
>
> On Tue, Sep 17, 2013 at 5:14 PM, Will Dietz <w at wdtz.org> wrote:
>
> (Renato, adding you to cc per discussion on the bugzilla)
>
> Thank you both for helping moving this forward.
>
> Attached is an updated patch that
>
> a) Fixes comment text (sorry!)
> b) Resolves most of the lit tests.
>
> The remaining two tests that fail are:
>
> Failing Tests (2):
>   LLVM :: CodeGen/ARM/2012-08-30-select.ll
>   LLVM :: CodeGen/X86/hoist-common.ll
>
> Explanation follow:
>
> [CodeGen/ARM/2012-08-30-select.ll]
> It's unclear what this test is attempting to test, so I'm not sure if
> it's broken or not :/.
> Additionally it seems unclear which assembly is more desirable for
> LLVM to produce in this case.
> Would pinging the test's author be appropriate here?
>
>
> [CodeGen/X86/hoist-common.ll]
> The more I look into this one the more it seems that the originally
> tested functionality hasn't been
> working as expected for a while.  For reference, here's the commit
> introducing this test (commit message is relevant):
>
> http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20110509/120636.html
>
> The assembly produced by ToT on my machine is as follows:
>
> hoist-common-before-patch:
> ---------------------------
>       .section        __TEXT,__text,regular,pure_instructions
>       .globl  _t
>       .align  4, 0x90
> _t:                                     ## @t
> ## BB#0:                                ## %entry
>       pushq   %rax
>       xorl    %eax, %eax
>       testl   %edi, %edi
>       je      LBB0_2
> ## BB#1:                                ## %if.then
>       xorl    %eax, %eax
>       callq   _foo
>                                       ## kill: AL<def> AL<kill> EAX<def>
> LBB0_2:                                 ## %return
>                                       ## kill: AL<def> AL<kill> EAX<kill>
>       popq    %rdx
>       ret
> ---------------------------
>
> which is leaving the redundant 'xorl' before the call to foo.
>
> Adding some debug prints to BranchFolding suggests it's not triggering
> at all on ToT for this code,
> but with my patch (which produces the assembly given below) it
> considers hoisting the two
> MOV32r0's but punts after concluding the two are not identical.  FWIW,
> at the MI level, they look like this:
>
> %EAX<def,dead> = MOV32r0 %EFLAGS<imp-def,dead>, %AL<imp-def>
>
> and
>
> %EAX<def> = MOV32r0 %EFLAGS<imp-def,dead>
>
> As promised, here's the assembly with my patch:
>
> hoist-common-after-patch:
> ---------------------------
>       .section        __TEXT,__text,regular,pure_instructions
>       .globl  _t
>       .align  4, 0x90
> _t:                                     ## @t
> ## BB#0:                                ## %entry
>       pushq   %rax
>       testl   %edi, %edi
>       je      LBB0_1
> ## BB#2:                                ## %if.then
>       xorl    %eax, %eax
>       callq   _foo
>                                       ## kill: AL<def> AL<kill> EAX<def>
>       jmp     LBB0_3
> LBB0_1:
>       xorl    %eax, %eax
> LBB0_3:                                 ## %return
>                                       ## kill: AL<def> AL<kill> EAX<kill>
>       popq    %rdx
>       ret
> ---------------------------
>
> In short, something is wrong but it doesn't seem to be due to
> MachineSink's behavior.
>
> Thoughts/comments on how to best proceed would be appreciated! :)
>
> Thank you for your time,
>
> ~Will
>
> PS I lurk on #llvm as dtzWill if you're on IRC, that might help get
> this resolved more quickly :).
>
> On Tue, Sep 17, 2013 at 3:49 PM, Will Dietz <w at wdtz.org> wrote:
>
> Thanks for your response, Andrew!
>
> Glad I'm on the right track :).
>
> Regarding the comment, let me try to explain again and if that seems
> agreeable
> I'll update the patch so it's understandable:
>
> The heuristic here is primarily trying to reason about whether this cheap
> instruction blocks another instruction from being sunk as well.  We want
> at least two instructions to merit splitting the critical edge.
>
> To determine this, we check:
>
> * The definition has only one use.
>
> I admint to making use of this partially because of its use elsewhere
> in this code,
> but the primary reasoning is that it looks like a cheap property to check
> to ensure the definition would still dominate all its uses if it was moved
> along
> with this use.  If there's just the single use then moving them together
> trivially satisfies this.  If there are more uses this becomes more
> complicated.
>
> * Definition lives in same MBB
>
> This is what the quoted comment is attempting to justify.  This relies
> on the fact that
> MachineSink only moves instructions into direct successors, iterating
> to move instructions
> further than that (and taking advantage of other opportunities as it
> goes).  Since we're looking
> for cases where moving the definition with the use is possible, they must
> reside
> in the same block for that to be possible this iteration.
>
> As an aside, if moving this use were to lead to MachineSink moving the
> definition
> into the new split-edge block in future iterations I believe it's
> necessarily the case that
> MachineSink in future iterations will also move the definition into *this*
> block
> without us moving the use.  Using this property saves us from more expensive
> reasoning wihile still handling some more complicated situations nicely
> (and avoiding unnecessarily breaking the edge for cases where the
> definition cannot be sunk into this block first).
>
> This is somewhat weak as it becomes invalidated if/when MachineSink
> is improved to move instructions farther than immediate successors.
> However I think if that were to happen it'd be enough of a rewrite
> that this heuristic would be unlikely to still be used at all.
>
> -----------------------
>
> Having just written all this up I see now that my comment is actually
> just plain malformed, not necessarily that the ideas behind it
> were misunderstood or needed explaining.  Sorry about that,
> sending along anyway for sanity checking :).
>
> Does the above make sense and sound reasonable?
>
> I'll take a look at the tests...
>
> (Have you seen the discussion on the bug report?)
>
> Thanks!
>
> ~Will
>
>
> On Tue, Sep 17, 2013 at 1:12 AM, Andrew Trick <atrick at apple.com> wrote:
>
>
> On Sep 14, 2013, at 2:53 PM, Will Dietz <w at wdtz.org> wrote:
>
> Managed to reduce the IR, attached to newly filed PR17245
> (http://llvm.org/PR17245 ) for discussion and investigation.
>
>
> This looks great. Thanks Will.
>
> I can't quite parse this comment.
>
> +      // reason about blocking it.  Other iterations
> +      // will look to sink it block and block until
> +      // it resides in this MBB at which point
> +      // this will evaluate to true.
>
> Regarding the test failures. I think you should determine whether:
> (1) MachineSinking is now doing the intended thing
> (2) We're not breaking any codegen patterns (downstream passes still do
> something reasonable).
> (3) The test is still testing something.
>
> If all of the above, then just fixup the CHECK lines.
>
> For example, in CodeGen/ARM/2011-04-11-MachineLICMBug.ll, you're now sinking
> multiple instructions into the loop's exit paths. I think that's intended
> behavior since we consider multiple instructions to be profitable for
> splitting. The final code looks better. And the test still checks that LICM
> doesn't generate extra copies. It looks like the CHECK labels could use some
> adjustment.
>
> -Andy
>
>
> On Tue, Sep 10, 2013 at 8:24 PM, Will Dietz <w at wdtz.org> wrote:
>
> Apologies for not better motivating this in my original post,
> I'd already spent more time on this than I probably should (dang
> interesting compilers!)
> and was hoping what I had would be sufficient.
>
> Since no one's responded yet I'll try to better explain the issue
> and why this is important.  Thoughts/comments welcome :).
>
> Here's a before-and-after stats and timings for the regression I'm
> observing:
>
> Before: https://pastee.org/3w6fa (28m47s)
> After: https://pastee.org/6hx3u (3m49s)
>
> The code in question is an instrumented copy of some image
> manipulation code from ImageMagick.
> I'd rather not post it as it was generated by an unpublished research
> project, but would be happy
> to provide it upon request to anyone interested in looking into this issue.
>
> That said, the BC probably isn't required for understanding why the
> existing code is problematic:
>
> The timings above show something certainly isn't working properly
> in terms of scalability (look at the MachineSink timings!)
> and I walk through the issue in more detail below.
> Regardless of my patch's details hopefully this will shed
> some light on the situation and help get this resolved.
>
> In short, the heuristic in isWorthBreakingCriticalEdge() is O(n) in the
> number
> of instructions within the function that define registers used by the
> instruction
> being queried.  This leads to unscalable behavior when the number of
> candidate
> cheap instructions for sinking is particularly high, potentially O(n^2)
> overall.
>
> Looking deeper, it seems the hasOneNonDBGUse() is expensive when the
> register
> has many definitions as it has to walk through all of the definitions
> before reaching a use.
> This happens with EFLAGS easily since many instructions touch it--
> on my problematic IR this results in walking over 300k operands for each
> call to hasOneNonDBGUse().
>
> From this alone I'd suggest the code be changed to avoid this
> unscalable behavior, especially since the function is already labelled
> as using questionable heuristics.  Note that simply avoiding calling
> hasOneNonDBGUse() on physical registers is an effective solution
> since virtual registers will never have more than one definition
> due to their SSA-like nature.  Furthermore, I argue that ignoring both
> uses and definitions physical registers is acceptable (see below).
>
> While investigating the scalability issue, it appears
> the loop does not do what it was originally intended to do,
> judging by the comment above it (reproduced below):
>
> -------------------
> // MI is cheap, we probably don't want to break the critical edge for it.
> // However, if this would allow some definitions of its source operands
> // to be sunk then it's probably worth it.
> -------------------
>
> Which doesn't seem to be what the loop presently does:
>
> * It considers registers /defined/ by this instruction.  While sinking
> an instruction that defines a register with a single use might
> be helpful for other reasons (reducing the live range),
> this isn't what the comment claims the purpose of the loop is.
> * Physical register uses aren't interesting because MachineSink
> refuses to sink instructions that define physical registers.  This means
> sinking the use of a physical register will not unblock its definition
> since the definition is incapable of being sunk.
> * It makes no attempt to determine if the definition instruction
> is suitable for sinking into the broken critical edge in the first place.
> I improve this somewhat by only recognizing an instruction as a "blocker"
> if the defining instruction is in the same MBB.  This is incomplete of
> course,
> but helps avoid spuriously breaking critical edges for instructions we know
> can't be sunk with this one.
>
> Hope this helps clarify and explain the situation.  Honestly the right
> answer
> might lie elsewhere (perhaps in a more thorough reworking of the pass),
> but unfortunately that's more than I have time for presently :(.
>
> Thoughts/comments welcome! :)
>
> ~Will
>
>
> On Sat, Sep 7, 2013 at 10:51 PM, Will Dietz <w at wdtz.org> wrote:
>
> See attached, introduction is below and a description from commit
> message is repeated at the end of this message for easier processing
> and discussion.
>
> This fixes a serious compile-time performance issue for me and hopefully
> causes LLVM to generate slightly nicer code.  Unfortunately I'm not in
> a position
> to do appropriate performance or platform testing of this (perhaps
> minor) change, so please let me know how I can best proceed or if
> that's something
> to leave to nightly testing after the change is finished and passes review.
>
> Unfortunately, this breaks a few lit tests in ways that aren't
> obviously problematic upon inspection, but it's unclear to me how to
> properly resolve them.  The tests in question are:
>
>  LLVM :: CodeGen/ARM/2011-04-11-MachineLICMBug.ll
>  LLVM :: CodeGen/ARM/2011-08-25-ldmia_ret.ll
>  LLVM :: CodeGen/ARM/2012-08-30-select.ll
>  LLVM :: CodeGen/X86/2012-11-30-handlemove-dbg.ll
>  LLVM :: CodeGen/X86/fold-load.ll
>  LLVM :: CodeGen/X86/hoist-common.ll
>
> Can someone more familiar with these help me work through them?
>
> A self-hosted copy of LLVM/Clang with this change passes its tests
> (other than the ones above) and seems to build various applications
> properly, so
> I'm hopeful these just need to be updated to accommodate this change
> and aren't signs of regressions or correctness issues.
>
> Please let me know if you have any questions/comments/feedback, and
> thank you for your time.
>
> --------------------------------------------------------
>
> Per original comment, the intention of this loop
> is to go ahead and break the critical edge
> (in order to sink this instruction) if there's
> reason to believe doing so might "unblock" the
> sinking of additional instructions that define
> registers used by this one.  The idea is that if
> we have a few instructions to sink "together"
> breaking the edge might be worthwhile.
>
> This commit makes a few small changes
> to help better realize this goal:
>
> First, modify the loop to ignore registers
> defined by this instruction.  We don't
> sink definitions of physical registers,
> and sinking an SSA definition isn't
> going to unblock an upstream instruction.
>
> Second, ignore uses of physical registers.
> Instructions that define physical registers are
> rejected for sinking, and so moving this one
> won't enable moving any defining instructions.
> As an added bonus, while virtual register
> use-def chains are generally small due
> to SSA goodness, iteration over the uses
> and definitions (used by hasOneNonDBGUse)
> for physical registers like EFLAGS
> can be rather expensive in practice.
> (This is the original reason for looking at this)
>
> Finally, to keep things simple continue
> to only consider this trick for registers that
> have a single use (via hasOneNonDBGUse),
> but to avoid spuriously breaking critical edges
> only do so if the definition resides
> in the same MBB and therefore this one directly
> blocks it from being sunk as well.
> If sinking them together is meant to be,
> let the iterative nature of this pass
> sink the definition into this block first.
>
> ~Will
>
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>
>
> <0001-MachineSink-Fix-and-tweak-critical-edge-breaking-heu.patch>
>
>
> <0001-MachineSink-Fix-and-tweak-critical-edge-breaking-heu.patch>
>
>