[PATCH] MachineSink: Fix and tweak critical-edge breaking heuristic.

Sat Sep 14 14:53:18 PDT 2013

Managed to reduce the IR, attached to newly filed PR17245
(http://llvm.org/PR17245 ) for discussion and investigation.

Thanks!

On Tue, Sep 10, 2013 at 8:24 PM, Will Dietz <w at wdtz.org> wrote:
> Apologies for not better motivating this in my original post,
> I'd already spent more time on this than I probably should (dang
> interesting compilers!)
> and was hoping what I had would be sufficient.
>
> Since no one's responded yet I'll try to better explain the issue
> and why this is important.  Thoughts/comments welcome :).
>
> Here's a before-and-after stats and timings for the regression I'm observing:
>
> Before: https://pastee.org/3w6fa (28m47s)
> After: https://pastee.org/6hx3u (3m49s)
>
> The code in question is an instrumented copy of some image
> manipulation code from ImageMagick.
> I'd rather not post it as it was generated by an unpublished research
> project, but would be happy
> to provide it upon request to anyone interested in looking into this issue.
>
> That said, the BC probably isn't required for understanding why the
> existing code is problematic:
>
> The timings above show something certainly isn't working properly
> in terms of scalability (look at the MachineSink timings!)
> and I walk through the issue in more detail below.
> Regardless of my patch's details hopefully this will shed
> some light on the situation and help get this resolved.
>
> In short, the heuristic in isWorthBreakingCriticalEdge() is O(n) in the number
> of instructions within the function that define registers used by the
> instruction
> being queried.  This leads to unscalable behavior when the number of candidate
> cheap instructions for sinking is particularly high, potentially O(n^2) overall.
>
> Looking deeper, it seems the hasOneNonDBGUse() is expensive when the register
> has many definitions as it has to walk through all of the definitions
> before reaching a use.
> This happens with EFLAGS easily since many instructions touch it--
> on my problematic IR this results in walking over 300k operands for each
> call to hasOneNonDBGUse().
>
> From this alone I'd suggest the code be changed to avoid this
> unscalable behavior, especially since the function is already labelled
> as using questionable heuristics.  Note that simply avoiding calling
> hasOneNonDBGUse() on physical registers is an effective solution
> since virtual registers will never have more than one definition
> due to their SSA-like nature.  Furthermore, I argue that ignoring both
> uses and definitions physical registers is acceptable (see below).
>
> While investigating the scalability issue, it appears
> the loop does not do what it was originally intended to do,
> judging by the comment above it (reproduced below):
>
> -------------------
> // MI is cheap, we probably don't want to break the critical edge for it.
> // However, if this would allow some definitions of its source operands
> // to be sunk then it's probably worth it.
> -------------------
>
> Which doesn't seem to be what the loop presently does:
>
> * It considers registers /defined/ by this instruction.  While sinking
> an instruction that defines a register with a single use might
> be helpful for other reasons (reducing the live range),
> this isn't what the comment claims the purpose of the loop is.
> * Physical register uses aren't interesting because MachineSink
> refuses to sink instructions that define physical registers.  This means
> sinking the use of a physical register will not unblock its definition
> since the definition is incapable of being sunk.
> * It makes no attempt to determine if the definition instruction
> is suitable for sinking into the broken critical edge in the first place.
> I improve this somewhat by only recognizing an instruction as a "blocker"
> if the defining instruction is in the same MBB.  This is incomplete of course,
> but helps avoid spuriously breaking critical edges for instructions we know
> can't be sunk with this one.
>
> Hope this helps clarify and explain the situation.  Honestly the right answer
> might lie elsewhere (perhaps in a more thorough reworking of the pass),
> but unfortunately that's more than I have time for presently :(.
>
> Thoughts/comments welcome! :)
>
> ~Will
>
>
> On Sat, Sep 7, 2013 at 10:51 PM, Will Dietz <w at wdtz.org> wrote:
>> See attached, introduction is below and a description from commit
>> message is repeated at the end of this message for easier processing
>> and discussion.
>>
>> This fixes a serious compile-time performance issue for me and hopefully
>> causes LLVM to generate slightly nicer code.  Unfortunately I'm not in
>> a position
>> to do appropriate performance or platform testing of this (perhaps
>> minor) change, so please let me know how I can best proceed or if
>> that's something
>> to leave to nightly testing after the change is finished and passes review.
>>
>> Unfortunately, this breaks a few lit tests in ways that aren't
>> obviously problematic upon inspection, but it's unclear to me how to
>> properly resolve them.  The tests in question are:
>>
>>     LLVM :: CodeGen/ARM/2011-04-11-MachineLICMBug.ll
>>     LLVM :: CodeGen/ARM/2011-08-25-ldmia_ret.ll
>>     LLVM :: CodeGen/ARM/2012-08-30-select.ll
>>     LLVM :: CodeGen/X86/2012-11-30-handlemove-dbg.ll
>>     LLVM :: CodeGen/X86/fold-load.ll
>>     LLVM :: CodeGen/X86/hoist-common.ll
>>
>> Can someone more familiar with these help me work through them?
>>
>> A self-hosted copy of LLVM/Clang with this change passes its tests
>> (other than the ones above) and seems to build various applications
>> properly, so
>> I'm hopeful these just need to be updated to accommodate this change
>> and aren't signs of regressions or correctness issues.
>>
>> Please let me know if you have any questions/comments/feedback, and
>> thank you for your time.
>>
>> --------------------------------------------------------
>>
>> Per original comment, the intention of this loop
>> is to go ahead and break the critical edge
>> (in order to sink this instruction) if there's
>> reason to believe doing so might "unblock" the
>> sinking of additional instructions that define
>> registers used by this one.  The idea is that if
>> we have a few instructions to sink "together"
>> breaking the edge might be worthwhile.
>>
>> This commit makes a few small changes
>> to help better realize this goal:
>>
>> First, modify the loop to ignore registers
>> defined by this instruction.  We don't
>> sink definitions of physical registers,
>> and sinking an SSA definition isn't
>> going to unblock an upstream instruction.
>>
>> Second, ignore uses of physical registers.
>> Instructions that define physical registers are
>> rejected for sinking, and so moving this one
>> won't enable moving any defining instructions.
>> As an added bonus, while virtual register
>> use-def chains are generally small due
>> to SSA goodness, iteration over the uses
>> and definitions (used by hasOneNonDBGUse)
>> for physical registers like EFLAGS
>> can be rather expensive in practice.
>> (This is the original reason for looking at this)
>>
>> Finally, to keep things simple continue
>> to only consider this trick for registers that
>> have a single use (via hasOneNonDBGUse),
>> but to avoid spuriously breaking critical edges
>> only do so if the definition resides
>> in the same MBB and therefore this one directly
>> blocks it from being sunk as well.
>> If sinking them together is meant to be,
>> let the iterative nature of this pass
>> sink the definition into this block first.
>>
>> ~Will