[PATCH] MachineSink: Fix and tweak critical-edge breaking heuristic.

Tue Sep 17 13:49:59 PDT 2013

Thanks for your response, Andrew!

Glad I'm on the right track :).

Regarding the comment, let me try to explain again and if that seems agreeable
I'll update the patch so it's understandable:

The heuristic here is primarily trying to reason about whether this cheap
instruction blocks another instruction from being sunk as well.  We want
at least two instructions to merit splitting the critical edge.

To determine this, we check:

* The definition has only one use.

I admint to making use of this partially because of its use elsewhere
in this code,
but the primary reasoning is that it looks like a cheap property to check
to ensure the definition would still dominate all its uses if it was moved along
with this use.  If there's just the single use then moving them together
trivially satisfies this.  If there are more uses this becomes more complicated.

* Definition lives in same MBB

This is what the quoted comment is attempting to justify.  This relies
on the fact that
MachineSink only moves instructions into direct successors, iterating
to move instructions
further than that (and taking advantage of other opportunities as it
goes).  Since we're looking
for cases where moving the definition with the use is possible, they must reside
in the same block for that to be possible this iteration.

As an aside, if moving this use were to lead to MachineSink moving the
definition
into the new split-edge block in future iterations I believe it's
necessarily the case that
MachineSink in future iterations will also move the definition into *this* block
without us moving the use.  Using this property saves us from more expensive
reasoning wihile still handling some more complicated situations nicely
(and avoiding unnecessarily breaking the edge for cases where the
definition cannot be sunk into this block first).

This is somewhat weak as it becomes invalidated if/when MachineSink
is improved to move instructions farther than immediate successors.
However I think if that were to happen it'd be enough of a rewrite
that this heuristic would be unlikely to still be used at all.

-----------------------

Having just written all this up I see now that my comment is actually
just plain malformed, not necessarily that the ideas behind it
were misunderstood or needed explaining.  Sorry about that,
sending along anyway for sanity checking :).

Does the above make sense and sound reasonable?

I'll take a look at the tests...

(Have you seen the discussion on the bug report?)

Thanks!

~Will

On Tue, Sep 17, 2013 at 1:12 AM, Andrew Trick <atrick at apple.com> wrote:
>
> On Sep 14, 2013, at 2:53 PM, Will Dietz <w at wdtz.org> wrote:
>
>> Managed to reduce the IR, attached to newly filed PR17245
>> (http://llvm.org/PR17245 ) for discussion and investigation.
>
> This looks great. Thanks Will.
>
> I can't quite parse this comment.
>
> +      // reason about blocking it.  Other iterations
> +      // will look to sink it block and block until
> +      // it resides in this MBB at which point
> +      // this will evaluate to true.
>
> Regarding the test failures. I think you should determine whether:
> (1) MachineSinking is now doing the intended thing
> (2) We're not breaking any codegen patterns (downstream passes still do something reasonable).
> (3) The test is still testing something.
>
> If all of the above, then just fixup the CHECK lines.
>
> For example, in CodeGen/ARM/2011-04-11-MachineLICMBug.ll, you're now sinking multiple instructions into the loop's exit paths. I think that's intended behavior since we consider multiple instructions to be profitable for splitting. The final code looks better. And the test still checks that LICM doesn't generate extra copies. It looks like the CHECK labels could use some adjustment.
>
> -Andy
>
>>
>> On Tue, Sep 10, 2013 at 8:24 PM, Will Dietz <w at wdtz.org> wrote:
>>> Apologies for not better motivating this in my original post,
>>> I'd already spent more time on this than I probably should (dang
>>> interesting compilers!)
>>> and was hoping what I had would be sufficient.
>>>
>>> Since no one's responded yet I'll try to better explain the issue
>>> and why this is important.  Thoughts/comments welcome :).
>>>
>>> Here's a before-and-after stats and timings for the regression I'm observing:
>>>
>>> Before: https://pastee.org/3w6fa (28m47s)
>>> After: https://pastee.org/6hx3u (3m49s)
>>>
>>> The code in question is an instrumented copy of some image
>>> manipulation code from ImageMagick.
>>> I'd rather not post it as it was generated by an unpublished research
>>> project, but would be happy
>>> to provide it upon request to anyone interested in looking into this issue.
>>>
>>> That said, the BC probably isn't required for understanding why the
>>> existing code is problematic:
>>>
>>> The timings above show something certainly isn't working properly
>>> in terms of scalability (look at the MachineSink timings!)
>>> and I walk through the issue in more detail below.
>>> Regardless of my patch's details hopefully this will shed
>>> some light on the situation and help get this resolved.
>>>
>>> In short, the heuristic in isWorthBreakingCriticalEdge() is O(n) in the number
>>> of instructions within the function that define registers used by the
>>> instruction
>>> being queried.  This leads to unscalable behavior when the number of candidate
>>> cheap instructions for sinking is particularly high, potentially O(n^2) overall.
>>>
>>> Looking deeper, it seems the hasOneNonDBGUse() is expensive when the register
>>> has many definitions as it has to walk through all of the definitions
>>> before reaching a use.
>>> This happens with EFLAGS easily since many instructions touch it--
>>> on my problematic IR this results in walking over 300k operands for each
>>> call to hasOneNonDBGUse().
>>>
>>> From this alone I'd suggest the code be changed to avoid this
>>> unscalable behavior, especially since the function is already labelled
>>> as using questionable heuristics.  Note that simply avoiding calling
>>> hasOneNonDBGUse() on physical registers is an effective solution
>>> since virtual registers will never have more than one definition
>>> due to their SSA-like nature.  Furthermore, I argue that ignoring both
>>> uses and definitions physical registers is acceptable (see below).
>>>
>>> While investigating the scalability issue, it appears
>>> the loop does not do what it was originally intended to do,
>>> judging by the comment above it (reproduced below):
>>>
>>> -------------------
>>> // MI is cheap, we probably don't want to break the critical edge for it.
>>> // However, if this would allow some definitions of its source operands
>>> // to be sunk then it's probably worth it.
>>> -------------------
>>>
>>> Which doesn't seem to be what the loop presently does:
>>>
>>> * It considers registers /defined/ by this instruction.  While sinking
>>> an instruction that defines a register with a single use might
>>> be helpful for other reasons (reducing the live range),
>>> this isn't what the comment claims the purpose of the loop is.
>>> * Physical register uses aren't interesting because MachineSink
>>> refuses to sink instructions that define physical registers.  This means
>>> sinking the use of a physical register will not unblock its definition
>>> since the definition is incapable of being sunk.
>>> * It makes no attempt to determine if the definition instruction
>>> is suitable for sinking into the broken critical edge in the first place.
>>> I improve this somewhat by only recognizing an instruction as a "blocker"
>>> if the defining instruction is in the same MBB.  This is incomplete of course,
>>> but helps avoid spuriously breaking critical edges for instructions we know
>>> can't be sunk with this one.
>>>
>>> Hope this helps clarify and explain the situation.  Honestly the right answer
>>> might lie elsewhere (perhaps in a more thorough reworking of the pass),
>>> but unfortunately that's more than I have time for presently :(.
>>>
>>> Thoughts/comments welcome! :)
>>>
>>> ~Will
>>>
>>>
>>> On Sat, Sep 7, 2013 at 10:51 PM, Will Dietz <w at wdtz.org> wrote:
>>>> See attached, introduction is below and a description from commit
>>>> message is repeated at the end of this message for easier processing
>>>> and discussion.
>>>>
>>>> This fixes a serious compile-time performance issue for me and hopefully
>>>> causes LLVM to generate slightly nicer code.  Unfortunately I'm not in
>>>> a position
>>>> to do appropriate performance or platform testing of this (perhaps
>>>> minor) change, so please let me know how I can best proceed or if
>>>> that's something
>>>> to leave to nightly testing after the change is finished and passes review.
>>>>
>>>> Unfortunately, this breaks a few lit tests in ways that aren't
>>>> obviously problematic upon inspection, but it's unclear to me how to
>>>> properly resolve them.  The tests in question are:
>>>>
>>>>    LLVM :: CodeGen/ARM/2011-04-11-MachineLICMBug.ll
>>>>    LLVM :: CodeGen/ARM/2011-08-25-ldmia_ret.ll
>>>>    LLVM :: CodeGen/ARM/2012-08-30-select.ll
>>>>    LLVM :: CodeGen/X86/2012-11-30-handlemove-dbg.ll
>>>>    LLVM :: CodeGen/X86/fold-load.ll
>>>>    LLVM :: CodeGen/X86/hoist-common.ll
>>>>
>>>> Can someone more familiar with these help me work through them?
>>>>
>>>> A self-hosted copy of LLVM/Clang with this change passes its tests
>>>> (other than the ones above) and seems to build various applications
>>>> properly, so
>>>> I'm hopeful these just need to be updated to accommodate this change
>>>> and aren't signs of regressions or correctness issues.
>>>>
>>>> Please let me know if you have any questions/comments/feedback, and
>>>> thank you for your time.
>>>>
>>>> --------------------------------------------------------
>>>>
>>>> Per original comment, the intention of this loop
>>>> is to go ahead and break the critical edge
>>>> (in order to sink this instruction) if there's
>>>> reason to believe doing so might "unblock" the
>>>> sinking of additional instructions that define
>>>> registers used by this one.  The idea is that if
>>>> we have a few instructions to sink "together"
>>>> breaking the edge might be worthwhile.
>>>>
>>>> This commit makes a few small changes
>>>> to help better realize this goal:
>>>>
>>>> First, modify the loop to ignore registers
>>>> defined by this instruction.  We don't
>>>> sink definitions of physical registers,
>>>> and sinking an SSA definition isn't
>>>> going to unblock an upstream instruction.
>>>>
>>>> Second, ignore uses of physical registers.
>>>> Instructions that define physical registers are
>>>> rejected for sinking, and so moving this one
>>>> won't enable moving any defining instructions.
>>>> As an added bonus, while virtual register
>>>> use-def chains are generally small due
>>>> to SSA goodness, iteration over the uses
>>>> and definitions (used by hasOneNonDBGUse)
>>>> for physical registers like EFLAGS
>>>> can be rather expensive in practice.
>>>> (This is the original reason for looking at this)
>>>>
>>>> Finally, to keep things simple continue
>>>> to only consider this trick for registers that
>>>> have a single use (via hasOneNonDBGUse),
>>>> but to avoid spuriously breaking critical edges
>>>> only do so if the definition resides
>>>> in the same MBB and therefore this one directly
>>>> blocks it from being sunk as well.
>>>> If sinking them together is meant to be,
>>>> let the iterative nature of this pass
>>>> sink the definition into this block first.
>>>>
>>>> ~Will
>> _______________________________________________
>> llvm-commits mailing list
>> llvm-commits at cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>