[PATCH] MachineSink: Fix and tweak critical-edge breaking heuristic.

Tue Sep 10 18:24:38 PDT 2013

Apologies for not better motivating this in my original post,
I'd already spent more time on this than I probably should (dang
interesting compilers!)
and was hoping what I had would be sufficient.

Since no one's responded yet I'll try to better explain the issue
and why this is important.  Thoughts/comments welcome :).

Here's a before-and-after stats and timings for the regression I'm observing:

Before: https://pastee.org/3w6fa (28m47s)
After: https://pastee.org/6hx3u (3m49s)

The code in question is an instrumented copy of some image
manipulation code from ImageMagick.
I'd rather not post it as it was generated by an unpublished research
project, but would be happy
to provide it upon request to anyone interested in looking into this issue.

That said, the BC probably isn't required for understanding why the
existing code is problematic:

The timings above show something certainly isn't working properly
in terms of scalability (look at the MachineSink timings!)
and I walk through the issue in more detail below.
Regardless of my patch's details hopefully this will shed
some light on the situation and help get this resolved.

In short, the heuristic in isWorthBreakingCriticalEdge() is O(n) in the number
of instructions within the function that define registers used by the
instruction
being queried.  This leads to unscalable behavior when the number of candidate
cheap instructions for sinking is particularly high, potentially O(n^2) overall.

Looking deeper, it seems the hasOneNonDBGUse() is expensive when the register
has many definitions as it has to walk through all of the definitions
before reaching a use.
This happens with EFLAGS easily since many instructions touch it--
on my problematic IR this results in walking over 300k operands for each
call to hasOneNonDBGUse().

>From this alone I'd suggest the code be changed to avoid this
unscalable behavior, especially since the function is already labelled
as using questionable heuristics.  Note that simply avoiding calling
hasOneNonDBGUse() on physical registers is an effective solution
since virtual registers will never have more than one definition
due to their SSA-like nature.  Furthermore, I argue that ignoring both
uses and definitions physical registers is acceptable (see below).

While investigating the scalability issue, it appears
the loop does not do what it was originally intended to do,
judging by the comment above it (reproduced below):

-------------------
// MI is cheap, we probably don't want to break the critical edge for it.
// However, if this would allow some definitions of its source operands
// to be sunk then it's probably worth it.
-------------------

Which doesn't seem to be what the loop presently does:

* It considers registers /defined/ by this instruction.  While sinking
an instruction that defines a register with a single use might
be helpful for other reasons (reducing the live range),
this isn't what the comment claims the purpose of the loop is.
* Physical register uses aren't interesting because MachineSink
refuses to sink instructions that define physical registers.  This means
sinking the use of a physical register will not unblock its definition
since the definition is incapable of being sunk.
* It makes no attempt to determine if the definition instruction
is suitable for sinking into the broken critical edge in the first place.
I improve this somewhat by only recognizing an instruction as a "blocker"
if the defining instruction is in the same MBB.  This is incomplete of course,
but helps avoid spuriously breaking critical edges for instructions we know
can't be sunk with this one.

Hope this helps clarify and explain the situation.  Honestly the right answer
might lie elsewhere (perhaps in a more thorough reworking of the pass),
but unfortunately that's more than I have time for presently :(.

Thoughts/comments welcome! :)

~Will

On Sat, Sep 7, 2013 at 10:51 PM, Will Dietz <w at wdtz.org> wrote:
> See attached, introduction is below and a description from commit
> message is repeated at the end of this message for easier processing
> and discussion.
>
> This fixes a serious compile-time performance issue for me and hopefully
> causes LLVM to generate slightly nicer code.  Unfortunately I'm not in
> a position
> to do appropriate performance or platform testing of this (perhaps
> minor) change, so please let me know how I can best proceed or if
> that's something
> to leave to nightly testing after the change is finished and passes review.
>
> Unfortunately, this breaks a few lit tests in ways that aren't
> obviously problematic upon inspection, but it's unclear to me how to
> properly resolve them.  The tests in question are:
>
>     LLVM :: CodeGen/ARM/2011-04-11-MachineLICMBug.ll
>     LLVM :: CodeGen/ARM/2011-08-25-ldmia_ret.ll
>     LLVM :: CodeGen/ARM/2012-08-30-select.ll
>     LLVM :: CodeGen/X86/2012-11-30-handlemove-dbg.ll
>     LLVM :: CodeGen/X86/fold-load.ll
>     LLVM :: CodeGen/X86/hoist-common.ll
>
> Can someone more familiar with these help me work through them?
>
> A self-hosted copy of LLVM/Clang with this change passes its tests
> (other than the ones above) and seems to build various applications
> properly, so
> I'm hopeful these just need to be updated to accommodate this change
> and aren't signs of regressions or correctness issues.
>
> Please let me know if you have any questions/comments/feedback, and
> thank you for your time.
>
> --------------------------------------------------------
>
> Per original comment, the intention of this loop
> is to go ahead and break the critical edge
> (in order to sink this instruction) if there's
> reason to believe doing so might "unblock" the
> sinking of additional instructions that define
> registers used by this one.  The idea is that if
> we have a few instructions to sink "together"
> breaking the edge might be worthwhile.
>
> This commit makes a few small changes
> to help better realize this goal:
>
> First, modify the loop to ignore registers
> defined by this instruction.  We don't
> sink definitions of physical registers,
> and sinking an SSA definition isn't
> going to unblock an upstream instruction.
>
> Second, ignore uses of physical registers.
> Instructions that define physical registers are
> rejected for sinking, and so moving this one
> won't enable moving any defining instructions.
> As an added bonus, while virtual register
> use-def chains are generally small due
> to SSA goodness, iteration over the uses
> and definitions (used by hasOneNonDBGUse)
> for physical registers like EFLAGS
> can be rather expensive in practice.
> (This is the original reason for looking at this)
>
> Finally, to keep things simple continue
> to only consider this trick for registers that
> have a single use (via hasOneNonDBGUse),
> but to avoid spuriously breaking critical edges
> only do so if the definition resides
> in the same MBB and therefore this one directly
> blocks it from being sunk as well.
> If sinking them together is meant to be,
> let the iterative nature of this pass
> sink the definition into this block first.
>
> ~Will