[PATCH] MachineSink: Fix and tweak critical-edge breaking heuristic.

Fri Oct 11 22:16:09 PDT 2013

On Oct 11, 2013, at 5:49 PM, Will Dietz <w at wdtz.org> wrote:

> Updated patch attached.
> 
> Fixed the select test (although still not sure what it's supposed to
> be testing exactly)
> and XFAIL'ing the hoist-common test since it's broken as-is regardless
> of this change.
> 
> Also, added a functionality test where the new heuristic results in nicer code.
> 
> FWIW it seems other changes in mainline have hidden the performance
> issue encountered in
> the original bug report[1] but pushing for this change anyway since it seems
> better for the reasons mentioned in this discussion and the commit message
> (and it's possible the underlying issue will crop up later if not addressed).
> 
> Thank you for your time!
> 
> ~Will
> 
> [1] http://llvm.org/PR17245

Thanks for pursuing this. Sorry I didn’t respond last time. This looks fine to me. I think you’ve done as much as can be expected to keep the unit tests functioning.

With your patch, I see a failure in misched-balance.ll @encpc1 (strange that you didn’t see it). You are correctly not sinking as much code, but we have extra register pressure. I think you add a couple CHECK lines to allow for a single callee-save spill but no more:

+; CHECK: push
; CHECK-NOT: push
+; CHECK: pop
; CHECK-NOT: pop

Thanks!

-Andy

> On Tue, Sep 17, 2013 at 5:14 PM, Will Dietz <w at wdtz.org> wrote:
>> (Renato, adding you to cc per discussion on the bugzilla)
>> 
>> Thank you both for helping moving this forward.
>> 
>> Attached is an updated patch that
>> 
>> a) Fixes comment text (sorry!)
>> b) Resolves most of the lit tests.
>> 
>> The remaining two tests that fail are:
>> 
>> Failing Tests (2):
>>    LLVM :: CodeGen/ARM/2012-08-30-select.ll
>>    LLVM :: CodeGen/X86/hoist-common.ll
>> 
>> Explanation follow:
>> 
>> [CodeGen/ARM/2012-08-30-select.ll]
>> It's unclear what this test is attempting to test, so I'm not sure if
>> it's broken or not :/.
>> Additionally it seems unclear which assembly is more desirable for
>> LLVM to produce in this case.
>> Would pinging the test's author be appropriate here?
>> 
>> 
>> [CodeGen/X86/hoist-common.ll]
>> The more I look into this one the more it seems that the originally
>> tested functionality hasn't been
>> working as expected for a while.  For reference, here's the commit
>> introducing this test (commit message is relevant):
>> 
>>  http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20110509/120636.html
>> 
>> The assembly produced by ToT on my machine is as follows:
>> 
>> hoist-common-before-patch:
>> ---------------------------
>>        .section        __TEXT,__text,regular,pure_instructions
>>        .globl  _t
>>        .align  4, 0x90
>> _t:                                     ## @t
>> ## BB#0:                                ## %entry
>>        pushq   %rax
>>        xorl    %eax, %eax
>>        testl   %edi, %edi
>>        je      LBB0_2
>> ## BB#1:                                ## %if.then
>>        xorl    %eax, %eax
>>        callq   _foo
>>                                        ## kill: AL<def> AL<kill> EAX<def>
>> LBB0_2:                                 ## %return
>>                                        ## kill: AL<def> AL<kill> EAX<kill>
>>        popq    %rdx
>>        ret
>> ---------------------------
>> 
>> which is leaving the redundant 'xorl' before the call to foo.
>> 
>> Adding some debug prints to BranchFolding suggests it's not triggering
>> at all on ToT for this code,
>> but with my patch (which produces the assembly given below) it
>> considers hoisting the two
>> MOV32r0's but punts after concluding the two are not identical.  FWIW,
>> at the MI level, they look like this:
>> 
>>  %EAX<def,dead> = MOV32r0 %EFLAGS<imp-def,dead>, %AL<imp-def>
>> 
>> and
>> 
>>  %EAX<def> = MOV32r0 %EFLAGS<imp-def,dead>
>> 
>> As promised, here's the assembly with my patch:
>> 
>> hoist-common-after-patch:
>> ---------------------------
>>        .section        __TEXT,__text,regular,pure_instructions
>>        .globl  _t
>>        .align  4, 0x90
>> _t:                                     ## @t
>> ## BB#0:                                ## %entry
>>        pushq   %rax
>>        testl   %edi, %edi
>>        je      LBB0_1
>> ## BB#2:                                ## %if.then
>>        xorl    %eax, %eax
>>        callq   _foo
>>                                        ## kill: AL<def> AL<kill> EAX<def>
>>        jmp     LBB0_3
>> LBB0_1:
>>        xorl    %eax, %eax
>> LBB0_3:                                 ## %return
>>                                        ## kill: AL<def> AL<kill> EAX<kill>
>>        popq    %rdx
>>        ret
>> ---------------------------
>> 
>> In short, something is wrong but it doesn't seem to be due to
>> MachineSink's behavior.
>> 
>> Thoughts/comments on how to best proceed would be appreciated! :)
>> 
>> Thank you for your time,
>> 
>> ~Will
>> 
>> PS I lurk on #llvm as dtzWill if you're on IRC, that might help get
>> this resolved more quickly :).
>> 
>> On Tue, Sep 17, 2013 at 3:49 PM, Will Dietz <w at wdtz.org> wrote:
>>> Thanks for your response, Andrew!
>>> 
>>> Glad I'm on the right track :).
>>> 
>>> Regarding the comment, let me try to explain again and if that seems agreeable
>>> I'll update the patch so it's understandable:
>>> 
>>> The heuristic here is primarily trying to reason about whether this cheap
>>> instruction blocks another instruction from being sunk as well.  We want
>>> at least two instructions to merit splitting the critical edge.
>>> 
>>> To determine this, we check:
>>> 
>>> * The definition has only one use.
>>> 
>>> I admint to making use of this partially because of its use elsewhere
>>> in this code,
>>> but the primary reasoning is that it looks like a cheap property to check
>>> to ensure the definition would still dominate all its uses if it was moved along
>>> with this use.  If there's just the single use then moving them together
>>> trivially satisfies this.  If there are more uses this becomes more complicated.
>>> 
>>> * Definition lives in same MBB
>>> 
>>> This is what the quoted comment is attempting to justify.  This relies
>>> on the fact that
>>> MachineSink only moves instructions into direct successors, iterating
>>> to move instructions
>>> further than that (and taking advantage of other opportunities as it
>>> goes).  Since we're looking
>>> for cases where moving the definition with the use is possible, they must reside
>>> in the same block for that to be possible this iteration.
>>> 
>>> As an aside, if moving this use were to lead to MachineSink moving the
>>> definition
>>> into the new split-edge block in future iterations I believe it's
>>> necessarily the case that
>>> MachineSink in future iterations will also move the definition into *this* block
>>> without us moving the use.  Using this property saves us from more expensive
>>> reasoning wihile still handling some more complicated situations nicely
>>> (and avoiding unnecessarily breaking the edge for cases where the
>>> definition cannot be sunk into this block first).
>>> 
>>> This is somewhat weak as it becomes invalidated if/when MachineSink
>>> is improved to move instructions farther than immediate successors.
>>> However I think if that were to happen it'd be enough of a rewrite
>>> that this heuristic would be unlikely to still be used at all.
>>> 
>>> -----------------------
>>> 
>>> Having just written all this up I see now that my comment is actually
>>> just plain malformed, not necessarily that the ideas behind it
>>> were misunderstood or needed explaining.  Sorry about that,
>>> sending along anyway for sanity checking :).
>>> 
>>> Does the above make sense and sound reasonable?
>>> 
>>> I'll take a look at the tests...
>>> 
>>> (Have you seen the discussion on the bug report?)
>>> 
>>> Thanks!
>>> 
>>> ~Will
>>> 
>>> 
>>> On Tue, Sep 17, 2013 at 1:12 AM, Andrew Trick <atrick at apple.com> wrote:
>>>> 
>>>> On Sep 14, 2013, at 2:53 PM, Will Dietz <w at wdtz.org> wrote:
>>>> 
>>>>> Managed to reduce the IR, attached to newly filed PR17245
>>>>> (http://llvm.org/PR17245 ) for discussion and investigation.
>>>> 
>>>> This looks great. Thanks Will.
>>>> 
>>>> I can't quite parse this comment.
>>>> 
>>>> +      // reason about blocking it.  Other iterations
>>>> +      // will look to sink it block and block until
>>>> +      // it resides in this MBB at which point
>>>> +      // this will evaluate to true.
>>>> 
>>>> Regarding the test failures. I think you should determine whether:
>>>> (1) MachineSinking is now doing the intended thing
>>>> (2) We're not breaking any codegen patterns (downstream passes still do something reasonable).
>>>> (3) The test is still testing something.
>>>> 
>>>> If all of the above, then just fixup the CHECK lines.
>>>> 
>>>> For example, in CodeGen/ARM/2011-04-11-MachineLICMBug.ll, you're now sinking multiple instructions into the loop's exit paths. I think that's intended behavior since we consider multiple instructions to be profitable for splitting. The final code looks better. And the test still checks that LICM doesn't generate extra copies. It looks like the CHECK labels could use some adjustment.
>>>> 
>>>> -Andy
>>>> 
>>>>> 
>>>>> On Tue, Sep 10, 2013 at 8:24 PM, Will Dietz <w at wdtz.org> wrote:
>>>>>> Apologies for not better motivating this in my original post,
>>>>>> I'd already spent more time on this than I probably should (dang
>>>>>> interesting compilers!)
>>>>>> and was hoping what I had would be sufficient.
>>>>>> 
>>>>>> Since no one's responded yet I'll try to better explain the issue
>>>>>> and why this is important.  Thoughts/comments welcome :).
>>>>>> 
>>>>>> Here's a before-and-after stats and timings for the regression I'm observing:
>>>>>> 
>>>>>> Before: https://pastee.org/3w6fa (28m47s)
>>>>>> After: https://pastee.org/6hx3u (3m49s)
>>>>>> 
>>>>>> The code in question is an instrumented copy of some image
>>>>>> manipulation code from ImageMagick.
>>>>>> I'd rather not post it as it was generated by an unpublished research
>>>>>> project, but would be happy
>>>>>> to provide it upon request to anyone interested in looking into this issue.
>>>>>> 
>>>>>> That said, the BC probably isn't required for understanding why the
>>>>>> existing code is problematic:
>>>>>> 
>>>>>> The timings above show something certainly isn't working properly
>>>>>> in terms of scalability (look at the MachineSink timings!)
>>>>>> and I walk through the issue in more detail below.
>>>>>> Regardless of my patch's details hopefully this will shed
>>>>>> some light on the situation and help get this resolved.
>>>>>> 
>>>>>> In short, the heuristic in isWorthBreakingCriticalEdge() is O(n) in the number
>>>>>> of instructions within the function that define registers used by the
>>>>>> instruction
>>>>>> being queried.  This leads to unscalable behavior when the number of candidate
>>>>>> cheap instructions for sinking is particularly high, potentially O(n^2) overall.
>>>>>> 
>>>>>> Looking deeper, it seems the hasOneNonDBGUse() is expensive when the register
>>>>>> has many definitions as it has to walk through all of the definitions
>>>>>> before reaching a use.
>>>>>> This happens with EFLAGS easily since many instructions touch it--
>>>>>> on my problematic IR this results in walking over 300k operands for each
>>>>>> call to hasOneNonDBGUse().
>>>>>> 
>>>>>> From this alone I'd suggest the code be changed to avoid this
>>>>>> unscalable behavior, especially since the function is already labelled
>>>>>> as using questionable heuristics.  Note that simply avoiding calling
>>>>>> hasOneNonDBGUse() on physical registers is an effective solution
>>>>>> since virtual registers will never have more than one definition
>>>>>> due to their SSA-like nature.  Furthermore, I argue that ignoring both
>>>>>> uses and definitions physical registers is acceptable (see below).
>>>>>> 
>>>>>> While investigating the scalability issue, it appears
>>>>>> the loop does not do what it was originally intended to do,
>>>>>> judging by the comment above it (reproduced below):
>>>>>> 
>>>>>> -------------------
>>>>>> // MI is cheap, we probably don't want to break the critical edge for it.
>>>>>> // However, if this would allow some definitions of its source operands
>>>>>> // to be sunk then it's probably worth it.
>>>>>> -------------------
>>>>>> 
>>>>>> Which doesn't seem to be what the loop presently does:
>>>>>> 
>>>>>> * It considers registers /defined/ by this instruction.  While sinking
>>>>>> an instruction that defines a register with a single use might
>>>>>> be helpful for other reasons (reducing the live range),
>>>>>> this isn't what the comment claims the purpose of the loop is.
>>>>>> * Physical register uses aren't interesting because MachineSink
>>>>>> refuses to sink instructions that define physical registers.  This means
>>>>>> sinking the use of a physical register will not unblock its definition
>>>>>> since the definition is incapable of being sunk.
>>>>>> * It makes no attempt to determine if the definition instruction
>>>>>> is suitable for sinking into the broken critical edge in the first place.
>>>>>> I improve this somewhat by only recognizing an instruction as a "blocker"
>>>>>> if the defining instruction is in the same MBB.  This is incomplete of course,
>>>>>> but helps avoid spuriously breaking critical edges for instructions we know
>>>>>> can't be sunk with this one.
>>>>>> 
>>>>>> Hope this helps clarify and explain the situation.  Honestly the right answer
>>>>>> might lie elsewhere (perhaps in a more thorough reworking of the pass),
>>>>>> but unfortunately that's more than I have time for presently :(.
>>>>>> 
>>>>>> Thoughts/comments welcome! :)
>>>>>> 
>>>>>> ~Will
>>>>>> 
>>>>>> 
>>>>>> On Sat, Sep 7, 2013 at 10:51 PM, Will Dietz <w at wdtz.org> wrote:
>>>>>>> See attached, introduction is below and a description from commit
>>>>>>> message is repeated at the end of this message for easier processing
>>>>>>> and discussion.
>>>>>>> 
>>>>>>> This fixes a serious compile-time performance issue for me and hopefully
>>>>>>> causes LLVM to generate slightly nicer code.  Unfortunately I'm not in
>>>>>>> a position
>>>>>>> to do appropriate performance or platform testing of this (perhaps
>>>>>>> minor) change, so please let me know how I can best proceed or if
>>>>>>> that's something
>>>>>>> to leave to nightly testing after the change is finished and passes review.
>>>>>>> 
>>>>>>> Unfortunately, this breaks a few lit tests in ways that aren't
>>>>>>> obviously problematic upon inspection, but it's unclear to me how to
>>>>>>> properly resolve them.  The tests in question are:
>>>>>>> 
>>>>>>>   LLVM :: CodeGen/ARM/2011-04-11-MachineLICMBug.ll
>>>>>>>   LLVM :: CodeGen/ARM/2011-08-25-ldmia_ret.ll
>>>>>>>   LLVM :: CodeGen/ARM/2012-08-30-select.ll
>>>>>>>   LLVM :: CodeGen/X86/2012-11-30-handlemove-dbg.ll
>>>>>>>   LLVM :: CodeGen/X86/fold-load.ll
>>>>>>>   LLVM :: CodeGen/X86/hoist-common.ll
>>>>>>> 
>>>>>>> Can someone more familiar with these help me work through them?
>>>>>>> 
>>>>>>> A self-hosted copy of LLVM/Clang with this change passes its tests
>>>>>>> (other than the ones above) and seems to build various applications
>>>>>>> properly, so
>>>>>>> I'm hopeful these just need to be updated to accommodate this change
>>>>>>> and aren't signs of regressions or correctness issues.
>>>>>>> 
>>>>>>> Please let me know if you have any questions/comments/feedback, and
>>>>>>> thank you for your time.
>>>>>>> 
>>>>>>> --------------------------------------------------------
>>>>>>> 
>>>>>>> Per original comment, the intention of this loop
>>>>>>> is to go ahead and break the critical edge
>>>>>>> (in order to sink this instruction) if there's
>>>>>>> reason to believe doing so might "unblock" the
>>>>>>> sinking of additional instructions that define
>>>>>>> registers used by this one.  The idea is that if
>>>>>>> we have a few instructions to sink "together"
>>>>>>> breaking the edge might be worthwhile.
>>>>>>> 
>>>>>>> This commit makes a few small changes
>>>>>>> to help better realize this goal:
>>>>>>> 
>>>>>>> First, modify the loop to ignore registers
>>>>>>> defined by this instruction.  We don't
>>>>>>> sink definitions of physical registers,
>>>>>>> and sinking an SSA definition isn't
>>>>>>> going to unblock an upstream instruction.
>>>>>>> 
>>>>>>> Second, ignore uses of physical registers.
>>>>>>> Instructions that define physical registers are
>>>>>>> rejected for sinking, and so moving this one
>>>>>>> won't enable moving any defining instructions.
>>>>>>> As an added bonus, while virtual register
>>>>>>> use-def chains are generally small due
>>>>>>> to SSA goodness, iteration over the uses
>>>>>>> and definitions (used by hasOneNonDBGUse)
>>>>>>> for physical registers like EFLAGS
>>>>>>> can be rather expensive in practice.
>>>>>>> (This is the original reason for looking at this)
>>>>>>> 
>>>>>>> Finally, to keep things simple continue
>>>>>>> to only consider this trick for registers that
>>>>>>> have a single use (via hasOneNonDBGUse),
>>>>>>> but to avoid spuriously breaking critical edges
>>>>>>> only do so if the definition resides
>>>>>>> in the same MBB and therefore this one directly
>>>>>>> blocks it from being sunk as well.
>>>>>>> If sinking them together is meant to be,
>>>>>>> let the iterative nature of this pass
>>>>>>> sink the definition into this block first.
>>>>>>> 
>>>>>>> ~Will
>>>>> _______________________________________________
>>>>> llvm-commits mailing list
>>>>> llvm-commits at cs.uiuc.edu
>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>> 
> <0001-MachineSink-Fix-and-tweak-critical-edge-breaking-heu.patch>