[PATCH] D19950: Use frequency info to guide Loop Invariant Code Motion.

Sun Jul 31 07:12:50 PDT 2016

I tried to improve the MachineSinking pass with Loop based sinking support.
But as we need to track all alias info for all BBs inside a loop.
Unfortunately AliasSetTracker does not support Machine level IR (I guess
that is why the original comment says it's not intended to replace
LLVM-IR-level sinking pass.) So I went ahead with Hal's original proposal:
adding an IR-level sinking pass before CGP.

Dehao

On Wed, May 25, 2016 at 11:42 AM, Hal Finkel <hfinkel at anl.gov> wrote:

>
> ------------------------------
>
> *From: *"Dehao Chen" <danielcdh at gmail.com>
> *To: *"Xinliang David Li" <davidxl at google.com>
> *Cc: *"Hal Finkel" <hfinkel at anl.gov>,
> reviews+D19950+public+38ba22078c2035b8 at reviews.llvm.org, "David Majnemer"
> <david.majnemer at gmail.com>, "Junbum Lim" <junbuml at codeaurora.org>,
> mcrosier at codeaurora.org, "llvm-commits" <llvm-commits at lists.llvm.org>,
> "amara emerson" <amara.emerson at arm.com>
> *Sent: *Tuesday, May 17, 2016 1:28:45 PM
> *Subject: *Re: [PATCH] D19950: Use frequency info to guide Loop Invariant
> Code Motion.
>
> I was trying to add a IR level sinking pass for this. But for my simple
> testcase:
>
> ; int g;
> ; int foo(int p, int x) {
> ;   for (int i = 0; i != x; i++)
> ;     if (__builtin_expect(i == p, 0)) {
> ;       x += g; x *= g;
> ;     }
> ;   return x;
> ; }
>
> After LICM hoists the load of global variable g, "x+=g;x*=g" is also
> hoisted to form a select statement in header:
>
> .lr.ph:                                           ; preds = %.lr.ph,
> %.lr.ph.preheader
>   %.03 = phi i32 [ %8, %.lr.ph ], [ 0, %.lr.ph.preheader ]
>   %.012 = phi i32 [ %.1, %.lr.ph ], [ %1, %.lr.ph.preheader ]
>   %5 = icmp eq i32 %.03, %0
> *  %6 = add nsw i32 %4, %.012*
> *  %7 = mul nsw i32 %6, %4*
> *  %.1 = select i1 %5, i32 %7, i32 %.012, !prof !1*
>   %8 = add nuw nsw i32 %.03, 1
>   %9 = icmp eq i32 %8, %.1
>   br i1 %9, label %._crit_edge, label %.lr.ph
>
> Later in CGP, it's expanded to:
>
> .lr.ph:                                           ; preds = %select.end,
> %.lr.ph.preheader
>   %.03 = phi i32 [ %8, %select.end ], [ 0, %.lr.ph.preheader ]
>   %.012 = phi i32 [ %.1, %select.end ], [ %1, %.lr.ph.preheader ]
>   %5 = icmp eq i32 %0, %.03
> *  %6 = add nsw i32 %.012, %4*
> *  %7 = mul nsw i32 %6, %4*
>   br i1 %5, label %select.end, label %select.false
>
> select.false:                                     ; preds = %.lr.ph
>   br label %select.end
>
> select.end:                                       ; preds = %.lr.ph,
> %select.false
> *  %.1 = phi i32 [ %7, %.lr.ph <http://lr.ph> ], [ %.012, %select.false ]*
>   %8 = add nuw nsw i32 %.03, 1
>   %9 = icmp eq i32 %8, %.1
>   br i1 %9, label %._crit_edge, label %.lr.ph
>
> So if we want to sink the load from preheader to the branch, we need to:
> 1. create a select.true basic block
> 2. sink %6 and %7 to that basic block
> 3. sink load instruction to that basic block
>
> This is doable, but seems doing a lot of redundant work of MachineSinking
> pass. And it could possibly affect optimizations between CGP and
> MachineSinking.
>
> Alternatively, we can also do the sinking once in MachineSinking, and make
> it handle more general cases like this one. But from the MachineSink.cpp, I
> saw the comment:
>
> // This pass is not intended to be a replacement or a complete alternative
> // for an LLVM-IR-level sinking pass. It is only designed to sink simple
> // constructs that are not exposed before lowering and instruction
> selection.
>
> Which confuses me: looks like we don't even have a IR-level sinking pass.
> Why would we want IR-level sinking pass instead of one single sinking pass
> at machine level?
>
> I think that removing this comment and improving the MachineSinking pass
> is certainly reasonable.
>
>  -Hal
>
>
> Thanks,
> Dehao
>
>
>
> On Tue, May 17, 2016 at 11:03 AM, Xinliang David Li <davidxl at google.com>
> wrote:
>
>>
>>
>> On Thu, May 12, 2016 at 6:37 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>
>>>
>>> ------------------------------
>>>
>>> *From: *"Xinliang David Li" <davidxl at google.com>
>>> *To: *"Hal Finkel" <hfinkel at anl.gov>
>>> *Cc: *"Dehao Chen" <danielcdh at gmail.com>,
>>> reviews+D19950+public+38ba22078c2035b8 at reviews.llvm.org, "David
>>> Majnemer" <david.majnemer at gmail.com>, "Junbum Lim" <
>>> junbuml at codeaurora.org>, mcrosier at codeaurora.org, "llvm-commits" <
>>> llvm-commits at lists.llvm.org>, "amara emerson" <amara.emerson at arm.com>
>>> *Sent: *Wednesday, May 11, 2016 12:58:24 PM
>>> *Subject: *Re: [PATCH] D19950: Use frequency info to guide Loop
>>> Invariant Code Motion.
>>>
>>>
>>>
>>> On Tue, May 10, 2016 at 3:14 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>>
>>>> ------------------------------
>>>>
>>>> > From: "Dehao Chen" <danielcdh at gmail.com>
>>>> > To: "Hal Finkel" <hfinkel at anl.gov>
>>>> > Cc: "Xinliang David Li" <davidxl at google.com>,
>>>> reviews+D19950+public+38ba22078c2035b8 at reviews.llvm.org, "David
>>>> > Majnemer" <david.majnemer at gmail.com>, "Junbum Lim" <
>>>> junbuml at codeaurora.org>, mcrosier at codeaurora.org, "llvm-commits"
>>>> > <llvm-commits at lists.llvm.org>, "amara emerson" <amara.emerson at arm.com
>>>> >
>>>> > Sent: Tuesday, May 10, 2016 4:10:49 PM
>>>> > Subject: Re: [PATCH] D19950: Use frequency info to guide Loop
>>>> Invariant Code Motion.
>>>> >
>>>> >
>>>> > Thanks for the comment. I spent quite a while to think, but still
>>>> > cannot think of an optimization that could be unblocked by
>>>> > speculatively hoisting an loop invariant from an unlikely executed
>>>> > path. Can you give some hint (or an example) on what type of
>>>> > optimization can benefit from this case?
>>>>
>>>> I'm specifically thinking about this case (although I suspect there are
>>>> others):
>>>>
>>>>  for (...) {
>>>>    if (...) {
>>>>      hoistable
>>>>      cold_stuff
>>>>    }
>>>>  }
>>>>
>>>>  for (...) {
>>>>    if (...) {
>>>>      hoistable
>>>>      hot_stuff
>>>>    }
>>>>  }
>>>>
>>>> I expect that 'hoistable' will be hoisted by LICM out of both loops,
>>>> and then CSE'd by GVN.
>>>
>>>
>>>
>>> I think this case can be/should be handled by more general profile
>>> driver speculative PRE.  The above case may not be profitable even after
>>> GVN CSEed two expressions.
>>>
>>> Why not? It is performance neutral and a code-size improvement.
>>>
>>
>>
>> It may not be performance neutral for the above case -- it depends on how
>> cold the if-block is.
>>
>>
>>
>>> .
>>>
>>> Loop unswitching could certainly be smarter about how it estimates the
>>> cost of the loop body, but if there are nested conditions, I don't think
>>> that always helps. For example, if we have this:
>>>
>>> for (...) {
>>>   if (c1) {
>>>     ...
>>>     if (c2) {
>>>       hoistable
>>>       cold_stuff
>>>     }
>>>     ...
>>>   }
>>>
>>>   ...
>>> }
>>>
>>> then the code-size cost of unswitching on condition c1 is rightly
>>> impacted by whether the hoistable code is still in the loop body.
>>>
>>>
>> Ok.
>>
>>
>>
>>>
>>>
>>>>
>>>> Another potential issue is that the hoistable code might be cold, and
>>>> relatively cheap to hoist, but expensive to vectorize. As a result, failing
>>>> to hoist the code might block otherwise-profitable vectorization. Which
>>>> reminds me, we need to fix the vectorizer's if-conversion heuristic to use
>>>> profiling information too ;)
>>>>
>>>
>>> SLP vectorize? Any example like this? Can vectorizor be enhanced so that
>>> it can be done in absence of the hoisting?
>>>
>>> I'm referring to the loop vectorizer. I'm not sure it is an issue of
>>> potential enhancement, the issue seems fundamental. The "cost" of
>>> if-conversion is proportional to the increased execution cost, and that's
>>> proportional to the number of instructions in the conditionally-executed
>>> block. Even more so if the hoistable operations would need to be scalarized
>>> under the targeted ISA. The more instructions need to be if-converted, and
>>> the colder the block, the lower the profitability of vectorizing (because,
>>> once vectorized, the instructions now execute on every iteration).
>>>
>>>
>> This is a reasonable concern.
>>
>>
>>> In any case, for the same reason we don't want to throttle LICM based on
>>> register pressure, I don't think we want to do it here either. There is
>>> value is having a canonical form where things execute as soon as possible,
>>> because it means that we can use dominance to determine the frontier of
>>> legal placement. Otherwise we need to teach a myriad of other passes how to
>>> hoist in combination with their core functionality. Maybe we'll come up
>>> with some fundamental reason why sinking late is insufficient, but I don't
>>> think I've seen one yet.
>>>
>>
>>
>> Ok.
>>
>> thanks,
>>
>> David
>>
>>
>>>
>>> Thanks again,
>>> Hal
>>>
>>>
>>> thanks,
>>>
>>> David
>>>
>>>
>>>>
>>>> Thanks again,
>>>> Hal
>>>>
>>>> >
>>>> > Thanks,
>>>> > Dehao
>>>> >
>>>> >
>>>> > On Tue, May 10, 2016 at 1:58 PM, Hal Finkel < hfinkel at anl.gov >
>>>> > wrote:
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > From: "Xinliang David Li" < davidxl at google.com >
>>>> > To: "Dehao Chen" < danielcdh at gmail.com >
>>>> > Cc: reviews+D19950+public+38ba22078c2035b8 at reviews.llvm.org , "David
>>>> > Majnemer" < david.majnemer at gmail.com >, "Hal Finkel" <
>>>> > hfinkel at anl.gov >, "Junbum Lim" < junbuml at codeaurora.org >,
>>>> > mcrosier at codeaurora.org , "llvm-commits" <
>>>> > llvm-commits at lists.llvm.org >, "amara emerson" <
>>>> > amara.emerson at arm.com >
>>>> > Sent: Tuesday, May 10, 2016 3:15:24 PM
>>>> > Subject: Re: [PATCH] D19950: Use frequency info to guide Loop
>>>> > Invariant Code Motion.
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > On Tue, May 10, 2016 at 1:03 PM, Dehao Chen < danielcdh at gmail.com >
>>>> > wrote:
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > On Tue, May 10, 2016 at 11:48 AM, Xinliang David Li <
>>>> > davidxl at google.com > wrote:
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > On Tue, May 10, 2016 at 11:01 AM, Dehao Chen < danielcdh at gmail.com >
>>>> > wrote:
>>>> >
>>>> >
>>>> > danielcdh added a comment.
>>>> >
>>>> > In http://reviews.llvm.org/D19950#425287 , @hfinkel wrote:
>>>> >
>>>> > > In http://reviews.llvm.org/D19950#425286 , @hfinkel wrote:
>>>> > >
>>>> > > > In http://reviews.llvm.org/D19950#425285 , @davidxl wrote:
>>>> > > >
>>>> > > > > Static prediction has been conservative in estimating loop trip
>>>> > > > > count -- it produces something like 30ish iterations. If the a
>>>> > > > > very hot loop has a big if-then-else (or switch), it is very
>>>> > > > > likely to mark many bbs' to be colder than the loop header.
>>>> > > > > Turning on this for static prediction really depends on the
>>>> > > > > false rate. It seems to be this can get wrong pretty easily
>>>> > > > > for very hot loops (which is also the most important thing to
>>>> > > > > optimize for).
>>>> > > >
>>>> > > >
>>>> > > > This is a good point. There's no universal conservative choice
>>>> > > > (assuming a small trip count is conservative in some cases, and
>>>> > > > assuming a large trip count is conservative in other cases).
>>>> > >
>>>> > >
>>>> > > Would it be better (and practical) if there were some way for the
>>>> > > BFI client to specify which kind of 'conservative' is desired?
>>>> > >
>>>> > > Also, why are we doing this instead of sinking later (in CGP or
>>>> > > similar)? LICM can expose optimization opportunities, plus
>>>> > > represents a code pattern the user might input manually. Sinking
>>>> > > later seems more robust.
>>>> >
>>>> >
>>>> > I looked at CGP pass, looks like it's handling the sinking
>>>> > case-by-case (e.g. there is separate routine to handle sinking of
>>>> > load, gep, etc. I'm afraid this would miss opportunities.
>>>> > Additionally, the file-level comment of CGP pass says "This works
>>>> > around limitations in it's basic-block-at-a-time approach. It should
>>>> > eventually be removed."
>>>> > Yes, but it will be "removed" when the entire subsystem is replaced
>>>> > by GlobalISel, and we'll certainly need to make GlobalISel
>>>> > profiling-data aware, so I expect this is the right path forward
>>>> > regardless. I agree, however, that we want a general sinking here
>>>> > based on profiling data, not just the specific existing heuristics
>>>> > for loads, GEPs, etc.
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > Perhaps you can do profile driven sinking CGP separately to handle
>>>> > manually hoisted code situation mentioned by Hal.
>>>> >
>>>> >
>>>> > Do you mean we still use frequency to decide whether to hoist code in
>>>> > LICM, additionally use frequency info to check if we want to sink
>>>> > instructions in CGP?
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > yes -- that is the suggestion. I'd prefer that we try to sink late
>>>> > first, and only if there are use cases that we can't handle this
>>>> > way, we consider throttling hoisting early. If we come across such
>>>> > use cases, I'd like to understand them better. Hoisting can expose
>>>> > other optimization opportunities, and you lose those opportunities
>>>> > if you don't hoist in the first place.
>>>> >
>>>> > -Hal
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > David
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > Dehao
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > David
>>>> >
>>>> >
>>>> >
>>>> > I'm not quite clear why it helps to move code out of loop early and
>>>> > later sink it inside. Could you give an example or some more
>>>> > context?
>>>> >
>>>> > Thanks,
>>>> > Dehao
>>>> >
>>>> >
>>>> > http://reviews.llvm.org/D19950
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> >
>>>> > Hal Finkel
>>>> > Assistant Computational Scientist
>>>> > Leadership Computing Facility
>>>> > Argonne National Laboratory
>>>> >
>>>> >
>>>>
>>>> --
>>>> Hal Finkel
>>>> Assistant Computational Scientist
>>>> Leadership Computing Facility
>>>> Argonne National Laboratory
>>>>
>>>
>>>
>>>
>>>
>>> --
>>> Hal Finkel
>>> Assistant Computational Scientist
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>>
>>
>>
>
>
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20160731/aaa0608f/attachment.html>