[PATCH] D19950: Use frequency info to guide Loop Invariant Code Motion.

Tue May 17 11:28:45 PDT 2016

I was trying to add a IR level sinking pass for this. But for my simple
testcase:

; int g;
; int foo(int p, int x) {
;   for (int i = 0; i != x; i++)
;     if (__builtin_expect(i == p, 0)) {
;       x += g; x *= g;
;     }
;   return x;
; }

After LICM hoists the load of global variable g, "x+=g;x*=g" is also
hoisted to form a select statement in header:

.lr.ph:                                           ; preds = %.lr.ph,
%.lr.ph.preheader
  %.03 = phi i32 [ %8, %.lr.ph ], [ 0, %.lr.ph.preheader ]
  %.012 = phi i32 [ %.1, %.lr.ph ], [ %1, %.lr.ph.preheader ]
  %5 = icmp eq i32 %.03, %0
*  %6 = add nsw i32 %4, %.012*
*  %7 = mul nsw i32 %6, %4*
*  %.1 = select i1 %5, i32 %7, i32 %.012, !prof !1*
  %8 = add nuw nsw i32 %.03, 1
  %9 = icmp eq i32 %8, %.1
  br i1 %9, label %._crit_edge, label %.lr.ph

Later in CGP, it's expanded to:

.lr.ph:                                           ; preds = %select.end,
%.lr.ph.preheader
  %.03 = phi i32 [ %8, %select.end ], [ 0, %.lr.ph.preheader ]
  %.012 = phi i32 [ %.1, %select.end ], [ %1, %.lr.ph.preheader ]
  %5 = icmp eq i32 %0, %.03
*  %6 = add nsw i32 %.012, %4*
*  %7 = mul nsw i32 %6, %4*
  br i1 %5, label %select.end, label %select.false

select.false:                                     ; preds = %.lr.ph
  br label %select.end

select.end:                                       ; preds = %.lr.ph,
%select.false
*  %.1 = phi i32 [ %7, %.lr.ph <http://lr.ph> ], [ %.012, %select.false ]*
  %8 = add nuw nsw i32 %.03, 1
  %9 = icmp eq i32 %8, %.1
  br i1 %9, label %._crit_edge, label %.lr.ph

So if we want to sink the load from preheader to the branch, we need to:
1. create a select.true basic block
2. sink %6 and %7 to that basic block
3. sink load instruction to that basic block

This is doable, but seems doing a lot of redundant work of MachineSinking
pass. And it could possibly affect optimizations between CGP and
MachineSinking.

Alternatively, we can also do the sinking once in MachineSinking, and make
it handle more general cases like this one. But from the MachineSink.cpp, I
saw the comment:

// This pass is not intended to be a replacement or a complete alternative
// for an LLVM-IR-level sinking pass. It is only designed to sink simple
// constructs that are not exposed before lowering and instruction
selection.

Which confuses me: looks like we don't even have a IR-level sinking pass.
Why would we want IR-level sinking pass instead of one single sinking pass
at machine level?

Thanks,
Dehao

On Tue, May 17, 2016 at 11:03 AM, Xinliang David Li <davidxl at google.com>
wrote:

>
>
> On Thu, May 12, 2016 at 6:37 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
>>
>> ------------------------------
>>
>> *From: *"Xinliang David Li" <davidxl at google.com>
>> *To: *"Hal Finkel" <hfinkel at anl.gov>
>> *Cc: *"Dehao Chen" <danielcdh at gmail.com>,
>> reviews+D19950+public+38ba22078c2035b8 at reviews.llvm.org, "David
>> Majnemer" <david.majnemer at gmail.com>, "Junbum Lim" <
>> junbuml at codeaurora.org>, mcrosier at codeaurora.org, "llvm-commits" <
>> llvm-commits at lists.llvm.org>, "amara emerson" <amara.emerson at arm.com>
>> *Sent: *Wednesday, May 11, 2016 12:58:24 PM
>> *Subject: *Re: [PATCH] D19950: Use frequency info to guide Loop
>> Invariant Code Motion.
>>
>>
>>
>> On Tue, May 10, 2016 at 3:14 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>
>>> ------------------------------
>>>
>>> > From: "Dehao Chen" <danielcdh at gmail.com>
>>> > To: "Hal Finkel" <hfinkel at anl.gov>
>>> > Cc: "Xinliang David Li" <davidxl at google.com>,
>>> reviews+D19950+public+38ba22078c2035b8 at reviews.llvm.org, "David
>>> > Majnemer" <david.majnemer at gmail.com>, "Junbum Lim" <
>>> junbuml at codeaurora.org>, mcrosier at codeaurora.org, "llvm-commits"
>>> > <llvm-commits at lists.llvm.org>, "amara emerson" <amara.emerson at arm.com>
>>> > Sent: Tuesday, May 10, 2016 4:10:49 PM
>>> > Subject: Re: [PATCH] D19950: Use frequency info to guide Loop
>>> Invariant Code Motion.
>>> >
>>> >
>>> > Thanks for the comment. I spent quite a while to think, but still
>>> > cannot think of an optimization that could be unblocked by
>>> > speculatively hoisting an loop invariant from an unlikely executed
>>> > path. Can you give some hint (or an example) on what type of
>>> > optimization can benefit from this case?
>>>
>>> I'm specifically thinking about this case (although I suspect there are
>>> others):
>>>
>>>  for (...) {
>>>    if (...) {
>>>      hoistable
>>>      cold_stuff
>>>    }
>>>  }
>>>
>>>  for (...) {
>>>    if (...) {
>>>      hoistable
>>>      hot_stuff
>>>    }
>>>  }
>>>
>>> I expect that 'hoistable' will be hoisted by LICM out of both loops, and
>>> then CSE'd by GVN.
>>
>>
>>
>> I think this case can be/should be handled by more general profile driver
>> speculative PRE.  The above case may not be profitable even after GVN CSEed
>> two expressions.
>>
>> Why not? It is performance neutral and a code-size improvement.
>>
>
>
> It may not be performance neutral for the above case -- it depends on how
> cold the if-block is.
>
>
>
>> .
>>
>> Loop unswitching could certainly be smarter about how it estimates the
>> cost of the loop body, but if there are nested conditions, I don't think
>> that always helps. For example, if we have this:
>>
>> for (...) {
>>   if (c1) {
>>     ...
>>     if (c2) {
>>       hoistable
>>       cold_stuff
>>     }
>>     ...
>>   }
>>
>>   ...
>> }
>>
>> then the code-size cost of unswitching on condition c1 is rightly
>> impacted by whether the hoistable code is still in the loop body.
>>
>>
> Ok.
>
>
>
>>
>>
>>>
>>> Another potential issue is that the hoistable code might be cold, and
>>> relatively cheap to hoist, but expensive to vectorize. As a result, failing
>>> to hoist the code might block otherwise-profitable vectorization. Which
>>> reminds me, we need to fix the vectorizer's if-conversion heuristic to use
>>> profiling information too ;)
>>>
>>
>> SLP vectorize? Any example like this? Can vectorizor be enhanced so that
>> it can be done in absence of the hoisting?
>>
>> I'm referring to the loop vectorizer. I'm not sure it is an issue of
>> potential enhancement, the issue seems fundamental. The "cost" of
>> if-conversion is proportional to the increased execution cost, and that's
>> proportional to the number of instructions in the conditionally-executed
>> block. Even more so if the hoistable operations would need to be scalarized
>> under the targeted ISA. The more instructions need to be if-converted, and
>> the colder the block, the lower the profitability of vectorizing (because,
>> once vectorized, the instructions now execute on every iteration).
>>
>>
> This is a reasonable concern.
>
>
>> In any case, for the same reason we don't want to throttle LICM based on
>> register pressure, I don't think we want to do it here either. There is
>> value is having a canonical form where things execute as soon as possible,
>> because it means that we can use dominance to determine the frontier of
>> legal placement. Otherwise we need to teach a myriad of other passes how to
>> hoist in combination with their core functionality. Maybe we'll come up
>> with some fundamental reason why sinking late is insufficient, but I don't
>> think I've seen one yet.
>>
>
>
> Ok.
>
> thanks,
>
> David
>
>
>>
>> Thanks again,
>> Hal
>>
>>
>> thanks,
>>
>> David
>>
>>
>>>
>>> Thanks again,
>>> Hal
>>>
>>> >
>>> > Thanks,
>>> > Dehao
>>> >
>>> >
>>> > On Tue, May 10, 2016 at 1:58 PM, Hal Finkel < hfinkel at anl.gov >
>>> > wrote:
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > From: "Xinliang David Li" < davidxl at google.com >
>>> > To: "Dehao Chen" < danielcdh at gmail.com >
>>> > Cc: reviews+D19950+public+38ba22078c2035b8 at reviews.llvm.org , "David
>>> > Majnemer" < david.majnemer at gmail.com >, "Hal Finkel" <
>>> > hfinkel at anl.gov >, "Junbum Lim" < junbuml at codeaurora.org >,
>>> > mcrosier at codeaurora.org , "llvm-commits" <
>>> > llvm-commits at lists.llvm.org >, "amara emerson" <
>>> > amara.emerson at arm.com >
>>> > Sent: Tuesday, May 10, 2016 3:15:24 PM
>>> > Subject: Re: [PATCH] D19950: Use frequency info to guide Loop
>>> > Invariant Code Motion.
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > On Tue, May 10, 2016 at 1:03 PM, Dehao Chen < danielcdh at gmail.com >
>>> > wrote:
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > On Tue, May 10, 2016 at 11:48 AM, Xinliang David Li <
>>> > davidxl at google.com > wrote:
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > On Tue, May 10, 2016 at 11:01 AM, Dehao Chen < danielcdh at gmail.com >
>>> > wrote:
>>> >
>>> >
>>> > danielcdh added a comment.
>>> >
>>> > In http://reviews.llvm.org/D19950#425287 , @hfinkel wrote:
>>> >
>>> > > In http://reviews.llvm.org/D19950#425286 , @hfinkel wrote:
>>> > >
>>> > > > In http://reviews.llvm.org/D19950#425285 , @davidxl wrote:
>>> > > >
>>> > > > > Static prediction has been conservative in estimating loop trip
>>> > > > > count -- it produces something like 30ish iterations. If the a
>>> > > > > very hot loop has a big if-then-else (or switch), it is very
>>> > > > > likely to mark many bbs' to be colder than the loop header.
>>> > > > > Turning on this for static prediction really depends on the
>>> > > > > false rate. It seems to be this can get wrong pretty easily
>>> > > > > for very hot loops (which is also the most important thing to
>>> > > > > optimize for).
>>> > > >
>>> > > >
>>> > > > This is a good point. There's no universal conservative choice
>>> > > > (assuming a small trip count is conservative in some cases, and
>>> > > > assuming a large trip count is conservative in other cases).
>>> > >
>>> > >
>>> > > Would it be better (and practical) if there were some way for the
>>> > > BFI client to specify which kind of 'conservative' is desired?
>>> > >
>>> > > Also, why are we doing this instead of sinking later (in CGP or
>>> > > similar)? LICM can expose optimization opportunities, plus
>>> > > represents a code pattern the user might input manually. Sinking
>>> > > later seems more robust.
>>> >
>>> >
>>> > I looked at CGP pass, looks like it's handling the sinking
>>> > case-by-case (e.g. there is separate routine to handle sinking of
>>> > load, gep, etc. I'm afraid this would miss opportunities.
>>> > Additionally, the file-level comment of CGP pass says "This works
>>> > around limitations in it's basic-block-at-a-time approach. It should
>>> > eventually be removed."
>>> > Yes, but it will be "removed" when the entire subsystem is replaced
>>> > by GlobalISel, and we'll certainly need to make GlobalISel
>>> > profiling-data aware, so I expect this is the right path forward
>>> > regardless. I agree, however, that we want a general sinking here
>>> > based on profiling data, not just the specific existing heuristics
>>> > for loads, GEPs, etc.
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > Perhaps you can do profile driven sinking CGP separately to handle
>>> > manually hoisted code situation mentioned by Hal.
>>> >
>>> >
>>> > Do you mean we still use frequency to decide whether to hoist code in
>>> > LICM, additionally use frequency info to check if we want to sink
>>> > instructions in CGP?
>>> >
>>> >
>>> >
>>> >
>>> > yes -- that is the suggestion. I'd prefer that we try to sink late
>>> > first, and only if there are use cases that we can't handle this
>>> > way, we consider throttling hoisting early. If we come across such
>>> > use cases, I'd like to understand them better. Hoisting can expose
>>> > other optimization opportunities, and you lose those opportunities
>>> > if you don't hoist in the first place.
>>> >
>>> > -Hal
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > David
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > Dehao
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > David
>>> >
>>> >
>>> >
>>> > I'm not quite clear why it helps to move code out of loop early and
>>> > later sink it inside. Could you give an example or some more
>>> > context?
>>> >
>>> > Thanks,
>>> > Dehao
>>> >
>>> >
>>> > http://reviews.llvm.org/D19950
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> >
>>> > Hal Finkel
>>> > Assistant Computational Scientist
>>> > Leadership Computing Facility
>>> > Argonne National Laboratory
>>> >
>>> >
>>>
>>> --
>>> Hal Finkel
>>> Assistant Computational Scientist
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>>
>>
>>
>>
>>
>> --
>> Hal Finkel
>> Assistant Computational Scientist
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20160517/a166a46e/attachment.html>