[LLVMdev] Separating loop nests based on profile information?
Daniel Berlin
dberlin at dberlin.org
Mon Jan 12 12:38:31 PST 2015
On Mon, Jan 12, 2015 at 12:48 PM, Philip Reames <listmail at philipreames.com>
wrote:
>
> On 01/11/2015 10:17 PM, Daniel Berlin wrote:
>
>
>
> On Thu, Jan 8, 2015 at 3:33 PM, Philip Reames <listmail at philipreames.com>
> wrote:
>
>>
>> On 01/07/2015 05:33 PM, Chandler Carruth wrote:
>>
>>> How does this compare with classical approaches of loop peeling,
>>> partitioning, fission, or whatever you might call it?
>>>
>> I'm still developing a good sense for this, but let me lay out some
>> observations. Fair warning, this is going to be long and rambling.
>>
>> Let's start with a toy example:
>> while (c) {
>>   x = this->x;
>>   y = this->y;
>>   if (x == y) {
>>     rare();
>>   }
>> }
>>
>>
>
>
>> If we could tell x and y were loop invariant, we could unswitch this
>> loop. However, the rare call clobbers our view of memory, so LICM fails,
>> and thus unswitch fails.
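>>
>> For illustration (a hand-written sketch, not output from any pass), if
>> x and y were provably invariant, the unswitched shape we'd want is
>> roughly:
>>
>> x = this->x;
>> y = this->y;
>> if (x == y) {
>>   while (c) { rare(); }   /* the branch is taken every iteration */
>> } else {
>>   while (c) { }           /* the rare path is dead in this copy */
>> }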
>>
>>
>
>> We'd like to apply PRE here and push the reload into the loop preheader
>> and the rare case. This currently fails for two reasons: 1) We don't allow
>> code duplication during PRE,
>
>
> ?????
> If we don't, we aren't doing real PRE. So I'm not entirely sure what you
> mean here.
>
> As I think you comment elsewhere in your response, we currently do not
> duplicate loads; we only move them. The current structure is based around
> finding a legal point which doesn't introduce a load along any path where
> one didn't exist previously. If we have a path which has two copies of the
> load, we try to find a place to move one of them such that all paths still
> have the available load and we've removed the extra load along the one
> path.
>
> (Note: The above explanation is rough and does not parallel how the code
> is organized.)
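>
> As a source-level illustration of that constraint (a sketch, not how the
> GVN code itself is written; 'p' and 'cond' are just placeholders):
>
>   /* Before: the path through the 'if' loads *p twice (once inside,
>      once after the join); the other path loads it once. */
>   int before(const int *p, int cond) {
>     int s = 0;
>     if (cond)
>       s += *p;
>     s += *p;       /* redundant only along the 'if' path */
>     return s;
>   }
>
>   /* After the kind of motion described above: every path still
>      performs exactly one load, and the duplicate on the 'if' path
>      is gone. */
>   int after(const int *p, int cond) {
>     int s = 0, tmp;
>     if (cond) {
>       tmp = *p;
>       s += tmp;
>     } else {
>       tmp = *p;    /* placed here; this path still has just one load */
>     }
>     s += tmp;
>     return s;
>   }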
>
>> and 2) the canonical loop-simplify form of having a single latch means
>> that PRE doesn't actually see the rare block at all; it only sees the
>> preheader and the join point after the if block.
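>>
>> (Roughly, with hypothetical block names, the simplified loop looks like:
>>
>>   preheader -> header
>>   header    -> body, exit      /* tests c */
>>   body      -> rare, latch     /* tests x == y */
>>   rare      -> latch
>>   latch     -> header
>>
>> The rare block only feeds the single latch, so a pass reasoning about
>> the header and its predecessors, the preheader and the latch, never
>> considers the rare block as a place to put a reload.)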
>
>
>
>
>>
>> I think both of the previous are fixable:
>>
>
>
> GCC's PRE already does the above.
> It doesn't do profile guided duplication.
> We aren't doing anything special with these blocks.
>
> here is the code I used to test:
>
> struct a {
>   int x;
>   int y;
> };
> extern void rare();
> int mai(int c, struct a *this)
> {
>   int d = 0;
>   while (c) {
>     int x = this->x;
>     int y = this->y;
>     d += x + y;
>     if (x == y) {
>       rare();
>     }
>   }
>   return d;
> }
>
>
> It will do exactly what you expect; it is transformed into:
>
> struct a {
>   int x;
>   int y;
> };
> extern void rare();
> int mai(int c, struct a *this)
> {
>   int d = 0;
>   int pretemp1 = this->x;
>   int pretemp2 = this->y;
>
>   while (c) {
>     pretemp1phi = phi(rare-block pretemp1, preheader pretemp1);
>     pretemp2phi = phi(rare-block pretemp2, preheader pretemp2);
>
>     d += pretemp1phi + pretemp2phi;
>     if (pretemp1phi == pretemp2phi) {
>       rare();
>       pretemp1 = this->x;
>       pretemp2 = this->y;
>     }
>   }
>   return d;
> }
>
> I don't see why profile guided duplication is necessary here. This is a
> basic load PRE case. It is handled by the first version of GVN-based Load
> PRE I wrote for GCC. It is always a win.
>
> Well, to be pedantic, it is not always profitable. If this loop runs
> exactly one iteration, this is both a code size pessimization and possibly
> a runtime execution pessimization.
>
Sure, but it is a lifetime optimal PRE case :)
> While in this particular case, you probably don't care, I'd worry that an
> algorithm which gets this might also have other pessimization cases which
> are more worrying.
>
Then you shouldn't use any basic PRE at all.
It will always make these decisions. Profile guided PRE implementations
are more complex.
>
> However, I think I've actually come to the same underlying conclusion.
> This is a case that our PRE algorithm should handle without needing to
> resort to profiling data.
>
> I have to admit that I am unfamiliar with the GCC implementation. Could
> you outline the algorithm and its important concepts? (Links are fine too.)
>
It uses an SCC-based value numbering implementation
(http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.1723), builds
value expressions, and then performs GVNPRE on those value expressions
(https://www.cs.purdue.edu/homes/hosking/papers/cc04.pdf).
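
(Very roughly, and from memory rather than the actual GCC sources, the
scheme in that paper is:

  /* 1. Value-number every expression (SCC-based VN), so lexically
        different expressions with the same value share one value id.
     2. For each block, compute AVAIL_OUT (values already computed by
        the end of the block) and ANTIC_IN (values computed on every
        path from the block's entry).
     3. At each merge, a value that is anticipated in the block but not
        available out of some predecessor gets inserted at the end of
        that predecessor, and a phi of the per-predecessor copies feeds
        the later uses.
     4. Fully redundant computations are then eliminated using the
        value numbers.                                                */

In the example above it is step 3 that places the reloads of this->x and
this->y in the preheader and at the end of the rare block.)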
>
>
>
> Looking at what LLVM does, the failure on the PRE side is that our
> PRE/GVN models are not strong enough to handle this. I'm not sure at all
> why we think anything else is necessary. It certainly doesn't require
> special code duplication heuristics, etc.
>
> So either you are thinking of a case that is different from the above,
> or I am seriously confused :)
>
> Well, I think we've gotten to the same point here. An improved PRE
> implementation should handle most of the cases I've actually encountered.
> Having said that, I'm not yet willing to say that the profile guided loop
> splitting isn't *also* worth exploring. A lot of our existing
> optimizations are structured around loops; letting those kick in without
> falling back to (expensive) GVN might be worthwhile from a practical
> perspective. This is as much a pass ordering problem as anything else.
>
> P.S. I've started to post patches to the llvm-commits list which improve
> the results given by our current PRE. So far, it's just simple stuff, but
> that's the direction I'm currently working in. I'm going to defer looking
> into the loop nest splitting until I've pushed PRE as far as I easily can.
>
I truly think you would be better off coming up with a real PRE
implementation than trying to push the current one, but it's your time :)
The current PRE is an iterative PRE with a lot of limitations.
>
> Philip
>
>