[LLVMdev] Separating loop nests based on profile information?

Philip Reames listmail at philipreames.com
Mon Jan 12 09:48:30 PST 2015


On 01/11/2015 10:17 PM, Daniel Berlin wrote:
>
>
> On Thu, Jan 8, 2015 at 3:33 PM, Philip Reames 
> <listmail at philipreames.com <mailto:listmail at philipreames.com>> wrote:
>
>
>     On 01/07/2015 05:33 PM, Chandler Carruth wrote:
>
>         How does this compare with classical approaches of loop
>         peeling, partitioning, fission, or whatever you might call it?
>
>     I'm still developing a good sense for this, but let me lay out
>     some observations.  Fair warning, this is going to be long and
>     rambling.
>
>     Let's start with a toy example:
>     while(c) {
>       x = this->x;
>       y = this->y;
>       if (x == y) {
>         rare();
>       }
>     }
>
>
>     If we could tell x and y were loop invariant, we could unswitch
>     this loop.  However, the rare call clobbers our view of memory, so
>     LICM fails, and thus unswitch fails.
>
>
>     We'd like to apply PRE here and push the reload into the loop
>     preheader and the rare case.  This currently fails for two
>     reasons: 1) We don't allow code duplication during PRE, 
>
>
> ?????
> If we don't, we aren't doing real PRE, so I'm not entirely sure what
> you mean here.
As I think you comment elsewhere in your response, we currently do not 
duplicate loads; we only move them.  The current structure is based 
around finding a legal point which doesn't introduce a load along any 
path where one didn't exist previously.  If we have a path which has two 
copies of the load, we try to find a place to move one of them such that 
all paths still have the available load and we've removed the extra load 
along the one path.

(Note: The above explanation is rough and does not parallel how the code 
is organized.)


>
>     and 2) the canonical loop-simplify form of having a single latch
>     means that PRE doesn't actually see the rare block at all; it sees
>     only the preheader and the join point after the if block.
>
>
>
>
>     I think both of the previous are fixable:
>
>
>
> GCC's PRE already does the above.
> It doesn't do profile guided duplication.
> We aren't doing anything special with these blocks.
>
> here is the code I used to test:
>
> struct a {
>     int x;
>     int y;
> };
> extern void rare();
> int mai(int c, struct a *this)
> {
>     int d = 0;
>     while (c) {
>         int x = this->x;
>         int y = this->y;
>         d += x + y;
>         if (x == y) {
>             rare();
>         }
>     }
>     return d;
> }
>
>
> It will do exactly what you expect; it is transformed into:
>
> struct a {
>     int x;
>     int y;
> };
> extern void rare();
> int mai(int c, struct a *this)
> {
>     int d = 0;
>     int pretemp1 = this->x;
>     int pretemp2 = this->y;
>     while (c) {
>         pretemp1phi = phi(rare-block pretemp1, preheader pretemp1);
>         pretemp2phi = phi(rare-block pretemp2, preheader pretemp2);
>         d += pretemp1phi + pretemp2phi;
>         if (pretemp1phi == pretemp2phi) {
>             rare();
>             pretemp1 = this->x;
>             pretemp2 = this->y;
>         }
>     }
>     return d;
> }
> I don't see why profile guided duplication is necessary here. This is 
> a basic load PRE case.  It is handled by the first version of 
> GVN-based Load PRE I wrote for GCC.  It is always a win.
Well, to be pedantic, it is not always profitable.  If this loop runs 
exactly one iteration, the transform is both a code-size pessimization 
and possibly a runtime pessimization.  While in this particular case 
you probably don't care, I'd worry that an algorithm which catches this 
might also pessimize other cases in more worrying ways.

However, I think I've actually come to the same underlying conclusion.  
This is a case that our PRE algorithm should handle without needing to 
resort to profiling data.

I have to admit that I am unfamiliar with the GCC implementation.  Could 
you outline the algorithm and its important concepts?  (Links are fine too.)

>
> Looking at what LLVM does, the failing on the PRE side is that our 
> PRE/GVN models are not strong enough to handle this. I'm not sure at 
> all why we think anything else is necessary.  It's certainly not 
> requiring special code duplication heuristics, etc.
>
> So either you are thinking of a case that is different from the above, 
> or I am seriously confused :)
Well, I think we've gotten to the same point here.  An improved PRE 
implementation should handle most of the cases I've actually 
encountered.  Having said that, I'm not yet willing to say that 
profile-guided loop splitting isn't *also* worth exploring.  A lot of 
our existing optimizations are structured around loops; letting those 
kick in without falling back to (expensive) GVN might be worthwhile in 
practice.  This is as much a pass-ordering problem as anything else.

p.s. I've started to post patches which improve the results given by our 
current PRE to the llvm-commits list.  So far, it's just simple stuff, 
but that's the direction I'm currently working in.  I'm going to defer 
looking into the loop nest splitting until I've pushed PRE as far as I 
easily can.

Philip
