[LLVMdev] Separating loop nests based on profile information?
Daniel Berlin
dberlin at dberlin.org
Mon Jan 12 12:38:31 PST 2015
On Mon, Jan 12, 2015 at 12:48 PM, Philip Reames <listmail at philipreames.com>
wrote:
>
> On 01/11/2015 10:17 PM, Daniel Berlin wrote:
>
>
>
> On Thu, Jan 8, 2015 at 3:33 PM, Philip Reames <listmail at philipreames.com>
> wrote:
>
>>
>> On 01/07/2015 05:33 PM, Chandler Carruth wrote:
>>
>>> How does this compare with classical approaches of loop peeling,
>>> partitioning, fission, or whatever you might call it?
>>>
>> I'm still developing a good sense for this, but let me lay out some
>> observations. Fair warning, this is going to be long and rambling.
>>
>> Let's start with a toy example:
>> while (c) {
>>   x = this->x;
>>   y = this->y;
>>   if (x == y) {
>>     rare();
>>   }
>> }
>>
>>
>
>
>> If we could tell x and y were loop invariant, we could unswitch this
>> loop. However, the rare call clobbers our view of memory, so LICM fails,
>> and thus unswitch fails.
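>>
>> For illustration (a hand-written sketch, not output from any pass), if
>> x and y were provably invariant, the unswitched shape we'd want is
>> roughly:
>>
>> x = this->x;
>> y = this->y;
>> if (x == y) {
>>   while (c) { rare(); }   /* the branch is taken every iteration */
>> } else {
>>   while (c) { }           /* the rare path is dead in this copy */
>> }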
>>
>>
>
>> We'd like to apply PRE here and push the reload into the loop preheader
>> and the rare case. This currently fails for two reasons: 1) We don't allow
>> code duplication during PRE,
>
>
> ?????
> If we don't, we aren't doing real PRE. So I'm not entirely sure what you
> mean here.
>
> As I think you comment elsewhere in your response, we currently do not
> duplicate loads; we only move them. The current structure is based around
> finding a legal point which doesn't introduce a load along any path where
> one didn't exist previously. If we have a path which has two copies of the
> load, we try to find a place to move one of them such that all paths still
> have the available load and we've removed the extra load along the one
> path.
>
> (Note: The above explanation is rough and does not parallel how the code
> is organized.)
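>
> As a source-level illustration of that constraint (a sketch, not how the
> GVN code itself is written; 'p' and 'cond' are just placeholders):
>
>   /* Before: the path through the 'if' loads *p twice (once inside,
>      once after the join); the other path loads it once. */
>   int before(const int *p, int cond) {
>     int s = 0;
>     if (cond)
>       s += *p;
>     s += *p;       /* redundant only along the 'if' path */
>     return s;
>   }
>
>   /* After the kind of motion described above: every path still
>      performs exactly one load, and the duplicate on the 'if' path
>      is gone. */
>   int after(const int *p, int cond) {
>     int s = 0, tmp;
>     if (cond) {
>       tmp = *p;
>       s += tmp;
>     } else {
>       tmp = *p;    /* placed here; this path still has just one load */
>     }
>     s += tmp;
>     return s;
>   }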
>
>> and 2) the canonical loop-simplify form of having a single latch means
>> that PRE doesn't actually see the rare block at all; it only sees the
>> preheader and the join point after the if block.
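>>
>> (Roughly, with hypothetical block names, the simplified loop looks like:
>>
>>   preheader -> header
>>   header    -> body, exit      /* tests c */
>>   body      -> rare, latch     /* tests x == y */
>>   rare      -> latch
>>   latch     -> header
>>
>> The rare block only feeds the single latch, so a pass reasoning about
>> the header and its predecessors, the preheader and the latch, never
>> considers the rare block as a place to put a reload.)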
>
>
>
>
>>
>> I think both of the previous are fixable:
>>
>
>
> GCC's PRE already does the above.
> It doesn't do profile guided duplication.
> We aren't doing anything special with these blocks.
>
> here is the code I used to test:
>
> struct a {
>   int x;
>   int y;
> };
> extern void rare();
> int mai(int c, struct a *this)
> {
>   int d = 0;
>   while (c) {
>     int x = this->x;
>     int y = this->y;
>     d += x + y;
>     if (x == y) {
>       rare();
>     }
>   }
>   return d;
> }
>
>
> It will do exactly what you expect; it is transformed into:
>
> struct a {
>   int x;
>   int y;
> };
> extern void rare();
> int mai(int c, struct a *this)
> {
>   int d = 0;
>   int pretemp1 = this->x;
>   int pretemp2 = this->y;
>
>   while (c) {
>     pretemp1phi = phi(rare-block pretemp1, preheader pretemp1);
>     pretemp2phi = phi(rare-block pretemp2, preheader pretemp2);
>
>     d += pretemp1phi + pretemp2phi;
>     if (pretemp1phi == pretemp2phi) {
>       rare();
>       pretemp1 = this->x;
>       pretemp2 = this->y;
>     }
>   }
>   return d;
> }
>
> I don't see why profile guided duplication is necessary here. This is a
> basic load PRE case. It is handled by the first version of GVN-based Load
> PRE I wrote for GCC. It is always a win.
>
> Well, to be pedantic, it is not always profitable. If this loop runs
> exactly one iteration, this is both a code size pessimization and possibly
> a runtime execution pessimization.
>
Sure, but it is a lifetime optimal PRE case :)
> While in this particular case, you probably don't care, I'd worry that an
> algorithm which gets this might also have other pessimization cases which
> are more worrying.
>
Then you shouldn't use any basic PRE at all.
It will always make these decisions. Profile guided PRE implementations
are more complex.
>
> However, I think I've actually come to the same underlying conclusion.
> This is a case that our PRE algorithm should handle without needing to
> resort to profiling data.
>
> I have to admit that I am unfamiliar with the GCC implementation. Could
> you outline the algorithm and its important concepts? (Links are fine too.)
>
It uses an SCC-based value numbering implementation
(http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.1723), builds
value expressions, and then performs GVNPRE on those value expressions
(https://www.cs.purdue.edu/homes/hosking/papers/cc04.pdf).
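
(Very roughly, and from memory rather than the actual GCC sources, the
scheme in that paper is:

  /* 1. Value-number every expression (SCC-based VN), so lexically
        different expressions with the same value share one value id.
     2. For each block, compute AVAIL_OUT (values already computed by
        the end of the block) and ANTIC_IN (values computed on every
        path from the block's entry).
     3. At each merge, a value that is anticipated in the block but not
        available out of some predecessor gets inserted at the end of
        that predecessor, and a phi of the per-predecessor copies feeds
        the later uses.
     4. Fully redundant computations are then eliminated using the
        value numbers.                                                */

In the example above it is step 3 that places the reloads of this->x and
this->y in the preheader and at the end of the rare block.)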
>
>
>
> Looking at what LLVM does, the failure on the PRE side is that our
> PRE/GVN models are not strong enough to handle this. I'm not sure at all
> why we think anything else is necessary. It certainly doesn't require
> special code duplication heuristics, etc.
>
> So either you are thinking of a case that is different from the above,
> or I am seriously confused :)
>
> Well, I think we've gotten to the same point here. An improved PRE
> implementation should handle most of the cases I've actually encountered.
> Having said that, I'm not yet willing to say that the profile guided loop
> splitting isn't *also* worth exploring. A lot of our existing
> optimizations are structured around loops; letting those kick in without
> falling back to (expensive) GVN might be worthwhile from a practical
> perspective. This is as much a pass ordering problem as anything else.
>
> P.S. I've started to post patches to the llvm-commits list which improve
> the results given by our current PRE. So far, it's just simple stuff, but
> that's the direction I'm currently working in. I'm going to defer looking
> into the loop nest splitting until I've pushed PRE as far as I easily can.
>
I truly think you would be better off coming up with a real PRE
implementation than trying to push the current one, but it's your time :)
The current PRE is an iterative PRE with a lot of limitations.
>
> Philip
>
>