[PATCH/RFC] Pre-increment preparation pass

Andrew Trick atrick at apple.com
Tue Feb 5 20:21:53 PST 2013


On Feb 5, 2013, at 6:29 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> 
>> The reason I didn't do that is that if the loop has other uses of the
>> chain's induction variable, it can find a better solution that reuses
>> the register.
> 
> This is certainly reasonable. The specific issue on PPC is that there is a post-isel/pre-ra pass that transfers simple inner-loop IVs into special registers for use with special decrement-compare-and-branch instructions. As a result, an IV which can be made otherwise unused except for iteration counting becomes essentially free. Hexagon should have a similar issue (with the small difference that Hexagon has post-inc instead of pre-inc instructions). Once this is accounted for, the cost-benefit analysis changes.
> 
> Alternatively, as you mention, we can often eliminate the IV entirely, transferring the comparison from using the iteration count to using some final pointer value (for example).

Ok. I agree that should be fed into LSR's cost model somehow.

>> But it would be great if you add that feature. It can be done in
>> CollectChains under a target hook. For example, when we do this:
>> 
>>  // Visit phi backedges to determine if the chain can generate the IV postinc.
>> 
>> We could instead check if we've found any profitable chains that are
>> worth completing (maybe ones that have no other users of the IV?).
>> Then we can check if an existing phi is present. If so, complete the
>> chain by adding the phi (no change to current behavior). If not, you
>> can create a phi based on the expression at the head and tail of the
>> chain after checking that the tail dominates the loop header, and
>> add the phi to the chain's tail (making a "complete" chain). The
>> subsequent LSR solver will have no knowledge of this phi, which is
>> not ideal, but at least it won't undo the chain.
> 
> Would this amount to speculatively adding the PHIs? I'll look at this.

Yes, it might make more sense to defer actually creating the phi until GenerateIVChain, after LSR does its IV rewriting.

Currently, the head of the chain is still considered an IVUser by LSR's solver and will be rewritten. The easy thing might be to add it to IVIncSet so LSR ignores it entirely, then directly generate a phi later that would only be used by that chain. The more sophisticated approach would be to force LSR's formulae to use a register holding the expression for the phi that we intend to create. If you do that, LSR's rewriter might then generate the phi. GenerateIVChain could then recognize the phi with the matching expression, but rewrite the back edge to use the result of the last instruction in the chain.
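Something like the following hypothetical helper is what I have in mind for the simple variant. It's only a sketch, not code that exists in LSR today: createChainPHI and the header paths are made up for illustration, insertion details vary by LLVM version, and the check that the chain tail dominates the latch is left to the caller.

  #include "llvm/IR/Instructions.h"                   // PHINode
  #include "llvm/Analysis/LoopInfo.h"                 // Loop
  #include "llvm/Analysis/ScalarEvolutionExpander.h"  // SCEVExpander (path varies by version)
  using namespace llvm;

  // Hypothetical sketch: materialize a header phi whose back-edge value is
  // the result of the last increment in an IV chain. LSR's solver never
  // sees this phi; only the chain uses it.
  static PHINode *createChainPHI(Loop *L, const SCEV *StartExpr,
                                 Instruction *ChainTailInc,
                                 SCEVExpander &Rewriter) {
    BasicBlock *Preheader = L->getLoopPreheader();
    BasicBlock *Latch = L->getLoopLatch();
    if (!Preheader || !Latch)
      return nullptr;

    // Expand the chain's start expression in the preheader.
    Value *Start = Rewriter.expandCodeFor(StartExpr, ChainTailInc->getType(),
                                          Preheader->getTerminator());

    PHINode *PN = PHINode::Create(ChainTailInc->getType(),
                                  /*NumReservedValues=*/2, "ivchain",
                                  &L->getHeader()->front());
    PN->addIncoming(Start, Preheader);
    // Caller must ensure ChainTailInc dominates the latch terminator.
    PN->addIncoming(ChainTailInc, Latch);
    return PN;
  }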

There is a congruent phi cleanup that runs at the end of LSR which could help… or might hurt if it chooses the non-chained form of the phi instead. It can be controlled with heuristics to choose the "better" phi.

>> High level question: does LSR actually benefit your target? One
>> option is to have a target hook that bypasses the LSR solver but
>> allows IV chains to be formed.
> 
> I've not experimented with turning it off. PPC does not have scaled addressing modes, so it does not benefit from that part of it, but it may at least benefit from the simplifications of the exit condition (if not other things).

It's partly a question of whether you tend to run out of registers on unrolled loops.

Simplifying the exit condition optimizes for a compare against zero. That tends to free a register, and it might also help set up your hardware loops.
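At the source level the effect is roughly the following (an illustrative analogy of what LSR does on the IR, not its actual output):

  // Counting up: the exit test compares the incremented IV against 'n',
  // which stays live in a register for the whole loop.
  long sumUp(const long *a, long n) {
    long sum = 0;
    for (long i = 0; i != n; ++i)
      sum += a[i];
    return sum;
  }

  // Counting down to zero: the compare needs no extra live value, and the
  // decrement-and-branch shape is what a hardware-loop pass can latch onto.
  long sumDown(const long *a, long n) {
    long sum = 0;
    for (long c = n; c != 0; --c)
      sum += *a++;
    return sum;
  }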

>>> 2. Second, we need to reward those candidates that have foldable
>>> increments. Looking at isProfitableChain, can we detect
>>> pre-increment chains by looking for chains where the final element
>>> has I->IncExpr->isZero()? Maybe this should decrease the cost by 1
>>> (to account for the now-missing post-increment add).
>> 
>> The decision between pre/post increment addressing is done in
>> DAGCombine. IVChains may currently tend to expose post-increment
>> addressing at the head and tail of the chain. You should adjust the
>> chain to compensate if your target prefers pre-increment for some
>> reason.
>> 
>> Incidentally, why are pre-increment loads superior? That's not
>> normally the case. Does it simply reduce critical path for indirect
>> loads? I would expect postinc to be better in case it guides the
>> hardware prefetcher.
> 
> Pre-increment loads are superior to post-increment loads on PPC because PPC lacks post-increment loads -- it is not a design decision that I would have made. ;)

Ok. That doesn't really explain why it's faster than an add->load. My guess is that it removes a cycle from the critical path thanks to a dedicated address-generation unit, as you hinted in your response (fewer instructions).
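For reference, the shape the target wants to fold is the difference between these two forms. This is only a source-level illustration; the actual folding happens in DAGCombine/isel:

  // add -> load: the address update is a separate instruction, so the load
  // waits on the add before it can issue.
  long next(long *&p) {
    p = p + 1;
    return *p;
  }

  // What an update-form (pre-increment) load expresses in one instruction:
  // compute p + 1, load from it, and write the new address back to p.
  long nextPreInc(long *&p) {
    return *++p;
  }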

-Andy

>> 
>>> 
>>>> 
>>>>>  // A complete chain likely eliminates the need for keeping the original IV in
>>>>>  // a register. LSR does not currently know how to form a complete chain unless
>>>>>  // the header phi already exists.
>>>>>  if (isa<PHINode>(Chain.tailUserInst())
>>>>>      && SE.getSCEV(Chain.tailUserInst()) == Chain.Incs[0].IncExpr) {
>>>>>    --cost;
>>>>>  }
>>>>> 
>>>>> Maybe I want some extra logic here as well? PPC (like Hexagon) has
>>>>> hardware loops that can hide inner-loop induction variables in
>>>>> special registers in simple cases.
>>>> 
>>>> Hopefully any hardware loop representation doesn't change the IR. I
>>>> think it should only affect machine code. Right?
>>> 
>>> Correct. The IR is not affected, but nevertheless, it essentially
>>> 'eliminates' the original IV after-the-fact. Maybe that should be
>>> accounted for in the cost function?
>> 
>> That's why isProfitableChain has this check:
>> 
>>  // A complete chain likely eliminates the need for keeping the original IV in
>>  // a register. LSR does not currently know how to form a complete chain unless
>>  // the header phi already exists.
>>  if (isa<PHINode>(Chain.tailUserInst())
>>      && SE.getSCEV(Chain.tailUserInst()) == Chain.Incs[0].IncExpr) {
>>    --cost;
>>  }
> 
> Okay, sounds good.
> 
>> 
>>>>> The TTI header has:
>>>>> /// isLegalAddressingMode - Return true if the addressing mode represented by
>>>>> /// AM is legal for this target, for a load/store of the specified type.
>>>>> /// The type may be VoidTy, in which case only return true if the addressing
>>>>> /// mode is legal for a load/store of any legal type.
>>>>> /// TODO: Handle pre/postinc as well.
>>>>> virtual bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV,
>>>>>                                    int64_t BaseOffset, bool HasBaseReg,
>>>>>                                    int64_t Scale) const;
>>>>> Would I need to extend this to remove the TODO, and then tie that
>>>>> into the costing function somehow? Perhaps the problem in general
>>>>> is that LSR has no special handling of pre/postinc modes.
>>>> 
>>>> Yes. This is where it gets tricky. Postinc formation depends on
>>>> multiple address users, so I'm not sure this API makes sense.
>>>> 
>>>> See DAGCombiner::CombineToPreIndexedLoadStore. It already uses a
>>>> separate target hook to do something similar.
>>> 
>>> Fair enough. Fundamentally, there needs to be an extension of the
>>> algorithm in order to deal with increment folding, and that does
>>> not currently exist.
>> 
>> The IVChain algorithm was meant to expose pre/post increment
>> addressing, but not to depend on it (because it's actually
>> irrelevant from the register pressure point of view, and we didn't
>> really have another way to measure "goodness"). Adding more target
>> information to guide the algorithm is the right thing to do.
> 
> Okay, that makes sense. In my case, my measure of goodness is a decrease in the total number of instructions because that equals fewer cycles.
> 
>> 
>>>>> Anything else that you can think of?
>>>> 
>>>> I honestly don't know what the IVChain heuristics should look like.
>>>> Start with making sure the framework can handle your cases. Then if
>>>> you decide to modify the pass we have to figure out a way to keep
>>>> all the other targets happy.
>>> 
>>> If I modify LSR to do this, then when pre/post-increments are not
>>> available, I think that the algorithm must reduce to the current
>>> one.
>> 
>> Realistically, yes. I can see other targets taking some advantage of
>> better pre/postinc addressing eventually though, which is why it's
>> good to continue evolving the current pass.
> 
> Sounds good. Thanks again,
> Hal
> 
>> 
>> -Andy
>> 
>>>> 
>>>>>> 
>>>>>> I haven't looked at how LSR handles vectorized loops yet… you may
>>>>>> find some interesting behavior.
>>>>> 
>>>>> Why would this be different from any other loops?
>>>> 
>>>> I don't know. If the induction variables are never type casted, then
>>>> I don't expect any fundamental problems. I'm always wary when IR
>>>> input to a pass changes significantly.
>>> 
>>> :)
>>> 
>>> Thanks again,
>>> Hal
>>> 
>>>> 
>>>> -Andy
>>>> 
>>>> 
>> 
>> 
