[PATCH/RFC] Pre-increment preparation pass

Andrew Trick atrick at apple.com
Tue Feb 5 17:14:40 PST 2013


On Feb 5, 2013, at 4:11 PM, Hal Finkel <hfinkel at anl.gov> wrote:

> 
> 
> ----- Original Message -----
>> From: "Andrew Trick" <atrick at apple.com>
>> To: "Hal Finkel" <hfinkel at anl.gov>
>> Cc: "Commit Messages and Patches for LLVM" <llvm-commits at cs.uiuc.edu>, "Jakob Stoklund Olesen" <stoklund at 2pi.dk>
>> Sent: Tuesday, February 5, 2013 12:33:29 PM
>> Subject: Re: [PATCH/RFC] Pre-increment preparation pass
>> 
>> 
>> On Feb 4, 2013, at 11:06 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>> 
>>> ----- Original Message -----
>>>> From: "Andrew Trick" <atrick at apple.com>
>>>> To: "Hal Finkel" <hfinkel at anl.gov>
>>>> Cc: "Commit Messages and Patches for LLVM"
>>>> <llvm-commits at cs.uiuc.edu>, "Jakob Stoklund Olesen"
>>>> <stoklund at 2pi.dk>
>>>> Sent: Monday, February 4, 2013 10:52:30 PM
>>>> Subject: Re: [PATCH/RFC] Pre-increment preparation pass
>>>> 
>>>> 
>>>> On Feb 4, 2013, at 7:33 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>>>> 
>>>>> If you don't mind, I'd appreciate some more specific advice.
>>>>> First,
>>>>> is the current implementation of LSR capable of performing this
>>>>> transformation:
>>>>>>> for (int i = 0; i < N; ++i) {
>>>>>>>   x[i] = y[i];
>>>>>>> }
>>>>>>> needs to be transformed to look more like this:
>>>>>>> T *a = &x[-1], *b = &y[-1];
>>>>>>> for (int i = 0; i < N; ++i) {
>>>>>>>   *++a = *++b;
>>>>>>> }
>>>>> or is this what the "straight-line address-chain formation pass"
>>>>> you imagine would do? If LSR can do this, what contributes to the
>>>>> decision of whether or not it should be done? In some sense, this
>>>>> is the most important part because this is what enables using the
>>>>> pre-increment forms in the first place. Convincing LSR to
>>>>> otherwise form the chains in unrolled loops seems to be a function
>>>>> of the chain cost functions. Where should I start looking to see
>>>>> how to modify those?
>>>> 
>>>> Now, regarding modifying the LSR pass itself. The chain heuristics
>>>> are in isProfitableIncrement and isProfitableChain.
>>>> 
>>>> It should be able to handle the above loop. It might be more likely
>>>> to handle an unrolled version, but that's a matter of heuristics.
>>> 
>>> A few questions:
>>> 
>>> // Limit the number of chains to avoid quadratic behavior. We don't expect to
>>> // have more than a few IV increment chains in a loop. Missing a Chain falls
>>> // back to normal LSR behavior for those uses.
>>> static const unsigned MaxChains = 8;
>>> 
>>> So in my simple loop, are both x and y a chain (so those are two
>>> chains)? x[i+1] and y[i+1] should be additional members in the
>>> same two chains, right?
>> 
>> Yes.
>> 
>>> Does this MaxChains limit of 8 chains prevent the formation of more
>>> than 8 pointer-typed phi variables? PPC has 32 GPRs, so maybe 16
>>> is a more reasonable limit?
>> 
>> The current limit is conservative. 16 sounds reasonable.
>> 
>>> /// operator< - Choose the lower cost.
>>> bool Cost::operator<(const Cost &Other) const {
>>> if (NumRegs != Other.NumRegs)
>>>   return NumRegs < Other.NumRegs;
>>> if (AddRecCost != Other.AddRecCost)
>>>   return AddRecCost < Other.AddRecCost;
>>> if (NumIVMuls != Other.NumIVMuls)
>>>   return NumIVMuls < Other.NumIVMuls;
>>> if (NumBaseAdds != Other.NumBaseAdds)
>>>   return NumBaseAdds < Other.NumBaseAdds;
>>> if (ImmCost != Other.ImmCost)
>>>   return ImmCost < Other.ImmCost;
>>> if (SetupCost != Other.SetupCost)
>>>   return SetupCost < Other.SetupCost;
>>> return false;
>>> }
>>> 
>>> Is this the thing that I need to look at changing? This seems to be
>>> prioritizing for register pressure. I can imagine this was most
>>> important on x86, but maybe not exactly what I want. Maybe I need
>>> to prioritize NumBaseAdds, etc. over NumRegs if NumRegs is
>>> sufficiently small (say, < 16)?
>> 
>> No, this is the cost function for the formulae built by LSR. You
>> should look at the isProfitableChain… functions.
>> 
>> IVChains are collected before the LSR solver. They prune the LSR
>> search space so that it doesn't blow up on unrolled loops.
>> 
>> After LSR finds, or fails to find, a solution, it rewrites the IVs at
>> the head of the chain.
>> 
>> Then IVChain formation runs to update the increments within the
>> chains.
> 
> Okay. So we have two issues...
> 
> 1. First, we need to make sure appropriate chain candidates are generated; it is not clear to me that this is currently happening. As currently implemented, the pre-increment transformation adds pointer-typed phis. Does LSR ever currently do this?

That's how IV chains are supposed to work. See the test cases ivchain.ll, ivchain-X86, ivchain-ARM and other tests with the string [Ch]ain.

But the expression for the head of the chain is determined by LSR. It often chooses to reuse an integer index across multiple array accesses to save registers.

The problem you're having is that LSR will not form the pointer-typed phis if they don't already exist. The reason I didn't do that is that, if the loop has other uses of the chain's induction variable, LSR can find a better solution that reuses the register.

But it would be great if you add that feature. It can be done in CollectChains under a target hook. For example, when we do this:

  // Visit phi backedges to determine if the chain can generate the IV postinc.

We could instead check if we've found any profitable chains that are worth completing (maybe ones that have no other users of the IV?). Then we can check if an existing phi is present. If so, complete the chain by adding the phi (no change to current behavior). If not, you can create a phi based on the expression at the head and tail of the chain after checking that the tail dominates the loop header, and add the phi to the chain's tail (making a "complete" chain). The subsequent LSR solver will have no knowledge of this phi, which is not ideal, but at least it won't undo the chain.
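At the source level, completing a chain this way amounts to replacing the integer index with a pointer recurrence seeded one stride before the chain's head, so the in-loop increment can fold into the access. A minimal sketch of the before/after shapes (function names are mine, not anything in LSR):

```cpp
#include <cassert>
#include <cstddef>

// Integer-IV form: each access recomputes base + i. This is what the LSR
// solver may keep when the integer IV has other in-loop uses.
long sum_indexed(const long *base, size_t n) {
  long sum = 0;
  for (size_t i = 0; i < n; ++i)
    sum += base[i];
  return sum;
}

// "Completed chain" form: a pointer recurrence plays the role of the header
// phi. It starts one stride before the chain's head expression, so each
// iteration's increment can fold into a pre-increment access.
long sum_chained(const long *base, size_t n) {
  long sum = 0;
  const long *p = base - 1;  // phi start value: head expression minus one stride
  for (size_t i = 0; i < n; ++i)
    sum += *++p;             // increment folded into the access
  return sum;
}
```

Both forms compute the same value; the difference is only which recurrence the loop carries (an integer index vs. a pointer), which is what the new phi would express.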

High-level question: does LSR actually benefit your target? One option is to have a target hook that bypasses the LSR solver but allows IV chains to be formed.

> 2. Second, we need to reward those candidates that have foldable increments. Looking at isProfitableChain, can we detect pre-increment chains by looking for chains where the final element has I->IncExpr->isZero()? Maybe this should decrease the cost by 1 (to account for the now-missing post-increment add).

The decision between pre/post increment addressing is done in DAGCombine. IVChains may currently tend to expose post-increment addressing at the head and tail of the chain. You should adjust the chain to compensate if your target prefers pre-increment for some reason.

Incidentally, why are pre-increment loads superior? That's not normally the case. Does it simply reduce critical path for indirect loads? I would expect postinc to be better in case it guides the hardware prefetcher.
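For reference, the two disciplines differ only in when the pointer update happens relative to the access; a pre-increment walk needs its pointer biased back by one element before the loop. A minimal C++ sketch (names are illustrative, not from the patch):

```cpp
#include <cassert>
#include <cstddef>

// Post-increment walk: the access uses the current address, then the pointer
// is advanced (maps to a post-indexed load/store on targets that have one).
void copy_postinc(int *dst, const int *src, size_t n) {
  while (n--)
    *dst++ = *src++;
}

// Pre-increment walk: the pointer is advanced first, then used, so it must
// start one element before the data (maps to e.g. PPC's load-with-update).
void copy_preinc(int *dst, const int *src, size_t n) {
  int *d = dst - 1;
  const int *s = src - 1;
  while (n--)
    *++d = *++s;
}
```

Both loops copy the same elements; the choice affects only which address value feeds the memory operation and which one is written back to the pointer register.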

> 
>> 
>>> // A complete chain likely eliminates the need for keeping the original IV in
>>> // a register. LSR does not currently know how to form a complete chain unless
>>> // the header phi already exists.
>>> if (isa<PHINode>(Chain.tailUserInst())
>>>     && SE.getSCEV(Chain.tailUserInst()) == Chain.Incs[0].IncExpr) {
>>>   --cost;
>>> }
>>> 
>>> Maybe I want some extra logic here as well? PPC (like Hexagon) has
>>> hardware loops that can hide inner-loop induction variables in
>>> special registers in simple cases.
>> 
>> Hopefully any hardware loop representation doesn't change the IR. I
>> think it should only affect machine code. Right?
> 
> Correct. The IR is not affected, but nevertheless, it essentially 'eliminates' the original IV after-the-fact. Maybe that should be accounted for in the cost function?

That's why isProfitableChain has this check:

  // A complete chain likely eliminates the need for keeping the original IV in
  // a register. LSR does not currently know how to form a complete chain unless
  // the header phi already exists.
  if (isa<PHINode>(Chain.tailUserInst())
      && SE.getSCEV(Chain.tailUserInst()) == Chain.Incs[0].IncExpr) {
    --cost;
  }

>>> The TTI header has:
>>> /// isLegalAddressingMode - Return true if the addressing mode represented by
>>> /// AM is legal for this target, for a load/store of the specified type.
>>> /// The type may be VoidTy, in which case only return true if the addressing
>>> /// mode is legal for a load/store of any legal type.
>>> /// TODO: Handle pre/postinc as well.
>>> virtual bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV,
>>>                                    int64_t BaseOffset, bool HasBaseReg,
>>>                                    int64_t Scale) const;
>>> would I need to extend this to remove the TODO, and then tie that
>>> into the costing function somehow? Perhaps the problem in general
>>> is that LSR has no special handling of pre/postinc modes.
>> 
>> Yes. This is where it gets tricky. postinc formation depends on
>> multiple address users, so I'm not sure this API makes sense.
>> 
>> See DAGCombiner::CombineToPreIndexedLoadStore. It already uses a
>> separate target hook to do something similar.
> 
> Fair enough. Fundamentally, there needs to be an extension of the algorithm in order to deal with increment folding, and that does not currently exist.

The IVChain algorithm was meant to expose pre/post increment addressing, but not to depend on it (because it's actually irrelevant from the register pressure point of view, and we didn't really have another way to measure "goodness"). Adding more target information to guide the algorithm is the right thing to do.

>>> Anything else that you can think of?
>> 
>> I honestly don't know what the IVChain heuristics should look like.
>> Start with making sure the framework can handle your cases. Then if
>> you decide to modify the pass we have to figure out a way to keep
>> all the other targets happy.
> 
> If I modify LSR to do this, then when pre/post-increments are not available, I think that the algorithm must reduce to the current one.

Realistically, yes. I can see other targets taking some advantage of better pre/postinc addressing eventually though, which is why it's good to continue evolving the current pass.

-Andy

>> 
>>>> 
>>>> I haven't looked at how LSR handles vectorized loops yet… you may
>>>> find some interesting behavior.
>>> 
>>> Why would this be different from any other loops?
>> 
>> I don't know. If the induction variables are never typecast, then
>> I don't expect any fundamental problems. I'm always wary when IR
>> input to a pass changes significantly.
> 
> :)
> 
> Thanks again,
> Hal
> 
>> 
>> -Andy
>> 
>> 

More information about the llvm-commits mailing list