[PATCH/RFC] Pre-increment preparation pass

Tue Feb 5 10:33:29 PST 2013

On Feb 4, 2013, at 11:06 PM, Hal Finkel <hfinkel at anl.gov> wrote:

> ----- Original Message -----
>> From: "Andrew Trick" <atrick at apple.com>
>> To: "Hal Finkel" <hfinkel at anl.gov>
>> Cc: "l >> Commit Messages and Patches for LLVM" <llvm-commits at cs.uiuc.edu>, "Jakob Stoklund Olesen" <stoklund at 2pi.dk>
>> Sent: Monday, February 4, 2013 10:52:30 PM
>> Subject: Re: [PATCH/RFC] Pre-increment preparation pass
>> 
>> 
>> On Feb 4, 2013, at 7:33 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>> 
>>> If you don't mind, I'd appreciate some more specific advice. First,
>>> is the current implementation of LSR capable of performing this
>>> transformation:
>>>>> for (int i = 0; i < N; ++i) {
>>>>> x[i] = y[i]
>>>>> }
>>>>> needs to be transformed to look more like this:
>>>>> T *a = x[-1], *b = y[-1];
>>>>> for (int i = 0; i < N; ++i) {
>>>>> *++a = *++b;
>>>>> }
>>> or is this what the "straight-line address-chain formation pass"
>>> you imagine would do? If LSR can do this, what contributes to the
>>> decision of whether or not it should be done? In some sense, this
>>> is the most important part because this is what enables using the
>>> pre-increment forms in the first place. Convincing LSR to
>>> otherwise form the chains in unrolled loops seems to be a function
>>> of the chain cost functions. Where should I start looking to see
>>> how to modify those?
>> 
>> Now, regarding modifying the LSR pass itself. The chain heuristics
>> are in isProfitableIncrement and isProfitableChain.
>> 
>> It should be able to handle the above loop. I might be more likely to
>> handle an unrolled version, but that's a matter of heuristics.
> 
> A few questions:
> 
>  // Limit the number of chains to avoid quadratic behavior. We don't expect to
>  // have more than a few IV increment chains in a loop. Missing a Chain falls
>  // back to normal LSR behavior for those uses.
>  static const unsigned MaxChains = 8;
> 
> So in my simple loop, are both x and y a chain (so those are two chains)? x[i+1] and y[i+1] should be additional members in the same two chains, right?

Yes.

> Does this MaxChains limit of 8 chains prevent the formation of more than 8 pointer-types phi variables? PPC has 32 GPRs, so maybe 16 is a more reasonable limit?

The current limit is conservative. 16 sounds reasonable.

> /// operator< - Choose the lower cost.
> bool Cost::operator<(const Cost &Other) const {
>  if (NumRegs != Other.NumRegs)
>    return NumRegs < Other.NumRegs;
>  if (AddRecCost != Other.AddRecCost)
>    return AddRecCost < Other.AddRecCost;
>  if (NumIVMuls != Other.NumIVMuls)
>    return NumIVMuls < Other.NumIVMuls;
>  if (NumBaseAdds != Other.NumBaseAdds)
>    return NumBaseAdds < Other.NumBaseAdds;
>  if (ImmCost != Other.ImmCost)
>    return ImmCost < Other.ImmCost;
>  if (SetupCost != Other.SetupCost)
>    return SetupCost < Other.SetupCost;
>  return false;
> }
> 
> Is this the thing that I need to look at changing? This seems to be prioritizing for register pressure. I can imagine this was most important on x86, but maybe not exactly what I want. Maybe I need to prioritize NumBaseAdds, etc. over NumRegs if NumRegs is sufficiently small (say, < 16)?

No, this is the cost function for the formulae built by LSR. You should look at the isProfitableChain… functions.

IVChains are collected before the LSR solver. They prune the LSR search space so that it doesn't blow up on unrolled loops.

After LSR finds, or fails to find a solution it rewrites the IVs at the head of the chain.

Then IVChain formation runs to update the increments within the chains.

>  // A complete chain likely eliminates the need for keeping the original IV in
>  // a register. LSR does not currently know how to form a complete chain unless
>  // the header phi already exists.
>  if (isa<PHINode>(Chain.tailUserInst())
>      && SE.getSCEV(Chain.tailUserInst()) == Chain.Incs[0].IncExpr) {
>    --cost;
>  }
> 
> Maybe I want some extra logic here as well? PPC (like Hexagon) has a hardware loops that can hide inner-loop induction variables in special registers in simple cases.

Hopefully any hardware loop representation doesn't change the IR. I think it should only affect machine code. Right?

> The TTI header has:
>  /// isLegalAddressingMode - Return true if the addressing mode represented by
>  /// AM is legal for this target, for a load/store of the specified type.
>  /// The type may be VoidTy, in which case only return true if the addressing
>  /// mode is legal for a load/store of any legal type.
>  /// TODO: Handle pre/postinc as well.
>  virtual bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV,
>                                     int64_t BaseOffset, bool HasBaseReg,
>                                     int64_t Scale) const;
> would I need to extend this to remove the TODO, and then tie that into the costing function somehow? Perhaps the problem in general is that LSR has no special handling of pre/postinc modes.

Yes. This is where it gets tricky. postinc formation depends on multiple address users, so I'm not sure this API makes sense.

See DAGCombiner::CombineToPreIndexedLoadStore. It already uses a separate target hook to do something similar.

> Anything else that you can think of? 

I honestly don't know what the IVChain heuristics should look like. Start with making sure the framework can handle your cases. Then if you decide to modify the pass we have to figure out a way to keep all the other targets happy.

>> 
>> I haven't looked at how LSR handles vectorized loops yet… you may
>> find some interesting behavior.
> 
> Why would this be different from any other loops?

I don't know. If the induction variables are never type casted, then I don't expect any fundamental problems. I'm always wary when IR input to a pass changes significantly.

-Andy