[PATCH/RFC] Pre-increment preparation pass

Hal Finkel hfinkel at anl.gov
Tue Feb 5 18:29:23 PST 2013


----- Original Message -----
> From: "Andrew Trick" <atrick at apple.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "l >> Commit Messages and Patches for LLVM" <llvm-commits at cs.uiuc.edu>, "Jakob Stoklund Olesen" <stoklund at 2pi.dk>
> Sent: Tuesday, February 5, 2013 7:14:40 PM
> Subject: Re: [PATCH/RFC] Pre-increment preparation pass
> 
> 
> On Feb 5, 2013, at 4:11 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> 
> > 
> > 
> > ----- Original Message -----
> >> From: "Andrew Trick" <atrick at apple.com>
> >> To: "Hal Finkel" <hfinkel at anl.gov>
> >> Cc: "l >> Commit Messages and Patches for LLVM"
> >> <llvm-commits at cs.uiuc.edu>, "Jakob Stoklund Olesen"
> >> <stoklund at 2pi.dk>
> >> Sent: Tuesday, February 5, 2013 12:33:29 PM
> >> Subject: Re: [PATCH/RFC] Pre-increment preparation pass
> >> 
> >> 
> >> On Feb 4, 2013, at 11:06 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> >> 
> >>> ----- Original Message -----
> >>>> From: "Andrew Trick" <atrick at apple.com>
> >>>> To: "Hal Finkel" <hfinkel at anl.gov>
> >>>> Cc: "l >> Commit Messages and Patches for LLVM"
> >>>> <llvm-commits at cs.uiuc.edu>, "Jakob Stoklund Olesen"
> >>>> <stoklund at 2pi.dk>
> >>>> Sent: Monday, February 4, 2013 10:52:30 PM
> >>>> Subject: Re: [PATCH/RFC] Pre-increment preparation pass
> >>>> 
> >>>> 
> >>>> On Feb 4, 2013, at 7:33 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> >>>>> 
> >>>>> If you don't mind, I'd appreciate some more specific advice.
> >>>>> First,
> >>>>> is the current implementation of LSR capable of performing this
> >>>>> transformation:
> >>>>>>> for (int i = 0; i < N; ++i) {
> >>>>>>>   x[i] = y[i];
> >>>>>>> }
> >>>>>>> needs to be transformed to look more like this:
> >>>>>>> T *a = &x[-1], *b = &y[-1];
> >>>>>>> for (int i = 0; i < N; ++i) {
> >>>>>>>   *++a = *++b;
> >>>>>>> }
> >>>>> or is this what the "straight-line address-chain formation
> >>>>> pass"
> >>>>> you imagine would do? If LSR can do this, what contributes to
> >>>>> the
> >>>>> decision of whether or not it should be done? In some sense,
> >>>>> this
> >>>>> is the most important part because this is what enables using
> >>>>> the
> >>>>> pre-increment forms in the first place. Convincing LSR to
> >>>>> otherwise form the chains in unrolled loops seems to be a
> >>>>> function
> >>>>> of the chain cost functions. Where should I start looking to
> >>>>> see
> >>>>> how to modify those?
> >>>> 
> >>>> Now, regarding modifying the LSR pass itself. The chain
> >>>> heuristics
> >>>> are in isProfitableIncrement and isProfitableChain.
> >>>> 
> >>>> It should be able to handle the above loop. I might be more
> >>>> likely
> >>>> to
> >>>> handle an unrolled version, but that's a matter of heuristics.
> >>> 
> >>> A few questions:
> >>> 
> >>> // Limit the number of chains to avoid quadratic behavior. We don't expect to
> >>> // have more than a few IV increment chains in a loop. Missing a Chain falls
> >>> // back to normal LSR behavior for those uses.
> >>> static const unsigned MaxChains = 8;
> >>> 
> >>> So in my simple loop, are both x and y a chain (so those are two
> >>> chains)? x[i+1] and y[i+1] should be additional members in the
> >>> same two chains, right?
> >> 
> >> Yes.
> >> 
> >>> Does this MaxChains limit of 8 chains prevent the formation of
> >>> more
> >>> than 8 pointer-types phi variables? PPC has 32 GPRs, so maybe 16
> >>> is a more reasonable limit?
> >> 
> >> The current limit is conservative. 16 sounds reasonable.
> >> 
> >>> /// operator< - Choose the lower cost.
> >>> bool Cost::operator<(const Cost &Other) const {
> >>> if (NumRegs != Other.NumRegs)
> >>>   return NumRegs < Other.NumRegs;
> >>> if (AddRecCost != Other.AddRecCost)
> >>>   return AddRecCost < Other.AddRecCost;
> >>> if (NumIVMuls != Other.NumIVMuls)
> >>>   return NumIVMuls < Other.NumIVMuls;
> >>> if (NumBaseAdds != Other.NumBaseAdds)
> >>>   return NumBaseAdds < Other.NumBaseAdds;
> >>> if (ImmCost != Other.ImmCost)
> >>>   return ImmCost < Other.ImmCost;
> >>> if (SetupCost != Other.SetupCost)
> >>>   return SetupCost < Other.SetupCost;
> >>> return false;
> >>> }
> >>> 
> >>> Is this the thing that I need to look at changing? This seems to
> >>> be
> >>> prioritizing for register pressure. I can imagine this was most
> >>> important on x86, but maybe not exactly what I want. Maybe I need
> >>> to prioritize NumBaseAdds, etc. over NumRegs if NumRegs is
> >>> sufficiently small (say, < 16)?
> >> 
> >> No, this is the cost function for the formulae built by LSR. You
> >> should look at the isProfitableChain… functions.
> >> 
> >> IVChains are collected before the LSR solver. They prune the LSR
> >> search space so that it doesn't blow up on unrolled loops.
> >> 
> >> After LSR finds, or fails to find a solution it rewrites the IVs
> >> at
> >> the head of the chain.
> >> 
> >> Then IVChain formation runs to update the increments within the
> >> chains.
> > 
> > Okay. So we have two issues...
> > 
> > 1. First, we need to make sure appropriate chain candidates are
> > generated, it is not clear to me that this is currently happening.
> > As currently implemented, the pre-increment transformation adds
> > pointer-typed phis. Does LSR ever currently do this?
> 
> That's how IV chains are supposed to work. See the test cases
> ivchain.ll, ivchain-X86, ivchain-ARM and other tests with the string
> [Ch]ain.
> 
> But the expression for the head of the chain is determined by LSR. It
> often chooses to reuse an integer index across multiple array
> accesses to save registers.
> 
> The problem you're having is that LSR will not form the pointer-type
> phis if they don't already exist.

Exactly.

>  The reason I didn't do that, is
> that if the loop has other uses of the chain's induction variable,
> it can find a better solution that reuses the register.

This is certainly reasonable. The specific issue on PPC is that a post-isel/pre-RA pass moves simple inner-loop IVs into a special counter register so that the loop can use the decrement-compare-and-branch instructions. As a result, an IV whose only remaining use is counting iterations becomes essentially free. Hexagon should have the same issue (with the small difference that Hexagon has post-increment rather than pre-increment instructions). Once this is accounted for, the cost-benefit analysis changes.

Alternatively, as you mention, we can often eliminate the IV entirely by rewriting the exit comparison to test against a final pointer value, for example, instead of the iteration count.
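
To make that concrete, here is roughly the shape I have in mind for the simple copy loop above (a hand-written sketch, not compiler output; it assumes N >= 1): the integer IV disappears, the pointer phis carry the addressing, and the exit test compares against a precomputed end pointer.

  T *a = &x[-1], *b = &y[-1];
  T *end = &y[N - 1];          /* final value of b after the last iteration */
  do {
    *++a = *++b;               /* pre-increment (update-form) accesses */
  } while (b != end);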

> 
> But it would be great if you add that feature. It can be done in
> CollectChains under a target hook. For example, when we do this:
> 
>   // Visit phi backedges to determine if the chain can generate the IV postinc.
> 
> We could instead check if we've found any profitable chains that are
> worth completing (maybe ones that have no other users of the IV?).
> Then we can check if an existing phi is present. If so, complete the
> chain by adding the phi (no change to current behavior). If not, you
> can create a phi based on the expression at the head and tail of the
> chain after checking that the tail dominates the loop header, and
> add the phi to the chain's tail (making a "complete" chain). The
> subsequent LSR solver will have no knowledge of this phi, which is
> not ideal, but at least it won't undo the chain.

Would this amount to speculatively adding the PHIs? I'll look at this.
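
To make sure I understand the shape of it, something like the following (all names here -- ChainStartExpr, ChainNextExpr, ChainPtrTy, Rewriter -- are mine, not the existing pass; this is only a sketch of the idea, not a patch): once a profitable chain is found, build the header phi and wire it to expansions of the chain's start and incremented expressions.

  // Sketch only: variable names are illustrative; Rewriter is assumed to be a
  // SCEVExpander usable at this point.
  BasicBlock *Header = L->getHeader();
  BasicBlock *Preheader = L->getLoopPreheader();
  BasicBlock *Latch = L->getLoopLatch();

  // New pointer-typed IV for the chain, placed with the other header phis.
  PHINode *ChainPhi = PHINode::Create(ChainPtrTy, 2, "lsr.chain.iv",
                                      Header->getFirstNonPHI());

  // Start value (e.g. &x[-1] in the example above) expanded in the preheader,
  // incremented value expanded at the latch.
  Value *Start = Rewriter.expandCodeFor(ChainStartExpr, ChainPtrTy,
                                        Preheader->getTerminator());
  Value *Next = Rewriter.expandCodeFor(ChainNextExpr, ChainPtrTy,
                                       Latch->getTerminator());
  ChainPhi->addIncoming(Start, Preheader);
  ChainPhi->addIncoming(Next, Latch);

As you say, the subsequent solver would know nothing about this phi; the point would only be to keep the chain from being undone.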

> 
> High level question: does LSR actually benefit your target? One
> option is to have a target hooks that bypasses the LSR solver but
> allows IV chains to be formed.

I've not experimented with turning it off. PPC does not have scaled addressing modes, so it does not benefit from that part of it, but it may at least benefit from the simplification of the exit condition (if nothing else).
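
For what it's worth, by "no scaled addressing modes" I mean that a PPC implementation of the isLegalAddressingMode hook quoted below would have to reject any non-trivial scale. A rough sketch (the class name and body are illustrative, not the in-tree code):

  bool PPCTTI::isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV,
                                     int64_t BaseOffset, bool HasBaseReg,
                                     int64_t Scale) const {
    // No (base + index*scale) hardware forms; only reg+imm and reg+reg.
    if (Scale != 0 && Scale != 1)
      return false;
    // ... remaining legality checks (immediate range, base register) elided ...
    return true;
  }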

> 
> > 2. Second, we need to reward those candidates that have foldable
> > increments. Looking at isProfitableChain, can we detect
> > pre-increment chains by looking for chains where the final element
> > has I->IncExpr->isZero()? Maybe this should decrease the cost by 1
> > (to account for the now-missing post-increment add).
> 
> The decision between pre/post increment addressing is done in
> DAGCombine. IVChains may currently tend to expose post-increment
> addressing at the head and tail of the chain. You should adjust the
> chain to compensate if your target prefers pre-increment for some
> reason.
> 
> Incidentally, why are pre-increment loads superior? That's not
> normally the case. Does it simply reduce critical path for indirect
> loads? I would expect postinc to be better in case it guides the
> hardware prefetcher.

Pre-increment loads are superior to post-increment loads on PPC because PPC lacks post-increment loads -- it is not a design decision that I would have made. ;)

> 
> > 
> >> 
> >>> // A complete chain likely eliminates the need for keeping the original IV in
> >>> // a register. LSR does not currently know how to form a complete chain unless
> >>> // the header phi already exists.
> >>> if (isa<PHINode>(Chain.tailUserInst())
> >>>     && SE.getSCEV(Chain.tailUserInst()) == Chain.Incs[0].IncExpr) {
> >>>   --cost;
> >>> }
> >>> 
> >>> Maybe I want some extra logic here as well? PPC (like Hexagon)
> >>> has
> >>> a hardware loops that can hide inner-loop induction variables in
> >>> special registers in simple cases.
> >> 
> >> Hopefully any hardware loop representation doesn't change the IR.
> >> I
> >> think it should only affect machine code. Right?
> > 
> > Correct. The IR is not affected, but nevertheless, it essentially
> > 'eliminates' the original IV after-the-fact. Maybe that should be
> > accounted for in the cost function?
> 
> That's why isProfitableChain has this check:
> 
>   // A complete chain likely eliminates the need for keeping the original IV in
>   // a register. LSR does not currently know how to form a complete chain unless
>   // the header phi already exists.
>   if (isa<PHINode>(Chain.tailUserInst())
>       && SE.getSCEV(Chain.tailUserInst()) == Chain.Incs[0].IncExpr) {
>     --cost;
>   }

Okay, sounds good.

> 
> >>> The TTI header has:
> >>> /// isLegalAddressingMode - Return true if the addressing mode represented by
> >>> /// AM is legal for this target, for a load/store of the specified type.
> >>> /// The type may be VoidTy, in which case only return true if the addressing
> >>> /// mode is legal for a load/store of any legal type.
> >>> /// TODO: Handle pre/postinc as well.
> >>> virtual bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV,
> >>>                                    int64_t BaseOffset, bool HasBaseReg,
> >>>                                    int64_t Scale) const;
> >>> would I need to extend this to remove the TODO, and then tie that
> >>> into the costing function somehow? Perhaps the problem in general
> >>> is that LSR has no special handling of pre/postinc modes.
> >> 
> >> Yes. This is where it gets tricky. postinc formation depends on
> >> multiple address users, so I'm not sure this API makes sense.
> >> 
> >> See DAGCombiner::CombineToPreIndexedLoadStore. It already uses a
> >> separate target hook to do something similar.
> > 
> > Fair enough. Fundamentally, there needs to be an extension of the
> > algorithm in order to deal with increment folding, and that does
> > not currently exist.
> 
> The IVChain algorithm was meant to expose pre/post increment
> addressing, but not to depend on it (because it's actually
> irrelevant from the register pressure point of view, and we didn't
> really have another way to measure "goodness"). Adding more target
> information to guide the algorithm is the right thing to do.

Okay, that makes sense. In my case, the measure of goodness is a decrease in the total number of instructions, since fewer instructions generally means fewer cycles.
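
As a strawman for where that target information might plug in (the hook name is invented; I'm only illustrating the shape of the adjustment discussed above): inside isProfitableChain, credit each increment that the target can fold into the memory access itself, since every folded increment removes one add from the loop body.

  // Sketch only: canFoldIncrementIntoAccess is a hypothetical TTI hook, not an
  // existing interface. Field names follow the current IVInc struct.
  for (unsigned i = 0, e = Chain.Incs.size(); i != e; ++i)
    if (TTI->canFoldIncrementIntoAccess(Chain.Incs[i].UserInst,
                                        Chain.Incs[i].IncExpr))
      --cost;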

> 
> >>> Anything else that you can think of?
> >> 
> >> I honestly don't know what the IVChain heuristics should look
> >> like.
> >> Start with making sure the framework can handle your cases. Then
> >> if
> >> you decide to modify the pass we have to figure out a way to keep
> >> all the other targets happy.
> > 
> > If I modify LSR to do this, then when pre/post-increments are not
> > available, I think that the algorithm must reduce to the current
> > one.
> 
> Realistically, yes. I can see other targets taking some advantage of
> better pre/postinc addressing eventually though, which is why it's
> good to continue evolving the current pass.

Sounds good. Thanks again,
Hal

> 
> -Andy
> 
> >> 
> >>>> 
> >>>> I haven't looked at how LSR handles vectorized loops yet… you
> >>>> may
> >>>> find some interesting behavior.
> >>> 
> >>> Why would this be different from any other loops?
> >> 
> >> I don't know. If the induction variables are never type casted,
> >> then
> >> I don't expect any fundamental problems. I'm always wary when IR
> >> input to a pass changes significantly.
> > 
> > :)
> > 
> > Thanks again,
> > Hal
> > 
> >> 
> >> -Andy
> >> 
> >> 
> 
> 



