[PATCH/RFC] Pre-increment preparation pass

Hal Finkel hfinkel at anl.gov
Tue Feb 5 20:52:30 PST 2013


----- Original Message -----
> From: "Andrew Trick" <atrick at apple.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "l >> Commit Messages and Patches for LLVM" <llvm-commits at cs.uiuc.edu>, "Jakob Stoklund Olesen" <stoklund at 2pi.dk>
> Sent: Tuesday, February 5, 2013 10:21:53 PM
> Subject: Re: [PATCH/RFC] Pre-increment preparation pass
> 
> 
> On Feb 5, 2013, at 6:29 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> > 
> >> The reason I didn't do that, is
> >> that if the loop has other uses of the chain's induction variable,
> >> it can find a better solution that reuses the register.
> > 
> > This is certainly reasonable. The specific issue on PPC is that
> > there is a post-isel/pre-ra pass that transfers simple inner-loop
> > IVs into special registers for use with special
> > decrement-compare-and-branch instructions. As a result, an IV
> > which can be made otherwise unused except for iteration counting
> > becomes essentially free. Hexagon should have a similar issue
> > (with the small difference that Hexagon has post-inc instead of
> > pre-inc instructions). Once this is accounted for, the
> > cost-benefit analysis changes.
> > 
> > Alternatively, as you mention, we can often eliminate the IV
> > entirely, transferring the comparison from using the iteration
> > count to using some final pointer value (for example).
> 
> Ok. I agree that should be fed into LSR's cost model somehow.
> 
> >> But it would be great if you add that feature. It can be done in
> >> CollectChains under a target hook. For example, when we do this:
> >> 
> >>  // Visit phi backedges to determine if the chain can generate the
> >>  // IV postinc.
> >> 
> >> We could instead check if we've found any profitable chains that are
> >> worth completing (maybe ones that have no other users of the IV?).
> >> Then we can check if an existing phi is present. If so, complete the
> >> chain by adding the phi (no change to current behavior). If not, you
> >> can create a phi based on the expression at the head and tail of the
> >> chain after checking that the tail dominates the loop header, and
> >> add the phi to the chain's tail (making a "complete" chain). The
> >> subsequent LSR solver will have no knowledge of this phi, which is
> >> not ideal, but at least it won't undo the chain.
> > 
> > Would this amount to speculatively adding the PHIs? I'll look at
> > this.
> 
> Yes, it might make more sense to defer actually creating the phi
> until GenerateIVChain, after LSR does its IV rewriting.
> 
> Currently, the head of the chain is still considered an IVUser by
> LSR's solver and will be rewritten. The easy thing might be to add
> it to IVIncSet so LSR totally ignores it, then directly generate a
> phi later that would only be used by that chain. The more
> sophisticated approach would be to force LSR's formulae to use a
> register holding the expression for the phi that we intend to
> create. If you do that, LSR's rewriter might then generate the phi.
> GenerateIVChain could then recognize the phi that has a matching
> expression, but rewrite the back edge to use the result of the last
> instruction in the chain.

It sounds like the "sophisticated approach" is the right one; do you agree?
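
So that we're talking about the same thing, here is a rough sketch of the
deferred variant (the helper name and most of the details below are mine,
not anything that exists in LSR today, and I'm glossing over the SCEV
expansion and type-cast issues):

  // Hypothetical helper, meant to be called from GenerateIVChain after
  // LSR's rewriting, when a chain is worth completing but no suitable
  // header phi exists. It speculatively creates a phi that is used only
  // by the chain itself.
  static PHINode *createChainPHI(const IVChain &Chain, Loop *L,
                                 SCEVExpander &Rewriter) {
    BasicBlock *Preheader = L->getLoopPreheader();
    BasicBlock *Latch = L->getLoopLatch();
    if (!Preheader || !Latch)
      return 0;

    // Start value: the chain's base expression, materialized in the
    // preheader.
    Type *Ty = Chain.Incs.front().IVOperand->getType();
    Value *Start = Rewriter.expandCodeFor(Chain.ExprBase, Ty,
                                          Preheader->getTerminator());

    PHINode *PN = PHINode::Create(Ty, 2, "chain.iv",
                                  L->getHeader()->getFirstNonPHI());
    PN->addIncoming(Start, Preheader);
    // Back edge: per your suggestion, the result of the last instruction
    // in the chain; exactly which value that should be (the final IV
    // operand or something recomputed) is a detail I still need to work
    // out.
    PN->addIncoming(Chain.Incs.back().IVOperand, Latch);
    return PN;
  }

If forcing a formula to use the corresponding register works out, then most
of this disappears: LSR's rewriter produces the phi itself, and
GenerateIVChain only needs to find it and redirect the back edge.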

> 
> There is a congruent phi cleanup that runs at the end of LSR which
> could help… or might hurt if it chooses the non-chained form of the
> phi instead. It can be controlled with heuristics to choose the
> "better" phi.

I'll look at this as well.

> 
> >> High level question: does LSR actually benefit your target? One
> >> option is to have a target hook that bypasses the LSR solver but
> >> allows IV chains to be formed.
> > 
> > I've not experimented with turning it off. PPC does not have scaled
> > addressing modes, so it does not benefit from that part of it, but
> > it may at least benefit from the simplifications of the exit
> > condition (if not other things).
> 
> It's partly a question of whether you tend to run out of registers on
> unrolled loops.

No; spilling is very expensive on the BG/Q, so we try hard to avoid it. That said, getting good performance often means covering a lot of latency, and that often requires running close to the register limit, so we do sometimes go over. Also, scientific codes sometimes contain manually unrolled loops, and I'd like to compile those well.

> 
> Simplifying the exit condition optimizes for compare against zero.
> That tends to free a register, but it might also help set up your
> hardware loops.

The current hardware-loops implementation depends on this.
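
To make the dependence concrete, the kind of loop the hardware-loops pass
wants is one where the counter feeds nothing but the exit test and that
test is a compare against zero. For example (illustration only, nothing
from the patch):

  // Once the addressing is folded into the memory accesses (ideally as
  // pre-increment forms on PPC), 'n' is used only by the exit test, so it
  // can be kept in the CTR register and the latch becomes a single
  // decrement-compare-and-branch.
  void scale(double *x, double a, long n) {
    while (n-- > 0)
      *x++ *= a;
  }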

> 
> >>> 2. Second, we need to reward those candidates that have foldable
> >>> increments. Looking at isProfitableChain, can we detect
> >>> pre-increment chains by looking for chains where the final element
> >>> has I->IncExpr->isZero()? Maybe this should decrease the cost by 1
> >>> (to account for the now-missing post-increment add).
> >> 
> >> The decision between pre/post increment addressing is done in
> >> DAGCombine. IVChains may currently tend to expose post-increment
> >> addressing at the head and tail of the chain. You should adjust the
> >> chain to compensate if your target prefers pre-increment for some
> >> reason.
> >> 
> >> Incidentally, why are pre-increment loads superior? That's not
> >> normally the case. Does it simply reduce critical path for indirect
> >> loads? I would expect postinc to be better in case it guides the
> >> hardware prefetcher.
> > 
> > Pre-increment loads are superior to post-increment loads on PPC
> > because PPC lacks post-increment loads -- it is not a design
> > decision that I would have made. ;)
> 
> Ok. That doesn't really explain why it's faster than an add->load.
> I'll guess it removes a cycle from the critical path due to a
> special address generation unit, as you hinted at in your response
> (fewer instructions).

This is exactly right; it saves a cycle.
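
(Coming back to the cost question from earlier in the thread: the tweak I
was imagining in isProfitableChain is roughly the following. The target
hook is made up, and whether one unit of cost is the right reward is
exactly the sort of thing I'd want target input on.)

  // Sketch only: reward a chain whose final increment folds away (IncExpr
  // is zero), since with pre-increment addressing the trailing add
  // disappears. prefersPreIncrementChains() is a hypothetical hook, not an
  // existing TargetTransformInfo method.
  if (!Chain.Incs.empty() && Chain.Incs.back().IncExpr->isZero() &&
      TTI.prefersPreIncrementChains())
    --cost;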

Thanks again,
Hal

> 
> -Andy
> 
> >> 
> >>> 
> >>>> 
> >>>>> // A complete chain likely eliminates the need for keeping the
> >>>>> // original IV in a register. LSR does not currently know how to
> >>>>> // form a complete chain unless the header phi already exists.
> >>>>> if (isa<PHINode>(Chain.tailUserInst())
> >>>>>     && SE.getSCEV(Chain.tailUserInst()) == Chain.Incs[0].IncExpr) {
> >>>>>   --cost;
> >>>>> }
> >>>>> 
> >>>>> Maybe I want some extra logic here as well? PPC (like Hexagon)
> >>>>> has hardware loops that can hide inner-loop induction variables
> >>>>> in special registers in simple cases.
> >>>> 
> >>>> Hopefully any hardware loop representation doesn't change the IR.
> >>>> I think it should only affect machine code. Right?
> >>> 
> >>> Correct. The IR is not affected, but nevertheless, it essentially
> >>> 'eliminates' the original IV after-the-fact. Maybe that should be
> >>> accounted for in the cost function?
> >> 
> >> That's why isProfitableChain has this check:
> >> 
> >>  // A complete chain likely eliminates the need for keeping the
> >>  // original IV in a register. LSR does not currently know how to
> >>  // form a complete chain unless the header phi already exists.
> >>  if (isa<PHINode>(Chain.tailUserInst())
> >>      && SE.getSCEV(Chain.tailUserInst()) == Chain.Incs[0].IncExpr) {
> >>    --cost;
> >>  }
> > 
> > Okay, sounds good.
> > 
> >> 
> >>>>> The TTI header has:
> >>>>> 
> >>>>> /// isLegalAddressingMode - Return true if the addressing mode
> >>>>> /// represented by AM is legal for this target, for a load/store
> >>>>> /// of the specified type.
> >>>>> /// The type may be VoidTy, in which case only return true if the
> >>>>> /// addressing mode is legal for a load/store of any legal type.
> >>>>> /// TODO: Handle pre/postinc as well.
> >>>>> virtual bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV,
> >>>>>                                    int64_t BaseOffset, bool HasBaseReg,
> >>>>>                                    int64_t Scale) const;
> >>>>> 
> >>>>> would I need to extend this to remove the TODO, and then tie that
> >>>>> into the costing function somehow? Perhaps the problem in general
> >>>>> is that LSR has no special handling of pre/postinc modes.
> >>>> 
> >>>> Yes. This is where it gets tricky. postinc formation depends on
> >>>> multiple address users, so I'm not sure this API makes sense.
> >>>> 
> >>>> See DAGCombiner::CombineToPreIndexedLoadStore. It already uses a
> >>>> separate target hook to do something similar.
> >>> 
> >>> Fair enough. Fundamentally, there needs to be an extension of the
> >>> algorithm in order to deal with increment folding, and that does
> >>> not currently exist.
> >> 
> >> The IVChain algorithm was meant to expose pre/post increment
> >> addressing, but not to depend on it (because it's actually
> >> irrelevant from the register pressure point of view, and we didn't
> >> really have another way to measure "goodness"). Adding more target
> >> information to guide the algorithm is the right thing to do.
> > 
> > Okay, that makes sense. In my case, my measure of goodness is a
> > decrease in the total number of instructions because that equals
> > fewer cycles.
> > 
> >> 
> >>>>> Anything else that you can think of?
> >>>> 
> >>>> I honestly don't know what the IVChain heuristics should look
> >>>> like. Start with making sure the framework can handle your cases.
> >>>> Then if you decide to modify the pass we have to figure out a way
> >>>> to keep all the other targets happy.
> >>> 
> >>> If I modify LSR to do this, then when pre/post-increments are not
> >>> available, I think that the algorithm must reduce to the current
> >>> one.
> >> 
> >> Realistically, yes. I can see other targets taking some advantage of
> >> better pre/postinc addressing eventually though, which is why it's
> >> good to continue evolving the current pass.
> > 
> > Sounds good. Thanks again,
> > Hal
> > 
> >> 
> >> -Andy
> >> 
> >>>> 
> >>>>>> 
> >>>>>> I haven't looked at how LSR handles vectorized loops yet… you
> >>>>>> may find some interesting behavior.
> >>>>> 
> >>>>> Why would this be different from any other loops?
> >>>> 
> >>>> I don't know. If the induction variables are never type casted,
> >>>> then I don't expect any fundamental problems. I'm always wary when
> >>>> IR input to a pass changes significantly.
> >>> 
> >>> :)
> >>> 
> >>> Thanks again,
> >>> Hal
> >>> 
> >>>> 
> >>>> -Andy
> >>>> 
> >>>> 
> >> 
> >> 
> 
> 



