[PATCH/RFC] Pre-increment preparation pass

Mon Feb 4 23:06:17 PST 2013

----- Original Message -----
> From: "Andrew Trick" <atrick at apple.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "l >> Commit Messages and Patches for LLVM" <llvm-commits at cs.uiuc.edu>, "Jakob Stoklund Olesen" <stoklund at 2pi.dk>
> Sent: Monday, February 4, 2013 10:52:30 PM
> Subject: Re: [PATCH/RFC] Pre-increment preparation pass
> 
> 
> On Feb 4, 2013, at 7:33 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> > 
> > If you don't mind, I'd appreciate some more specific advice. First,
> > is the current implementation of LSR capable of performing this
> > transformation:
> >>> for (int i = 0; i < N; ++i) {
> >>> x[i] = y[i]
> >>> }
> >>> needs to be transformed to look more like this:
> >>> T *a = x[-1], *b = y[-1];
> >>> for (int i = 0; i < N; ++i) {
> >>> *++a = *++b;
> >>> }
> > or is this what the "straight-line address-chain formation pass"
> > you imagine would do? If LSR can do this, what contributes to the
> > decision of whether or not it should be done? In some sense, this
> > is the most important part because this is what enables using the
> > pre-increment forms in the first place. Convincing LSR to
> > otherwise form the chains in unrolled loops seems to be a function
> > of the chain cost functions. Where should I start looking to see
> > how to modify those?
> 
> Now, regarding modifying the LSR pass itself. The chain heuristics
> are in isProfitableIncrement and isProfitableChain.
> 
> It should be able to handle the above loop. I might be more likely to
> handle an unrolled version, but that's a matter of heuristics.

A few questions:

  // Limit the number of chains to avoid quadratic behavior. We don't expect to
  // have more than a few IV increment chains in a loop. Missing a Chain falls
  // back to normal LSR behavior for those uses.
  static const unsigned MaxChains = 8;

So in my simple loop, are both x and y a chain (so those are two chains)? x[i+1] and y[i+1] should be additional members in the same two chains, right? Does this MaxChains limit of 8 chains prevent the formation of more than 8 pointer-types phi variables? PPC has 32 GPRs, so maybe 16 is a more reasonable limit?

/// operator< - Choose the lower cost.
bool Cost::operator<(const Cost &Other) const {
  if (NumRegs != Other.NumRegs)
    return NumRegs < Other.NumRegs;
  if (AddRecCost != Other.AddRecCost)
    return AddRecCost < Other.AddRecCost;
  if (NumIVMuls != Other.NumIVMuls)
    return NumIVMuls < Other.NumIVMuls;
  if (NumBaseAdds != Other.NumBaseAdds)
    return NumBaseAdds < Other.NumBaseAdds;
  if (ImmCost != Other.ImmCost)
    return ImmCost < Other.ImmCost;
  if (SetupCost != Other.SetupCost)
    return SetupCost < Other.SetupCost;
  return false;
}

Is this the thing that I need to look at changing? This seems to be prioritizing for register pressure. I can imagine this was most important on x86, but maybe not exactly what I want. Maybe I need to prioritize NumBaseAdds, etc. over NumRegs if NumRegs is sufficiently small (say, < 16)?

  // A complete chain likely eliminates the need for keeping the original IV in
  // a register. LSR does not currently know how to form a complete chain unless
  // the header phi already exists.
  if (isa<PHINode>(Chain.tailUserInst())
      && SE.getSCEV(Chain.tailUserInst()) == Chain.Incs[0].IncExpr) {
    --cost;
  }

Maybe I want some extra logic here as well? PPC (like Hexagon) has a hardware loops that can hide inner-loop induction variables in special registers in simple cases.

The TTI header has:
  /// isLegalAddressingMode - Return true if the addressing mode represented by
  /// AM is legal for this target, for a load/store of the specified type.
  /// The type may be VoidTy, in which case only return true if the addressing
  /// mode is legal for a load/store of any legal type.
  /// TODO: Handle pre/postinc as well.
  virtual bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV,
                                     int64_t BaseOffset, bool HasBaseReg,
                                     int64_t Scale) const;
would I need to extend this to remove the TODO, and then tie that into the costing function somehow? Perhaps the problem in general is that LSR has no special handling of pre/postinc modes.

Anything else that you can think of? 

> 
> I haven't looked at how LSR handles vectorized loops yet… you may
> find some interesting behavior.

Why would this be different from any other loops?

Thanks again,
Hal

> 
> -Andy