[PATCH/RFC] Pre-increment preparation pass

Mon Feb 4 19:33:52 PST 2013

----- Original Message -----
> From: "Andrew Trick" <atrick at apple.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "l >> Commit Messages and Patches for LLVM" <llvm-commits at cs.uiuc.edu>, "Jakob Stoklund Olesen" <stoklund at 2pi.dk>
> Sent: Monday, February 4, 2013 8:05:32 PM
> Subject: Re: [PATCH/RFC] Pre-increment preparation pass
> 
> On Jan 29, 2013, at 1:31 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> 
> > Hello again,
> > 
> > When targeting PPC A2 cores, use of pre-increment loads and stores
> > is very important for performance. The fundamental problem with
> > the generation of pre-increment instructions is that most loops
> > are not naturally written to take advantage of them. For example,
> > a loop written as:
> > for (int i = 0; i < N; ++i) {
> >  x[i] = y[i]
> > }
> > needs to be transformed to look more like this:
> > T *a = x[-1], *b = y[-1];
> > for (int i = 0; i < N; ++i) {
> >  *++a = *++b;
> > }
> > 
> > I've attached a pass that performs this transformation. For
> > unrolled loops (or other situations with multiple accesses to the
> > same array), the lowest-offset use is transformed into the
> > pre-increment access and the others are updated to base their
> > addressing off of the pre-increment access. I schedule this pass
> > after LSR is run.
> > 
> > This seems to work pretty well, but I don't know if this is the
> > best way of doing what I'd like. Should this really be part of LSR
> > somehow?
> > 
> > In case you're curious, for inner loops (where this really
> > matters), the induction variable is often moved into the special
> > loop counter register, and so removing the dependence of the
> > addressing on the induction variable is also a good thing.
> 
> 
> LSR includes an IV chain analysis to that finds pre-inc and post-inc
> candidates. It attempts to solve some difficult problems:
> 
> - Inform the regular LSR analysis so that it finds the best solution
>   assuming that a single register will be reused along all links
>   in an IV chain.
> 
> - Avoid introducing new induction variables (avoid new phis), in
> cases
>   that involve sign/zero extension and truncation.
> 
> - Work around the fact that LSR has no concept of register liveness
> or
>   locality of uses.
> 
> - Allow the analysis to scale so it doesn't incur noticeable overhead
>   on very large complex loops.
> 
> Some problems with the current solution are:
> 
> - It only applies to recognizable induction variables in inner
>   loops. In your case that may be acceptable. But in general, as a
>   code size optimization, we should apply pre/post inc generation
>   throughout.

Agreed. And you're right, for performance this mostly matters in inner loops.

> 
> - The heuristics are nearly impossible to get right for all
>   loops. Some loops are unrolled to the point that they are perfectly
>   register-bound and "optimizing" addressing modes will destroy
>   performance. The best choice of addressing mode requires knowlege
>   of
>   liveness and scheduling, which LSR does not have. To avoid doing
>   harm,
>   currently IV chains are very conservative only clearly benefit
>   unrolled loops with dynamic stride.

I'll bet.

> 
> - Generating IV chains effectively preschedules your memory
> operations
>   to the same objects. In practice, I didn't think it would be a big
>   problem. But I could contrive a case that suffers from it.

Agreed; but in common cases, this seems to be exactly what I want.

> 
> - I've noticed cases where DAGCombine undoes the chains and fails to
>   form pre/post-increment but haven't tracked down the reason.

Yes. I've started looking at this as well.

> 
> My initial reaction to your patch is that you should be using IV
> chains, which is a complete implementation in terms of safely
> handling
> a variety of loops with internal control flow. It should be possible
> to add target hooks and heuristics to work well for bg/q.

Okay, great.

> 
> Here are some possible next steps for improving pre/post inc
> generation:
> 
> - Fix DAGCombine so that it preserves the IV chains formed at
> IR-level.

If you're talking about what I think you're talking about, at least for the case where all of the offsets are constants, I've already worked on this. I have a patch on the list, see my e-mail titled, "Constant folding around pre-increment loads and stores." This does not generally prevent pre-increment formation, but before this fix, makes the result less useful.

> 
> - Modify LSR to make use of target hooks to detect IV chains that
> will
>   result in pre/post-inc ld/st formation. Use that information to
>   guide heuristics so that we generate those chains in more cases,
>   rather than purely attempting to reduce register pressure. Handle
>   the cases that matter to you without regressing other
>   targets. Possibly add some detection of common idioms if that makes
>   it easier.
> 
> - Add very simple straight-line address-chain formation pass after
> LSR
>   to cleanup simple ld/st sequences. This would need to form phis. It
>   also probably could be done without SCEV.

If you don't mind, I'd appreciate some more specific advice. First, is the current implementation of LSR capable of performing this transformation:
> > for (int i = 0; i < N; ++i) {
> >  x[i] = y[i]
> > }
> > needs to be transformed to look more like this:
> > T *a = x[-1], *b = y[-1];
> > for (int i = 0; i < N; ++i) {
> >  *++a = *++b;
> > }
or is this what the "straight-line address-chain formation pass" you imagine would do? If LSR can do this, what contributes to the decision of whether or not it should be done? In some sense, this is the most important part because this is what enables using the pre-increment forms in the first place. Convincing LSR to otherwise form the chains in unrolled loops seems to be a function of the chain cost functions. Where should I start looking to see how to modify those?

Thanks again,
Hal

> 
> -Andy
> 
>