[PATCH/RFC] Pre-increment preparation pass

Andrew Trick atrick at apple.com
Mon Feb 4 18:05:32 PST 2013


On Jan 29, 2013, at 1:31 PM, Hal Finkel <hfinkel at anl.gov> wrote:

> Hello again,
> 
> When targeting PPC A2 cores, use of pre-increment loads and stores is very important for performance. The fundamental problem with the generation of pre-increment instructions is that most loops are not naturally written to take advantage of them. For example, a loop written as:
> for (int i = 0; i < N; ++i) {
>  x[i] = y[i];
> }
> needs to be transformed to look more like this:
> T *a = &x[-1], *b = &y[-1];
> for (int i = 0; i < N; ++i) {
>  *++a = *++b;
> }
> 
> I've attached a pass that performs this transformation. For unrolled loops (or other situations with multiple accesses to the same array), the lowest-offset use is transformed into the pre-increment access and the others are updated to base their addressing off of the pre-increment access. I schedule this pass after LSR is run.
> 
> This seems to work pretty well, but I don't know if this is the best way of doing what I'd like. Should this really be part of LSR somehow?
> 
> In case you're curious, for inner loops (where this really matters), the induction variable is often moved into the special loop counter register, and so removing the dependence of the addressing on the induction variable is also a good thing.
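
To make the unrolled-loop rebasing described above concrete, here is a
minimal sketch of the intended output (the element type, function name,
and unroll factor of 2 are illustrative assumptions, not taken from the
attached patch):

  // Sketch only: the unrolled body keeps one pre-incremented pointer
  // per array and folds the remaining access into a fixed offset from
  // it (N assumed to be even for brevity).
  void copy2(double *x, double *y, int N) {
    double *a = &x[-1], *b = &y[-1];
    for (int i = 0; i < N; i += 2) {
      *++a = *++b;   // lowest-offset use becomes the pre-increment access
      a[1] = b[1];   // the other use is rebased off the pre-inc address
      ++a; ++b;      // step past the rebased element
    }
  }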


LSR includes an IV chain analysis that finds pre-inc and post-inc
candidates (a small example of such a chain follows the list below).
It attempts to solve some difficult problems:

- Inform the regular LSR analysis so that it finds the best solution
  assuming that a single register will be reused along all links
  in an IV chain.

- Avoid introducing new induction variables (avoid new phis), in cases
  that involve sign/zero extension and truncation.

- Work around the fact that LSR has no concept of register liveness or
  locality of uses.

- Allow the analysis to scale so it doesn't incur noticeable overhead
  on very large complex loops.
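
To make "IV chain" concrete: instead of computing every address from
the induction variable, each access is rewritten as an increment of the
previous address, so a single pointer register is reused along the
chain. A hedged C-level sketch (names and strides are made up for
illustration):

  // Before chain formation, every address is a function of the IV:
  //   s = p[i] + p[i+4] + p[i+8];
  // After chain formation, one pointer is threaded through the uses:
  float sum3(const float *p) {
    float s = *p;        // head of the chain
    p += 4; s += *p;     // next link reuses the same register, bumped by 4
    p += 4; s += *p;     // and again
    return s;
  }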

Some problems with the current solution are:

- It only applies to recognizable induction variables in inner
  loops. In your case that may be acceptable. But in general, as a
  code size optimization, we should apply pre/post inc generation
  throughout.

- The heuristics are nearly impossible to get right for all
  loops. Some loops are unrolled to the point that they are perfectly
  register-bound, and "optimizing" addressing modes will destroy
  performance. The best choice of addressing mode requires knowledge
  of liveness and scheduling, which LSR does not have. To avoid doing
  harm, IV chains are currently formed very conservatively, only in
  the cases that clearly benefit: unrolled loops with a dynamic stride
  (a sketch of such a loop follows this list).

- Generating IV chains effectively preschedules your memory operations
  to the same objects. In practice, I didn't think it would be a big
  problem. But I could contrive a case that suffers from it.

- I've noticed cases where DAGCombine undoes the chains and fails to
  form pre/post-increment but haven't tracked down the reason.
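
For reference, the kind of loop the current heuristic does target looks
roughly like the following (purely illustrative; the names are made
up). With a runtime stride, folding the pointer bump into the access
saves per-iteration address arithmetic:

  // Illustrative only: an unrolled loop with a dynamic stride. Without
  // chains, each access recomputes i*stride; with chains, the pointer
  // is simply bumped by the stride between accesses.
  float sum_strided(const float *p, int stride, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i += 2) {   // n assumed even for brevity
      s += p[0];
      s += p[stride];
      p += 2 * stride;
    }
    return s;
  }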

My initial reaction to your patch is that you should be using IV
chains; the existing implementation already safely handles a variety
of loops with internal control flow. It should be possible to add
target hooks and heuristics so that it works well for bg/q.

Here are some possible next steps for improving pre/post inc
generation:

- Fix DAGCombine so that it preserves the IV chains formed at IR-level.

- Modify LSR to make use of target hooks to detect IV chains that will
  result in pre/post-inc ld/st formation. Use that information to
  guide heuristics so that we generate those chains in more cases,
  rather than purely attempting to reduce register pressure. Handle
  the cases that matter to you without regressing other
  targets. Possibly add some detection of common idioms if that makes
  it easier.

- Add a very simple straight-line address-chain formation pass after
  LSR to clean up simple ld/st sequences. This would need to form
  phis. It could probably also be done without SCEV.
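
As a rough sketch of what such a cleanup would aim for (hypothetical
code, not taken from any existing pass): straight-line loads from a
common base become a chain of small pointer bumps, which a target with
pre-increment loads can then fold:

  // Before: independent addresses computed from one base.
  //   a = base[0]; b = base[5]; c = base[10];
  // After: each address is an increment of the previous one.
  double sum_fields(const double *base) {
    const double *p = base;
    double a = *p;
    p += 5; double b = *p;   // candidate for a pre-increment load
    p += 5; double c = *p;
    return a + b + c;
  }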

-Andy



