[PATCH] D49151: [SimplifyIndVar] Avoid generating truncate instructions with non-hoisted Load operand.
Abderrazek Zaafrani via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Fri Jul 13 15:55:46 PDT 2018
az updated this revision to Diff 155520.
az added a comment.
In https://reviews.llvm.org/D49151#1158061, @efriedma wrote:
> How does the code generation actually change on aarch64? As far as I can tell, you're basically just changing an "sxtw #2" to an "lsl #2"; does that save a uop on A72?
The unit test added in this patch does not show much in terms of performance by itself. It shows how SimplifyIndVar, without the patch, generates different code for two similar multiply instructions (%mul = ... and %mul1 = ...) that differ only in whether the first operand has been hoisted outside the loop. In the hoisted case, it widens the multiply, inserts a sign extend outside the loop, and removes the sign extend instruction that follows the mul. In the non-hoisted case, it adds a truncate instruction and leaves the sign extend after the multiply (a simplified IR sketch of the two forms follows the C example below). This patch makes SimplifyIndVar generate the same code for both cases, since the code produced for the hoisted case is more efficient and easier for later passes to work with. To see a performance improvement due to this patch, consider this C code:
struct info1 { int C; };
struct info2 { int data; };

void foo(struct info1* in, struct info2* out, int N, unsigned char* p) {
  int p0, p1, p2;
  for (int x = 1; x < N; ++x) {
    p0 = *(p + (x + 1) * in->C);
    p1 = *(p + (x - 1) * in->C);
    p2 = *(p + (x - 2) * in->C);
    out[N + x].data = p0 - p1 + p2;
  }
  return;
}
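For illustration, here is a minimal LLVM IR sketch of the two multiply forms described above, after the i32 induction variable has been widened to i64. The names and the reduced loop are hypothetical and are not taken from the patch's test file:

  ; Hoisted operand: the multiply is widened and the sign extension of the
  ; loop-invariant operand is placed outside the loop.
  define void @hoisted(i8* %p, i32 %c, i64 %n) {
  entry:
    %c.ext = sext i32 %c to i64
    br label %loop
  loop:
    %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
    %mul = mul nsw i64 %iv, %c.ext          ; widened multiply, no trunc in the loop
    %addr = getelementptr inbounds i8, i8* %p, i64 %mul
    %v = load i8, i8* %addr
    %iv.next = add nuw nsw i64 %iv, 1
    %cond = icmp slt i64 %iv.next, %n
    br i1 %cond, label %loop, label %exit
  exit:
    ret void
  }

  ; Non-hoisted operand: without the patch, the widened induction variable is
  ; truncated back to i32, multiplied, and sign extended again inside the loop.
  define void @non_hoisted(i8* %p, i32* %q, i64 %n) {
  entry:
    br label %loop
  loop:
    %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
    %c = load i32, i32* %q                  ; operand reloaded every iteration, so not hoisted
    %iv.trunc = trunc i64 %iv to i32
    %mul = mul nsw i32 %iv.trunc, %c        ; narrow multiply
    %mul.ext = sext i32 %mul to i64         ; sign extend left after the multiply
    %addr = getelementptr inbounds i8, i8* %p, i64 %mul.ext
    %v = load i8, i8* %addr
    %iv.next = add nuw nsw i64 %iv, 1
    %cond = icmp slt i64 %iv.next, %n
    br i1 %cond, label %loop, label %exit
  exit:
    ret void
  }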
Without the patch, here is the AArch64 assembly:
ldr w9, [x0]
add x11, x1, w2, sxtw #2
mov w12, w2
mov w8, wzr
add x11, x11, #4 // =4
neg w10, w9
sub x12, x12, #1 // =1
lsl w13, w9, #1
.LBB0_2: // %for.body
// =>This Inner Loop Header: Depth=1
add w15, w13, w8
ldrb w14, [x3, w8, sxtw]
ldrb w15, [x3, w15, sxtw]
add w16, w10, w8
ldrb w16, [x3, w16, sxtw]
add w8, w8, w9
subs x12, x12, #1 // =1
sub w14, w15, w14
add w14, w14, w16
str w14, [x11], #4
b.ne .LBB0_2
.LBB0_3: // %for.cond.cleanup
ret
With the patch, here is the generated AArch64 assembly:
ldrsw x8, [x0]
add x10, x1, w2, sxtw #2
mov w12, w2
sub x12, x12, #1 // =1
add x10, x10, #4 // =4
neg x9, x8
lsl x11, x8, #1
.LBB0_2: // %for.body
// =>This Inner Loop Header: Depth=1
ldrb w13, [x3, x11]
ldrb w15, [x3]
ldrb w14, [x3, x9]
add x3, x3, x8
sub w13, w13, w15
add w13, w13, w14
subs x12, x12, #1 // =1
str w13, [x10], #4
b.ne .LBB0_2
.LBB0_3: // %for.cond.cleanup
ret
There is a performance improvement with the patch because most of the values involved in computing the addresses of the ldrb instructions are now computed outside the loop. Without the patch, the redundant truncate and sign extend instructions that reach loop strength reduction, and in particular its induction variable rewriting, prevent that pass from generating the most efficient code.
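As a rough sketch (hypothetical IR, not taken from the patch), once the trunc/sext pair is gone, loop strength reduction can turn the x * in->C address computation into a simple pointer recurrence advanced by the loop-invariant stride, which is what the "add x3, x3, x8" in the loop above corresponds to:

  define void @lsr_form(i8* %p, i32 %c, i64 %n) {
  entry:
    %stride = sext i32 %c to i64            ; loop-invariant stride, computed once
    br label %loop
  loop:
    %ptr = phi i8* [ %p, %entry ], [ %ptr.next, %loop ]
    %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
    %v = load i8, i8* %ptr                  ; address is just the recurring pointer
    %ptr.next = getelementptr inbounds i8, i8* %ptr, i64 %stride
    %iv.next = add nuw nsw i64 %iv, 1
    %cond = icmp slt i64 %iv.next, %n
    br i1 %cond, label %loop, label %exit
  exit:
    ret void
  }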
https://reviews.llvm.org/D49151
Files:
llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
llvm/test/Transforms/IndVarSimplify/iv-widen-elim-ext.ll