[PATCH] D49151: [SimplifyIndVar] Avoid generating truncate instructions with non-hoisted Load operand.
Abderrazek Zaafrani via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Fri Jul 13 15:55:46 PDT 2018
az updated this revision to Diff 155520.
az added a comment.
In https://reviews.llvm.org/D49151#1158061, @efriedma wrote:
> How does the code generation actually change on aarch64? As far as I can tell, you're basically just changing an "sxtw #2" to an "lsl #2"; does that save a uop on A72?
The unit test added in this patch does not show much in terms of performance by itself. It shows how SimplifyIndVar, without the patch, generates different code for two similar multiply instructions (%mul = ... and %mul1 = ...) that differ only in whether the first operand has been hoisted outside the loop. In the hoisted case, it widens the multiply, inserts a sign extend outside the loop, and removes the sign extend instruction that follows the mul. In the non-hoisted case, it adds a truncate instruction and leaves the sign extend after the multiply (a simplified IR sketch of the two forms follows the C example below). This patch makes SimplifyIndVar generate the same code for both cases, since the code produced for the hoisted case is more efficient and easier for later passes to work with. To see a performance improvement due to this patch, consider this C code:
struct info1 { int C; };
struct info2 { int data; };

void foo(struct info1* in, struct info2* out, int N, unsigned char* p) {
  int p0, p1, p2;
  for (int x = 1; x < N; ++x) {
    p0 = *(p + (x + 1) * in->C);
    p1 = *(p + (x - 1) * in->C);
    p2 = *(p + (x - 2) * in->C);
    out[N + x].data = p0 - p1 + p2;
  }
  return;
}
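For illustration, here is a minimal LLVM IR sketch of the two multiply forms described above, after the i32 induction variable has been widened to i64. The names and the reduced loop are hypothetical and are not taken from the patch's test file:

  ; Hoisted operand: the multiply is widened and the sign extension of the
  ; loop-invariant operand is placed outside the loop.
  define void @hoisted(i8* %p, i32 %c, i64 %n) {
  entry:
    %c.ext = sext i32 %c to i64
    br label %loop
  loop:
    %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
    %mul = mul nsw i64 %iv, %c.ext          ; widened multiply, no trunc in the loop
    %addr = getelementptr inbounds i8, i8* %p, i64 %mul
    %v = load i8, i8* %addr
    %iv.next = add nuw nsw i64 %iv, 1
    %cond = icmp slt i64 %iv.next, %n
    br i1 %cond, label %loop, label %exit
  exit:
    ret void
  }

  ; Non-hoisted operand: without the patch, the widened induction variable is
  ; truncated back to i32, multiplied, and sign extended again inside the loop.
  define void @non_hoisted(i8* %p, i32* %q, i64 %n) {
  entry:
    br label %loop
  loop:
    %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
    %c = load i32, i32* %q                  ; operand reloaded every iteration, so not hoisted
    %iv.trunc = trunc i64 %iv to i32
    %mul = mul nsw i32 %iv.trunc, %c        ; narrow multiply
    %mul.ext = sext i32 %mul to i64         ; sign extend left after the multiply
    %addr = getelementptr inbounds i8, i8* %p, i64 %mul.ext
    %v = load i8, i8* %addr
    %iv.next = add nuw nsw i64 %iv, 1
    %cond = icmp slt i64 %iv.next, %n
    br i1 %cond, label %loop, label %exit
  exit:
    ret void
  }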
Without the patch, here is the AArch64 assembly:
ldr w9, [x0]
add x11, x1, w2, sxtw #2
mov w12, w2
mov w8, wzr
add x11, x11, #4 // =4
neg w10, w9
sub x12, x12, #1 // =1
lsl w13, w9, #1
.LBB0_2: // %for.body
// =>This Inner Loop Header: Depth=1
add w15, w13, w8
ldrb w14, [x3, w8, sxtw]
ldrb w15, [x3, w15, sxtw]
add w16, w10, w8
ldrb w16, [x3, w16, sxtw]
add w8, w8, w9
subs x12, x12, #1 // =1
sub w14, w15, w14
add w14, w14, w16
str w14, [x11], #4
b.ne .LBB0_2
.LBB0_3: // %for.cond.cleanup
ret
With the patch, here is the generated AArch64 assembly:
ldrsw x8, [x0]
add x10, x1, w2, sxtw #2
mov w12, w2
sub x12, x12, #1 // =1
add x10, x10, #4 // =4
neg x9, x8
lsl x11, x8, #1
.LBB0_2: // %for.body
// =>This Inner Loop Header: Depth=1
ldrb w13, [x3, x11]
ldrb w15, [x3]
ldrb w14, [x3, x9]
add x3, x3, x8
sub w13, w13, w15
add w13, w13, w14
subs x12, x12, #1 // =1
str w13, [x10], #4
b.ne .LBB0_2
.LBB0_3: // %for.cond.cleanup
ret
There is a performance improvement with the patch because most of the values involved in computing the addresses of the ldrb instructions are now computed outside the loop. Without the patch, the redundant truncate and sign extend instructions that reach loop strength reduction, and in particular its induction variable rewriting, prevent that pass from generating the most efficient code.
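As a rough sketch (hypothetical IR, not taken from the patch), once the trunc/sext pair is gone, loop strength reduction can turn the x * in->C address computation into a simple pointer recurrence advanced by the loop-invariant stride, which is what the "add x3, x3, x8" in the loop above corresponds to:

  define void @lsr_form(i8* %p, i32 %c, i64 %n) {
  entry:
    %stride = sext i32 %c to i64            ; loop-invariant stride, computed once
    br label %loop
  loop:
    %ptr = phi i8* [ %p, %entry ], [ %ptr.next, %loop ]
    %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
    %v = load i8, i8* %ptr                  ; address is just the recurring pointer
    %ptr.next = getelementptr inbounds i8, i8* %ptr, i64 %stride
    %iv.next = add nuw nsw i64 %iv, 1
    %cond = icmp slt i64 %iv.next, %n
    br i1 %cond, label %loop, label %exit
  exit:
    ret void
  }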
https://reviews.llvm.org/D49151
Files:
llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
llvm/test/Transforms/IndVarSimplify/iv-widen-elim-ext.ll