[PATCH] D108988: [ARM] Simplify address calculation for NEON load/store
Andrew Savonichev via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed Sep 8 11:49:48 PDT 2021
asavonic added a comment.
In D108988#2979176 <https://reviews.llvm.org/D108988#2979176>, @dmgreen wrote:
> My first thought was why can't this be handled by LSR, but I can see how that might not work very well trying to precisely match VLD offsets. And the tests you added have no loops :)
The patch is focused on sequential loads that have the same base
pointer and constant offsets, but it can also work if such a sequence
is in a loop body:
  #include <arm_neon.h>

  typedef float32x4_t v4f32;

  void test(float *a, float *b, int n) {
    for (int i = 0; i < n; ++i) {
      v4f32 A1 = vld1q_f32(a + 16 * i);
      v4f32 A2 = vld1q_f32(a + 16 * i + 4);
      v4f32 A3 = vld1q_f32(a + 16 * i + 8);
      v4f32 A4 = vld1q_f32(a + 16 * i + 12);
      vst1q_f32(b + 16 * i,      A1);
      vst1q_f32(b + 16 * i + 4,  A2);
      vst1q_f32(b + 16 * i + 8,  A3);
      vst1q_f32(b + 16 * i + 12, A4);
    }
  }
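For comparison, the straight-line case that the patch primarily targets
(and that the added tests exercise, with no loop involved) would look
roughly like the sketch below. This is only an illustration, not one of
the actual tests; the function name and offsets are made up:

  #include <arm_neon.h>

  /* Illustrative sketch only: consecutive loads/stores from the same
     base pointer with constant offsets, the kind of pattern that should
     become post-incremented vld1.32/vst1.32. */
  void test_straight_line(float *a, float *b) {
    float32x4_t A1 = vld1q_f32(a);
    float32x4_t A2 = vld1q_f32(a + 4);
    float32x4_t A3 = vld1q_f32(a + 8);
    float32x4_t A4 = vld1q_f32(a + 12);
    vst1q_f32(b,      A1);
    vst1q_f32(b + 4,  A2);
    vst1q_f32(b + 8,  A3);
    vst1q_f32(b + 12, A4);
  }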
LSR seems to handle only values that are loop induction variables, so
the constant offsets in the loop above are not optimized. The loop body
is compiled to:
  add lr, r0, r3
  subs r2, r2, #1
  mov r4, lr
  vld1.32 {d16, d17}, [r4], r12
  vld1.32 {d18, d19}, [r4]
  add r4, lr, #32 ; <-- extra address computation
  vld1.32 {d20, d21}, [r4]
  add r4, lr, #16 ; <--
  vld1.32 {d22, d23}, [r4]
  add r4, r1, r3
  add r5, r4, #16 ; <--
  add r3, r3, #64
  mov lr, r4
  add r4, r4, #32 ; <--
  vst1.32 {d16, d17}, [lr], r12
  vst1.32 {d22, d23}, [r5]
  vst1.32 {d20, d21}, [r4]
  vst1.32 {d18, d19}, [lr]
  bne .LBB0_2
In the first revision of this patch, ARMPostIndexingOpt was confused
by the GEP patterns produced by LSR. This is now fixed, and the
sequence is optimized to:
  add r3, r0, r12
  subs r2, r2, #1
  vld1.32 {d16, d17}, [r3]!
  vld1.32 {d18, d19}, [r3]!
  vld1.32 {d20, d21}, [r3]!
  vld1.32 {d22, d23}, [r3]
  add r3, r1, r12
  add r12, r12, #64
  vst1.32 {d16, d17}, [r3]!
  vst1.32 {d18, d19}, [r3]!
  vst1.32 {d20, d21}, [r3]!
  vst1.32 {d22, d23}, [r3]
  bne .LBB0_2
> (I also had some thoughts about whether this was useful in general, or if a sufficiently powerful cpu would break these into microops in either case, leading to the same performance in the end. But the code does look cleaner now, I can see how it would improve things)
I've measured the execution time of the loop above, and it is ~7%
faster on Cortex-A72. The numbers may differ on other hardware though.
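For anyone who wants to repeat the measurement, a minimal wall-clock
harness around the loop above could look like the sketch below. This is
not the exact benchmark used for the number above; N, REPS and the
clock source are arbitrary choices.

  #include <stdio.h>
  #include <time.h>

  #define N    1024     /* iterations of the inner loop per call */
  #define REPS 100000   /* benchmark repetitions */

  /* The test() loop from the example above; keep it in a separate
     translation unit so the calls are not optimized away. */
  void test(float *a, float *b, int n);

  static float a[16 * N], b[16 * N];

  int main(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPS; ++r)
      test(a, b, N);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%d calls in %f s\n", REPS, sec);
    return 0;
  }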
> The way we handled this in MVE was to "distribute" the increments in the ARMLoadStoreOptimizer pass. The instructions in MVE are different, and that does involve checking through Machine Instructions for Adds that can be better distributed into postinc instructions. LSR got it mostly right, DAG Combine did an OKish job most of the time, and we fixed up what went wrong later in the pipeline.
>
> It seems to have worked out OK as far as I can tell, is there a reason we can't do the same thing here?
I think the approach is still the same; the new pass just handles the
cases that LSR does not.
> Adding the new pass seems fine if we need it, but I'm less sanguine about having to disable a lot of Add folds in DAGCombiner.
Agreed, this is potentially the most problematic change. It is limited
to just (load/store (add)) patterns and applies only before
legalization, so this /hopefully/ reduces its impact to just the
patterns we need.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D108988/new/
https://reviews.llvm.org/D108988