[PATCH] D108988: [ARM] Simplify address calculation for NEON load/store
Andrew Savonichev via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed Sep 8 11:49:48 PDT 2021
asavonic added a comment.
In D108988#2979176 <https://reviews.llvm.org/D108988#2979176>, @dmgreen wrote:
> My first thought was why can't this be handled by LSR, but I can see how that might not work very well trying to precisely match VLD offsets. And the tests you added have no loops :)
The patch is focused on sequential loads that have the same base
pointer and constant offsets, but it can also work if such a sequence
is in a loop body:
  #include <arm_neon.h>

  typedef float32x4_t v4f32;

  void test(float *a, float *b, int n) {
    for (int i = 0; i < n; ++i) {
      v4f32 A1 = vld1q_f32(a + 16 * i);
      v4f32 A2 = vld1q_f32(a + 16 * i + 4);
      v4f32 A3 = vld1q_f32(a + 16 * i + 8);
      v4f32 A4 = vld1q_f32(a + 16 * i + 12);
      vst1q_f32(b + 16 * i,      A1);
      vst1q_f32(b + 16 * i + 4,  A2);
      vst1q_f32(b + 16 * i + 8,  A3);
      vst1q_f32(b + 16 * i + 12, A4);
    }
  }
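For comparison, the straight-line case that the patch primarily targets
(and that the added tests exercise, with no loop involved) would look
roughly like the sketch below. This is only an illustration, not one of
the actual tests; the function name and offsets are made up:

  #include <arm_neon.h>

  /* Illustrative sketch only: consecutive loads/stores from the same
     base pointer with constant offsets, the kind of pattern that should
     become post-incremented vld1.32/vst1.32. */
  void test_straight_line(float *a, float *b) {
    float32x4_t A1 = vld1q_f32(a);
    float32x4_t A2 = vld1q_f32(a + 4);
    float32x4_t A3 = vld1q_f32(a + 8);
    float32x4_t A4 = vld1q_f32(a + 12);
    vst1q_f32(b,      A1);
    vst1q_f32(b + 4,  A2);
    vst1q_f32(b + 8,  A3);
    vst1q_f32(b + 12, A4);
  }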
LSR seems to handle only values that are loop induction variables, so
the constant offsets in the loop above are not optimized. The loop body
is compiled to:
  add lr, r0, r3
  subs r2, r2, #1
  mov r4, lr
  vld1.32 {d16, d17}, [r4], r12
  vld1.32 {d18, d19}, [r4]
  add r4, lr, #32 ; <-- extra address computation
  vld1.32 {d20, d21}, [r4]
  add r4, lr, #16 ; <--
  vld1.32 {d22, d23}, [r4]
  add r4, r1, r3
  add r5, r4, #16 ; <--
  add r3, r3, #64
  mov lr, r4
  add r4, r4, #32 ; <--
  vst1.32 {d16, d17}, [lr], r12
  vst1.32 {d22, d23}, [r5]
  vst1.32 {d20, d21}, [r4]
  vst1.32 {d18, d19}, [lr]
  bne .LBB0_2
In the first revision of this patch, ARMPostIndexingOpt was confused
by the GEP patterns produced by LSR. This is now fixed, and the
sequence is optimized to:
  add r3, r0, r12
  subs r2, r2, #1
  vld1.32 {d16, d17}, [r3]!
  vld1.32 {d18, d19}, [r3]!
  vld1.32 {d20, d21}, [r3]!
  vld1.32 {d22, d23}, [r3]
  add r3, r1, r12
  add r12, r12, #64
  vst1.32 {d16, d17}, [r3]!
  vst1.32 {d18, d19}, [r3]!
  vst1.32 {d20, d21}, [r3]!
  vst1.32 {d22, d23}, [r3]
  bne .LBB0_2
> (I also had some thoughts about whether this was useful in general, or if a sufficiently powerful cpu would break these into microops in either case, leading to the same performance in the end. But the code does look cleaner now, I can see how it would improve things)
I've measured the execution time of the loop above, and it is ~7%
faster on Cortex-A72. The numbers may differ on other hardware though.
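For anyone who wants to repeat the measurement, a minimal wall-clock
harness around the loop above could look like the sketch below. This is
not the exact benchmark used for the number above; N, REPS and the
clock source are arbitrary choices.

  #include <stdio.h>
  #include <time.h>

  #define N    1024     /* iterations of the inner loop per call */
  #define REPS 100000   /* benchmark repetitions */

  /* The test() loop from the example above; keep it in a separate
     translation unit so the calls are not optimized away. */
  void test(float *a, float *b, int n);

  static float a[16 * N], b[16 * N];

  int main(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPS; ++r)
      test(a, b, N);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%d calls in %f s\n", REPS, sec);
    return 0;
  }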
> The way we handled this in MVE was to "distribute" the increments in the ARMLoadStoreOptimizer pass. The instructions in MVE are different, and that does involve checking through Machine Instructions for Adds that can be better distributed into postinc instructions. LSR got it mostly right, DAG Combine did an OKish job most of the time, and we fixed up what went wrong later in the pipeline.
>
> It seems to have worked out OK as far as I can tell, is there a reason we can't do the same thing here?
I think the approach is still the same; the new pass just handles the
cases that LSR does not.
> Adding the new pass seems fine if we need it, but I'm less sanguine about having to disable a lot of Add folds in DAGCombiner.
Agreed, this is potentially the most problematic change. It is limited
to just (load/store (add)) patterns and applies only before
legalization, so this /hopefully/ reduces its impact to just the
patterns we need.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D108988/new/
https://reviews.llvm.org/D108988