[all-commits] [llvm/llvm-project] dc8a41: [ARM] Simplify address calculation for NEON load/s...

Thu Oct 14 05:26:24 PDT 2021

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: dc8a41de34933bc10c4d5d89c539dd0dc80d59cc
      https://github.com/llvm/llvm-project/commit/dc8a41de34933bc10c4d5d89c539dd0dc80d59cc
  Author: Andrew Savonichev <andrew.savonichev at gmail.com>
  Date:   2021-10-14 (Thu, 14 Oct 2021)

  Changed paths:
    M llvm/lib/Target/ARM/ARMISelLowering.cpp
    M llvm/test/CodeGen/ARM/alloc-no-stack-realign.ll
    A llvm/test/CodeGen/ARM/arm-post-indexing-opt.ll
    M llvm/test/CodeGen/ARM/fp16-vector-argument.ll
    M llvm/test/CodeGen/ARM/large-vector.ll
    M llvm/test/CodeGen/ARM/memcpy-inline.ll
    M llvm/test/CodeGen/ARM/memset-align.ll
    M llvm/test/CodeGen/ARM/misched-fusion-aes.ll
    M llvm/test/CodeGen/ARM/vector-load.ll
    M llvm/test/CodeGen/ARM/vext.ll
    M llvm/test/CodeGen/ARM/vselect_imax.ll
    M llvm/test/Transforms/LoopStrengthReduce/ARM/ivchain-ARM.ll

  Log Message:
  -----------
  [ARM] Simplify address calculation for NEON load/store

The patch attempts to optimize a sequence of SIMD loads from the same
base pointer:

    %0 = getelementptr float, float* %base, i32 4
    %1 = bitcast float* %0 to <4 x float>*
    %2 = load <4 x float>, <4 x float>* %1
    ...
    %n1 = getelementptr float, float* %base, i32 N
    %n2 = bitcast float* %n1 to <4 x float>*
    %n3 = load <4 x float>, <4 x float>* %n2

For AArch64 the compiler generates a sequence of instructions such as
LDR Qt, [Xn, #16]. However, 32-bit NEON VLD1/VST1 lack the [Rn, #imm]
addressing mode, so the address is computed before every load/store
instruction:

    add r2, r0, #32
    add r0, r0, #16
    vld1.32 {d18, d19}, [r2]
    vld1.32 {d22, d23}, [r0]

This can be improved by computing the address for the first load only,
and then using the post-indexed form of VLD1/VST1 for the rest:

    add r0, r0, #16
    vld1.32 {d18, d19}, [r0]!
    vld1.32 {d22, d23}, [r0]
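
The difference between the two sequences can be modelled with a small
C++ sketch. It is only an illustration of the address arithmetic, not
code from the patch: the Trace struct and both function names are
hypothetical, and the model assumes that the writeback form [r0]!
advances r0 by the 16 bytes transferred for a two-D-register VLD1.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical record of what a sequence did: the addresses it loaded
// from (in program order) and how many separate ADD instructions it used.
struct Trace {
  std::vector<std::uint32_t> loadAddrs;
  int addCount;
};

// Models the original sequence: one ADD per load, no writeback.
inline Trace separateAdds(std::uint32_t r0) {
  std::uint32_t r2 = r0 + 32;  // add r2, r0, #32
  r0 = r0 + 16;                // add r0, r0, #16
  // vld1.32 {d18, d19}, [r2]
  // vld1.32 {d22, d23}, [r0]
  return {{r2, r0}, 2};
}

// Models the improved sequence: one ADD, then a post-indexed load.
inline Trace postIndexed(std::uint32_t r0) {
  Trace t{{}, 0};
  r0 += 16;                    // add r0, r0, #16
  ++t.addCount;
  t.loadAddrs.push_back(r0);   // vld1.32 {d18, d19}, [r0]!
  r0 += 16;                    // writeback: r0 advances by 16 bytes
  t.loadAddrs.push_back(r0);   // vld1.32 {d22, d23}, [r0]
  return t;
}
```

For any starting value of r0, both models touch base+16 and base+32;
the second one does so with a single ADD. The order of the two loads
and the destination registers differ, as they do in the assembly above.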

In order to do that, the patch adds more patterns to DAGCombine:

  - (load (add ptr inc1)) and (add ptr inc2) are now folded if inc1
    and inc2 are constants.

  - (or ptr inc) is now recognized as a pointer increment if ptr is
    sufficiently aligned.
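
The arithmetic behind both patterns can be sketched in C++. This is a
simplified illustration, not the code in ARMISelLowering.cpp: the
helper names orActsAsAdd and postIncrement are hypothetical, and the
alignment test shown here (inc smaller than a power-of-two alignment)
is one assumed way to express "sufficiently aligned".

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns true when (ptr | inc) == (ptr + inc) for
// every ptr that is a multiple of `align`. A pointer aligned to a power
// of two has its low log2(align) bits clear, so an increment that fits
// entirely in those bits cannot produce a carry.
inline bool orActsAsAdd(std::uint32_t align, std::uint32_t inc) {
  assert(align != 0 && (align & (align - 1)) == 0 &&
         "align must be a power of two");
  return inc < align;
}

// Hypothetical helper: given a load from (ptr + loadOff) and another
// user computing (ptr + otherOff), returns the increment a post-indexed
// load would need so that the updated register equals ptr + otherOff.
inline std::int64_t postIncrement(std::int64_t loadOff,
                                  std::int64_t otherOff) {
  return otherOff - loadOff;
}
```

For example, with a 32-byte aligned ptr, (or ptr 16) can be treated as
(add ptr 16), while with only 16-byte alignment it cannot. For a load
at ptr+16 and a second address ptr+32, postIncrement(16, 32) is 16,
which matches the size of a 128-bit access.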

In addition to that, we now search for all possible base updates and
then pick the best one.

Differential Revision: https://reviews.llvm.org/D108988