[PATCH] [LoopVectorize] Teach Loop Vectorizer about interleaved memory access

Thu Apr 30 04:50:46 PDT 2015

Hi aschwaighofer, hfinkel, rengolin, delena, t.p.northover,

Hi,

Earlier this month I posted a patch in D8820 to teach the Loop Vectorizer about interleaved data access. I've made a lot of changes in response to the code review comments, and the new patch is attached. It identifies interleaved accesses and vectorizes them into "Loads/Stores + ShuffleVectors".
E.g. it can translate the following interleaved loads (if the vectorization factor is 4):
    for (i = 0; i < N; i+=3) {
        R = Pic[i];       // Load R color elements
        G = Pic[i+1];     // Load G color elements
        B = Pic[i+2];     // Load B color elements
        ... // do something to R, G, B
    }
Into
    %wide.vec = load <12 x i32>, <12 x i32>* %ptr               ; load for R,G,B
    %R.vec = shufflevector %wide.vec, undef, <0, 3, 6, 9>    ; mask for R load
    %G.vec = shufflevector %wide.vec, undef, <1, 4, 7, 10>  ; mask for G load
    %B.vec = shufflevector %wide.vec, undef, <2, 5, 8, 11>  ; mask for B load
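
The load masks above follow a simple pattern: member k of a group with stride S reads lanes k, k+S, k+2S, ... of the wide vector. A minimal sketch of that index computation, assuming a hypothetical helper named makeStridedMask (this name and signature are only for illustration and are not taken from the patch):

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper, for illustration only (not from the patch).
// Builds the shufflevector mask that extracts one member of an interleaved
// group from the wide load: lanes start, start+stride, start+2*stride, ...
std::vector<int> makeStridedMask(unsigned start, unsigned stride, unsigned vf) {
  assert(stride > 0 && start < stride && "start must name a member of the group");
  std::vector<int> mask;
  mask.reserve(vf);
  for (unsigned i = 0; i < vf; ++i)
    mask.push_back(static_cast<int>(start + i * stride));
  return mask;
}
```

With stride 3 and vectorization factor 4, makeStridedMask(0, 3, 4), makeStridedMask(1, 3, 4) and makeStridedMask(2, 3, 4) give the R, G and B masks shown in the IR above.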
Or it can translate the following interleaved stores (if the vectorization factor is 4):
     for (i = 0; i < N; i+=3) {
         ... // do something to R, G, B
         Pic[i] = R;       // Store R color elements
         Pic[i+1] = G;     // Store G color elements
         Pic[i+2] = B;     // Store B color elements
     }
Into
     %RG.vec = shufflevector %R.vec, %G.vec, <0, 1, 2, ..., 7>
     %BU.vec = shufflevector %B.vec, undef, <0, 1, 2, 3, u, u, u, u>
     %interleaved.vec = shufflevector %RG.vec, %BU.vec,
                 <0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11>  ; mask for interleaved store
     store <12 x i32> %interleaved.vec, <12 x i32>* %ptr         ; write for R,G,B
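
The store mask can be derived the same way. Once the R, G and B vectors are concatenated (with undef padding at the end), element `lane` of vector number `vec` sits at index vec * VF + lane, and the interleaved order visits all vectors for lane 0, then all vectors for lane 1, and so on. A minimal sketch, assuming a hypothetical helper named makeInterleaveMask (again only an illustration, not the code in the patch):

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper, for illustration only (not from the patch).
// Builds the shufflevector mask that interleaves numVecs vectors of vf lanes
// each, assuming they have been concatenated back to back as in the example.
std::vector<int> makeInterleaveMask(unsigned vf, unsigned numVecs) {
  assert(vf > 0 && numVecs > 0 && "need at least one lane and one vector");
  std::vector<int> mask;
  mask.reserve(vf * numVecs);
  for (unsigned lane = 0; lane < vf; ++lane)
    for (unsigned vec = 0; vec < numVecs; ++vec)
      mask.push_back(static_cast<int>(vec * vf + lane));
  return mask;
}
```

For the RGB example, makeInterleaveMask(4, 3) yields the 12-element mask used for %interleaved.vec above.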

This patch mainly does the following:
     (1) Identify interleaved accesses. (Some situations cannot be covered currently, so I've added a TODO.)
     (2) Transform the identified interleaved accesses into ShuffleVectors and Loads/Stores.
     (3) Add a new pass in the AArch64 backend to match interleaved loads/stores with stride 2, 3 or 4 to ldN/stN intrinsics.

I also added a new target hook to calculate the cost. (As I don't know much about other targets, I only estimated it roughly.) It can be improved to be more accurate.

For correctness, I've tested on an AArch64 target with LNT, EEMBC, SPEC2000 and SPEC2006, all of which pass.
For performance, as there are other issues that block many vectorization opportunities, I don't see obvious improvements. But some benchmarks such as EEMBC.RGBcmy and EEMBC.RGByiq are expected to improve substantially (6 times and 3 times respectively). I do also see that some loops are now vectorized.

Please review.

Thanks,
-Hao

http://reviews.llvm.org/D9368

Files:
  include/llvm/Analysis/TargetTransformInfo.h
  include/llvm/Analysis/TargetTransformInfoImpl.h
  lib/Analysis/LoopAccessAnalysis.cpp
  lib/Analysis/TargetTransformInfo.cpp
  lib/Target/AArch64/AArch64.h
  lib/Target/AArch64/AArch64ShuffleVectorAndMemAccessOpt.cpp
  lib/Target/AArch64/AArch64TargetMachine.cpp
  lib/Target/AArch64/AArch64TargetTransformInfo.cpp
  lib/Target/AArch64/AArch64TargetTransformInfo.h
  lib/Target/AArch64/CMakeLists.txt
  lib/Transforms/Vectorize/LoopVectorize.cpp
  test/CodeGen/AArch64/shuffle-access-opt.ll
  test/Transforms/LoopVectorize/AArch64/arbitrary-induction-step.ll
  test/Transforms/LoopVectorize/AArch64/interleaved-access.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D9368.24696.patch
Type: text/x-patch
Size: 90127 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150430/dd7f5483/attachment.bin>