[PATCH] D21363: Strided Memory Access Vectorization

Tue Jun 14 23:13:13 PDT 2016

ashutosh.nema created this revision.
ashutosh.nema added a subscriber: llvm-commits.
ashutosh.nema set the repository for this revision to rL LLVM.
Herald added subscribers: mzolotukhin, sanjoy.

<< This patch is to demonstrate the Strided Memory Access Vectorization >>

Currently LLVM does not support strided access vectorization completely, some support is available via Interleave vectorization.

There are two main overheads with strided vectorizations:
•	An overhead of consolidating data into an operable vector.
•	An overhead of distributing the data elements after the operations.

Because of these overheads LLVM finds strided memory vectorization is not profitable and generating serial code sequence over vectorized code.
GATHER & SCATTER instruction support is available on few(not all) targets for consolidation & distribution operations.

In this approach we are trying to handle cases like this:
for (int i = 0; i < len; i++)
     a[i*3] = b[i*2] + c[i*3];

We model strided memory load & store using shuffle & load/mask-store operations.
•	Load is modeled as loads followed by shuffle.
•	Store is modeled as shuffle followed by mask store.
•	To minimize load and store operation introduced ‘SkipFactor’.

‘SkipFactor’:
•	Multiple load operation required for consolidating data into an operable vector. 
•	Between various loads we skip by some offsets to effective consolidate.
•	SkipFactor is the number of additional offsets required to move from the previous vector load memory-end address for the next vector load.
•	SkipFactor allows all memory loads to be considered as identical (From valid element perspective).
•	SkipFactor = (Stride - (VF % Stride)) % Stride)

How LOAD is modeled:
Load stride 3 (i.e. load for b [ 3 * i ])
  %5 = getelementptr inbounds i32, i32* %b, i64 %.lhs
  %6 = bitcast i32* %5 to <4 x i32>*
  %stride.load27 = load <4 x i32>, <4 x i32>* %6, align 1, !tbaa !1
  %7 = getelementptr i32, i32* %5, i64 6 
  %8 = bitcast i32* %7 to <4 x i32>*
  %stride.load28 = load <4 x i32>, <4 x i32>* %8, align 1, !tbaa !1
  %strided.vec29 = shufflevector <4 x i32> %stride.load27, <4 x i32> %stride.load28, <4 x i32> <i32 0, i32 3, i32 4, i32 7>

How STORE is modeled:
Store with stride 3 (i.e. store to c [ 3 * i ])
 %10 = getelementptr inbounds i32, i32* %c, i64 %.lhs
  %11 = bitcast i32* %10 to <4 x i32>*
  %interleaved.vec = shufflevector <4 x i32> %StoreResult, <4 x i32> undef, <4 x i32> <i32 0, i32 undef, i32 undef, i32 1>
  call void @llvm.masked.store.v4i32(<4 x i32> %interleaved.vec, <4 x i32>* %11, i32 4, <4 x i1> <i1 true, i1 false, i1 false, i1 true>)
  %12 = getelementptr i32, i32* %10, i64 6
  %13 = bitcast i32* %12 to <4 x i32>*
  %interleaved.vec30 = shufflevector <4 x i32> %StoreResult, <4 x i32> undef, <4 x i32> <i32 2, i32 undef, i32 undef, i32 3>
  call void @llvm.masked.store.v4i32(<4 x i32> %interleaved.vec30, <4 x i32>* %13, i32 4, <4 x i1> <i1 true, i1 false, i1 false, i1 true>)

NOTE: For Stride 3 SkipFactor is 2, that’s why the second GEP offset by 6(4+2).

To enable this feature below LLVM changes required.
1)	Identify strided memory access (Its already there i.e. interleave vectorizer does this)
2)	Costing changes:
a.	Identify number of Load[s]/Store[s] & Shuffle[s] required to model Load/Store operation by considering SkipFactor.
b.	Return total cost by adding Load[s]/Store[s] and shuffle[s] costs.
3)	Transform:
a.	Generate Shuffle[s] followed by Mask-Store[s] instructions to model a Store operation.
b.	Generate Load[s] followed by Shuffle[s] instructions to model a Load operation.

Use below option to enable this feature:
“-mllvm -enable-interleaved-mem-accesses -mllvm -enable-strided-vectorization”

Gains observed with prototype:
TSVC kernel S111 		1.15x
TSVC kernel S1111 		1.42x

Thanks,
Ashutosh

Repository:
  rL LLVM

http://reviews.llvm.org/D21363

Files:
  include/llvm/Analysis/TargetTransformInfo.h
  include/llvm/Analysis/TargetTransformInfoImpl.h
  include/llvm/CodeGen/BasicTTIImpl.h
  lib/Analysis/TargetTransformInfo.cpp
  lib/Target/X86/X86ISelLowering.h
  lib/Target/X86/X86TargetTransformInfo.cpp
  lib/Target/X86/X86TargetTransformInfo.h
  lib/Transforms/Vectorize/LoopVectorize.cpp

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D21363.60792.patch
Type: text/x-patch
Size: 45614 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20160615/98724315/attachment.bin>