[PATCH] D21363: Strided Memory Access Vectorization

Ashutosh Nema via llvm-commits llvm-commits at lists.llvm.org
Tue Jun 21 04:17:56 PDT 2016


ashutosh.nema added a comment.

Thanks Ayal for looking into this RFC.

> The performance improvements are relative to scalar,
>  or relative to vectorizing with scalar loads/stores packed into vectors?


The performance improvements are relative to scalar.

> For your 1-line example above the latter may not be better,
>  but for more computationally intensive loops it might.


Agree, with more compute in the loop the performance gain will be much better.
I kept the example minimal just to demonstrate this feature.


================
Comment at: include/llvm/Analysis/TargetTransformInfo.h:531-542
@@ -530,2 +530,14 @@
 
+  /// \return The cost of the strided memory operation.
+  /// \p Opcode is the memory operation code
+  /// \p VecTy is the vector type of the interleaved access.
+  /// \p Factor is the interleave factor
+  /// \p Indices is the indices for interleaved load members (as interleaved
+  ///    load allows gaps)
+  /// \p Alignment is the alignment of the memory operation
+  /// \p AddressSpace is address space of the pointer.
+  int getStridedMemoryOpCost(unsigned Opcode, Type *VecTy, unsigned Factor,
+                             ArrayRef<unsigned> Indices, unsigned Alignment,
+                             unsigned AddressSpace) const;
+
   /// \brief Calculate the cost of performing a vector reduction.
----------------
Ayal wrote:
> Why not simply call getInterleavedMemoryOpCost() with a single Index of 0 instead of specializing it to getStridedMemoryOpCost()?
> 
> Yes, we need to implement it for x86.
The approach to computing the cost is different: getInterleavedMemoryOpCost considers wide load[s] & store[s] covering all interleaved elements and costs those,
whereas we minimise the number of load[s] & store[s] and compute the cost from those.
It may not be a good idea to merge them, as the approach to computing the cost itself is different.
I'm open to it though; if people still feel it's good to merge them, I will do it.
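
To illustrate the difference (VF 4 & stride 3, elements at offsets 0, 3, 6, 9; based on Ayal's example further below):

  Interleaved costing: 3 consecutive wide loads {0-3}{4-7}{8-11},
                       then extract the strided lanes.
  Strided costing:     2 loads {0,3,*,*}{6,9,*,*}, each aligned at its
                       first strided element, plus one combining shuffle.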


================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:50
@@ +49,3 @@
+/// Offset 0, 3, 6, 9 required to fill vector register.
+/// So 2 vector load will be requied.
+/// NOTE: It assumes all iteration for a given stride holds common memory
----------------
Ayal wrote:
> ashutosh.nema wrote:
> > delena wrote:
> > > But it depends on element size..
> > I did not understand this comment completely.
> > 
> > Are you pointing out that the vectorizer can go above the target-supported width?
> > I.e.
> > double foo(double *A, double *B, int n) {
> >   double sum = 0;
> > #pragma clang loop vectorize_width(16)
> >   for (int i = 0; i < n; ++i)
> >     sum += A[i] + 5;
> >   return sum;
> > }
> requi[r]ed
> 
> This is counting how many vectors of VF-consecutive-elements-each are needed to cover a set of VF strided elements. To match the number of target loads or stores, the element size would need to be such that VF elements fit in a simd register.
This identifies how many VF-wide vector[s] are required to cover a set of VF strided elements.
Agreed, there is a possibility that a VF-wide vector may not fit the target SIMD register.
Currently this is left to costing, which is not good: in some cases costing falls through to
the basic TTI and returns a cost which may not be proper for the hardware.

This should be handled correctly by considering target support.
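
For reference, a minimal sketch of the counting logic (my reconstruction from the VF 4 & stride 3 example above; the helper names match the patch, but the arithmetic here is illustrative):

  static unsigned getValidMemInSingleVFAccess(unsigned VF, unsigned Stride) {
    // Strided offsets 0, Stride, 2*Stride, ... that land within the first
    // VF lanes; e.g. VF 4, stride 3 -> offsets 0 and 3 -> 2 valid elements.
    return (VF - 1) / Stride + 1;
  }

  static unsigned getRequiredLoadStore(unsigned VF, unsigned Stride) {
    unsigned ValidElements = getValidMemInSingleVFAccess(VF, Stride);
    // Round up: VF strided elements, gathered ValidElements per memory op;
    // e.g. VF 4, stride 3 -> ceil(4 / 2) = 2 vector loads.
    return (VF + ValidElements - 1) / ValidElements;
  }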

================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:61
@@ +60,3 @@
+/// shuffle.
+static unsigned getShuffleRequiredForPromotion(unsigned VF, unsigned Stride) {
+  unsigned ValidElements = getValidMemInSingleVFAccess(VF, Stride);
----------------
Ayal wrote:
> Isn't this the same as getRequiredLoadStore() - 2?
> 
> "number of shuffle[s]"
> 
> What do you mean by "a[n] upper type"?
Sorry, the comment for this function is not very clear.
The intent of the function is to identify the number of shuffles required to promote a smaller vector type to a bigger vector type.
When the types differ we can't directly shuffle the elements, because shufflevector expects operands of the same type;
i.e. shuffling <4 x i32> with <2 x i32> is not possible, so it's required to promote <2 x i32> to <4 x i32> with an undef vector.
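
For example (hypothetical values, written in the notation of the example further below):

  Promote: ShuffleVector <2 x i32>Small, <2 x i32>undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
  Combine: ShuffleVector <4 x i32>Wide, <4 x i32>Promote, <4 x i32> <i32 0, i32 1, i32 4, i32 5>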

================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:1158
@@ +1157,3 @@
+   assert(isa<VectorType>(VecTy) && "Expect a vector type");
+   if (Factor <= TLI->getMaxSupportedInterleaveFactor()) {
+     // Input is WideVector type (i.e. VectorTy * Stride)
----------------
Ayal wrote:
> Check if ! and return first.
Sure.

================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:1169
@@ +1168,3 @@
+     // Multiply Cost by number of load/store operation.
+     Cost = Cost * MemOpRequired;
+     unsigned ShuffleCost = 0;
----------------
Ayal wrote:
> This may be different than getMemoryOpCost(Opcode, VecTy, Alignment, AddressSpace);
> because we align each load on its first strided element; e.g., {0,3,6,9} can be loaded using 2 loads of {0,3}{6,9} rather than 3 consecutive loads of {0,11} = {0,3}{4,7}{8,11}.
getMemoryOpCost is used to estimate the cost of a single load or store operation.
We then multiply it by the number of load[s] or store[s] required to get the total load/store cost.
i.e. for VF 4 & stride 3 the total load[s]/store[s] required are 2 (MemOpRequired), so we take the cost of a single load/store and multiply it by 2.
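
As a rough sketch of the arithmetic (illustrative; assuming getMemoryOpCost returns C for one legal vector load/store):

  MemOpRequired = getRequiredLoadStore(4, 3);  // = 2 for VF 4 & stride 3
  Cost = C * MemOpRequired;                    // total load/store cost = 2C
  // the shuffle costs discussed below are then added on top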

================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:1175
@@ +1174,3 @@
+     // i.e. <4 x i32> * <2 x i32> is not possible, so its required
+     // to promote <2 x i32> to <4 x i32> with undefs vector.
+     unsigned TotalSufflePerMemOp = getShuffleRequiredForPromotion(VF, Factor);
----------------
Ayal wrote:
> Explain this also above getShuffleRequiredForPromotion().
> 
> You can shuffle using a reduction binary tree, which may require fewer shuffles. In particular, if getRequiredLoadStore() is a power of 2, no additional promoting shuffles are needed.
> 
> number of shuffle[s]
> 
> can[']t
> 
> Give a full proper example, not one using *
Sure, will provide more details here & at getShuffleRequiredForPromotion().
Good idea, a reduction binary tree can be used here; will try it.
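
For example, a hypothetical VF 8 & stride 4 load (4 loads, 2 valid elements each) combined as a binary tree:

  Level 1: Shuffle#1 = ShuffleVector Load#1, Load#2 -> <4 x i32> (4 valid)
           Shuffle#2 = ShuffleVector Load#3, Load#4 -> <4 x i32> (4 valid)
  Level 2: Shuffle#3 = ShuffleVector Shuffle#1, Shuffle#2 -> <8 x i32>

Both operands at every tree level have the same type, so no promoting shuffles with undef are needed, matching Ayal's power-of-2 observation.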

================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:1178
@@ +1177,3 @@
+     // Identify cost of shuffle which required for promotion.
+     // NOTE: This is only applicable for load.
+     if(Opcode == Instruction::Load)
----------------
Ayal wrote:
> For store, don't we need to add the cost of expanding a vector of VF elements to be masked-stored stridedly?
For store we generate shuffle[s] followed by masked stores, so promotion is not required.
It's only required for load: while loading, the mismatch in vector types means we can't shuffle directly,
so the smaller vector type must be promoted there using a shuffle, but that is not the case with store.
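
A hypothetical sketch of the store path for VF 4 & stride 3 (mirroring the load example further below, with payload elements e0..e3):

  Shuffle#1:     e0 * * e1    (elements placed at offsets 0 and 3)
  MaskedStore#1: mask <1,0,0,1> at base + 0
  Shuffle#2:     e2 * * e3    (elements placed at offsets 6 and 9)
  MaskedStore#2: mask <1,0,0,1> at base + 6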

================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:1182
@@ +1181,3 @@
+         for (unsigned i = 0; i < TotalSufflePerMemOp; i++)
+           ShuffleCost += getShuffleCost(TTI::SK_InsertSubvector, VecTy, 0, SubVecTy);
+     // Identify cost of shuffle required for multiple memory operation.
----------------
Ayal wrote:
> Why not simply multiply ShuffleCost by number of shuffles instead?
> 
> Isn't getShuffle[s]RequiredForPromotion() returning the TotalS[h]uffle[s], rather than the TotalSufflePerMemOp?
Yes, this looks bad; will multiply directly.
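
i.e. something along these lines (sketch; TotalShuffles is a hypothetical name for the full shuffle count):

  ShuffleCost = TotalShuffles * getShuffleCost(TTI::SK_InsertSubvector, VecTy, 0, SubVecTy);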

================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:1185
@@ +1184,3 @@
+     for (unsigned i = 0; i < (MemOpRequired - 1); i++)
+       ShuffleCost += getShuffleCost(TTI::SK_InsertSubvector, SubVecTy, 0, SubVecTy);
+     // Update final cost & return.
----------------
Ayal wrote:
> How many shuffles total?
> 
> Reached this far, to be continued.
Example:
Load for stride 3 & VF 8.

3 shuffles are required in total, of which 1 extra shuffle is needed for promotion:

Load#1:    0 * * 3 * * 6 *
Load#2:    9 * * 12 * * 15 *
Shuffle#1: ShuffleVector <8 x i32>Load#1, <8 x i32>Load#2, <6 x i32> <i32 0, i32 3, i32 6, i32 8, i32 11, i32 14>
Load#3:    18 * * 21 * * * *
Shuffle#2: ShuffleVector <6 x i32>Shuffle#1, <6 x i32>undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 undef, i32 undef>
Shuffle#3: ShuffleVector <8 x i32>Shuffle#2, <8 x i32>Load#3, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 8, i32 11>

Note: Shuffle#2 is the extra shuffle generated for vector type promotion.


Repository:
  rL LLVM

http://reviews.llvm.org/D21363




