[PATCH] D21363: Strided Memory Access Vectorization

Ayal Zaks via llvm-commits llvm-commits at lists.llvm.org
Mon Jun 20 10:00:07 PDT 2016


Ayal added a subscriber: Ayal.
Ayal added a comment.

Part 1. To be continued.

Are the performance improvements relative to scalar code, or relative to vectorizing with scalar loads/stores packed into vectors? For your 1-line example above the latter may not be better, but for more computationally intensive loops it might.


================
Comment at: include/llvm/Analysis/TargetTransformInfo.h:531-542
@@ -530,2 +530,14 @@
 
+  /// \return The cost of the strided memory operation.
+  /// \p Opcode is the memory operation code
+  /// \p VecTy is the vector type of the interleaved access.
+  /// \p Factor is the interleave factor
+  /// \p Indices is the indices for interleaved load members (as interleaved
+  ///    load allows gaps)
+  /// \p Alignment is the alignment of the memory operation
+  /// \p AddressSpace is address space of the pointer.
+  int getStridedMemoryOpCost(unsigned Opcode, Type *VecTy, unsigned Factor,
+                             ArrayRef<unsigned> Indices, unsigned Alignment,
+                             unsigned AddressSpace) const;
+
   /// \brief Calculate the cost of performing a vector reduction.
----------------
Why not simply call getInterleavedMemoryOpCost() with a single Index of 0 instead of specializing it to getStridedMemoryOpCost()?

Yes, we need to implement it for x86.

================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:40
@@ +39,3 @@
+static unsigned getValidMemInSingleVFAccess(unsigned VF, unsigned Stride) {
+  return ((VF - 1) / Stride) + 1;
+}
----------------
This "ceiling(VF/Stride)" is not x86-specific.

getNumStridedElementsInConsecutiveVF()? Better to simply inline this short formula, or to call it "ceiling".

================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:50
@@ +49,3 @@
+/// Offset 0, 3, 6, 9 required to fill vector register.
+/// So 2 vector load will be requied.
+/// NOTE: It assumes all iteration for a given stride holds common memory
----------------
ashutosh.nema wrote:
> delena wrote:
> > But it depends on element size..
> I did not understand this comment completely.
> 
> Are you pointing out where the vectorizer can go above the target-supported width?
> I.e.
> double foo(double *A, double *B, int n) {
>   double sum = 0;
> #pragma clang loop vectorize_width(16)
>   for (int i = 0; i < n; ++i)
>     sum += A[i] + 5;
>   return sum;
> }
requi[r]ed

This is counting how many vectors of VF-consecutive-elements-each are needed to cover a set of VF strided elements. To match the number of target loads or stores, the element size would need to be such that VF elements fit in a SIMD register.

================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:55
@@ +54,3 @@
+  unsigned VMem = getValidMemInSingleVFAccess(VF, Stride);
+  return (VF / VMem) + ((VF % VMem) ? 1 : 0);
+}
----------------
This is a similar "ceiling(VF/VMem)" computation as before; better to use the same formula for clarity. Perhaps the clearest is "(VF+Stride-1) / Stride".

(Or return getValidMemInSingleVFAccess(VF, VMem) ... ;-)

================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:61
@@ +60,3 @@
+/// shuffle.
+static unsigned getShuffleRequiredForPromotion(unsigned VF, unsigned Stride) {
+  unsigned ValidElements = getValidMemInSingleVFAccess(VF, Stride);
----------------
Isn't this the same as getRequiredLoadStore() - 2?

"number of shuffle[s]"

What do you mean by "a[n] upper type"?

================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:1158
@@ +1157,3 @@
+   assert(isa<VectorType>(VecTy) && "Expect a vector type");
+   if (Factor <= TLI->getMaxSupportedInterleaveFactor()) {
+     // Input is WideVector type (i.e. VectorTy * Stride)
----------------
Check the negated condition and return first (early exit).

================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:1169
@@ +1168,3 @@
+     // Multiply Cost by number of load/store operation.
+     Cost = Cost * MemOpRequired;
+     unsigned ShuffleCost = 0;
----------------
This may be different from getMemoryOpCost(Opcode, VecTy, Alignment, AddressSpace), because we align each load on its first strided element; e.g., {0,3,6,9} can be loaded using 2 loads of {0,3}{6,9} rather than 3 consecutive loads of {0,11} = {0,3}{4,7}{8,11}.

================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:1175
@@ +1174,3 @@
+     // i.e. <4 x i32> * <2 x i32> is not possible, so its required
+     // to promote <2 x i32> to <4 x i32> with undefs vector.
+     unsigned TotalSufflePerMemOp = getShuffleRequiredForPromotion(VF, Factor);
----------------
Explain this also above getShuffleRequiredForPromotion().

You can shuffle using a reduction binary tree, which may require fewer shuffles. In particular, if getRequiredLoadStore() is a power of 2, no additional promoting shuffles are needed.

number of shuffle[s]

can[']t

Give a full proper example, not one using *

================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:1178
@@ +1177,3 @@
+     // Identify cost of shuffle which required for promotion.
+     // NOTE: This is only applicable for load.
+     if(Opcode == Instruction::Load)
----------------
For store, don't we need to add the cost of expanding a vector of VF elements to be masked-stored stridedly?

================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:1182
@@ +1181,3 @@
+         for (unsigned i = 0; i < TotalSufflePerMemOp; i++)
+           ShuffleCost += getShuffleCost(TTI::SK_InsertSubvector, VecTy, 0, SubVecTy);
+     // Identify cost of shuffle required for multiple memory operation.
----------------
Why not simply multiply ShuffleCost by the number of shuffles instead?

Isn't getShuffle[s]RequiredForPromotion() returning the TotalS[h]uffle[s], rather than the TotalSufflePerMemOp?

================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:1185
@@ +1184,3 @@
+     for (unsigned i = 0; i < (MemOpRequired - 1); i++)
+       ShuffleCost += getShuffleCost(TTI::SK_InsertSubvector, SubVecTy, 0, SubVecTy);
+     // Update final cost & return.
----------------
How many shuffles total?

Reached this far, to be continued.


Repository:
  rL LLVM

http://reviews.llvm.org/D21363
