[PATCH] D21363: Strided Memory Access Vectorization
Ayal Zaks via llvm-commits
llvm-commits at lists.llvm.org
Mon Jun 20 10:00:07 PDT 2016
Ayal added a subscriber: Ayal.
Ayal added a comment.
Part 1. To be continued.
Are the performance improvements relative to scalar code, or relative to vectorizing with the scalar loads/stores packed into vectors? For your 1-line example above the latter may not be better, but for more computationally intensive loops it might be.
================
Comment at: include/llvm/Analysis/TargetTransformInfo.h:531-542
@@ -530,2 +530,14 @@
+ /// \return The cost of the strided memory operation.
+ /// \p Opcode is the memory operation code
+ /// \p VecTy is the vector type of the interleaved access.
+ /// \p Factor is the interleave factor
+ /// \p Indices is the indices for interleaved load members (as interleaved
+ /// load allows gaps)
+ /// \p Alignment is the alignment of the memory operation
+ /// \p AddressSpace is address space of the pointer.
+ int getStridedMemoryOpCost(unsigned Opcode, Type *VecTy, unsigned Factor,
+ ArrayRef<unsigned> Indices, unsigned Alignment,
+ unsigned AddressSpace) const;
+
/// \brief Calculate the cost of performing a vector reduction.
----------------
Why not simply call getInterleavedMemoryOpCost() with a single Index of 0 instead of specializing it to getStridedMemoryOpCost()?
Yes, we need to implement it for x86.
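For illustration, a minimal sketch of the forwarding I have in mind (the free-function wrapper and its name are mine, not in the patch; it assumes the existing getInterleavedMemoryOpCost() signature):

  // Sketch only: model a strided access as an interleaved access group of
  // factor Stride whose single live member is at index 0, so the existing
  // hook can price it and no new TTI entry point is needed.
  static int getStridedCostViaInterleaved(const TargetTransformInfo &TTI,
                                          unsigned Opcode, Type *VecTy,
                                          unsigned Stride, unsigned Alignment,
                                          unsigned AddressSpace) {
    unsigned Indices[] = {0};
    return TTI.getInterleavedMemoryOpCost(Opcode, VecTy, Stride, Indices,
                                          Alignment, AddressSpace);
  }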
================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:40
@@ +39,3 @@
+static unsigned getValidMemInSingleVFAccess(unsigned VF, unsigned Stride) {
+ return ((VF - 1) / Stride) + 1;
+}
----------------
This "ceiling(VF/Stride)" is not x86 specific.
getNumStridedElementsInConsecutiveVF()? Better to simply inlining this short formula, or call it "ceiling".
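i.e., something generic along these lines (a sketch; the name and placement are my suggestion, not in the patch):

  // Plain ceiling division; nothing x86 specific about it.
  static unsigned Ceiling(unsigned A, unsigned B) {
    return (A + B - 1) / B; // same value as ((A - 1) / B) + 1 for A >= 1
  }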
================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:50
@@ +49,3 @@
+/// Offset 0, 3, 6, 9 required to fill vector register.
+/// So 2 vector load will be requied.
+/// NOTE: It assumes all iteration for a given stride holds common memory
----------------
ashutosh.nema wrote:
> delena wrote:
> > But it depends on element size..
> I did not understand this comment completely.
>
> Are you pointing out that the vectorizer can go above the target-supported width?
> I.e.
>   double foo(double *A, double *B, int n) {
>     double sum = 0;
> #pragma clang loop vectorize_width(16)
>     for (int i = 0; i < n; ++i)
>       sum += A[i] + 5;
>     return sum;
>   }
requi[r]ed
This is counting how many vectors of VF consecutive elements each are needed to cover a set of VF strided elements. To match the number of target loads or stores, the element size would need to be such that VF elements fit in a SIMD register.
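To make the counting concrete, a compilable toy example with numbers of my own choosing (assuming VF elements of this size fit in one SIMD register):

  #include <cassert>
  int main() {
    // VF = 4, Stride = 3: a load of elements [0..3] covers strided
    // elements {0,3}, a second load of [6..9] covers {6,9} -- two wide
    // loads cover the four strided elements {0,3,6,9}.
    unsigned VF = 4, Stride = 3;
    unsigned PerLoad = ((VF - 1) / Stride) + 1;                // = 2
    unsigned NumLoads = VF / PerLoad + (VF % PerLoad ? 1 : 0); // = 2
    assert(PerLoad == 2 && NumLoads == 2);
    return 0;
  }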
================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:55
@@ +54,3 @@
+ unsigned VMem = getValidMemInSingleVFAccess(VF, Stride);
+ return (VF / VMem) + ((VF % VMem) ? 1 : 0);
+}
----------------
This is a similar "ceiling(VF/VMem)" computation as before, better use the same formula for clarity. Perhaps the clearest is "(VF+Stride-1) / Stride".
(Or return getValidMemInSingleVFAccess(VF, Vmem) ... ;-)
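i.e. (sketch, reusing the generic helper suggested above rather than a second hand-rolled variant):

  return Ceiling(VF, VMem); // same value as VF / VMem + (VF % VMem ? 1 : 0)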
================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:61
@@ +60,3 @@
+/// shuffle.
+static unsigned getShuffleRequiredForPromotion(unsigned VF, unsigned Stride) {
+ unsigned ValidElements = getValidMemInSingleVFAccess(VF, Stride);
----------------
Isn't this the same as getRequiredLoadStore() - 2?
"number of shuffle[s]"
What do you mean by "a[n] upper type"?
================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:1158
@@ +1157,3 @@
+ assert(isa<VectorType>(VecTy) && "Expect a vector type");
+ if (Factor <= TLI->getMaxSupportedInterleaveFactor()) {
+ // Input is WideVector type (i.e. VectorTy * Stride)
----------------
Better to check the negated condition and return early first.
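i.e., restructure along these lines (a sketch; FallbackCost stands for whatever the current non-supported path returns, which this hunk doesn't show):

  if (Factor > TLI->getMaxSupportedInterleaveFactor())
    return FallbackCost;
  // ... main body follows, one indentation level shallower ...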
================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:1169
@@ +1168,3 @@
+ // Multiply Cost by number of load/store operation.
+ Cost = Cost * MemOpRequired;
+ unsigned ShuffleCost = 0;
----------------
This may be different from getMemoryOpCost(Opcode, VecTy, Alignment, AddressSpace),
because we anchor each load at its first strided element; e.g., the strided elements {0,3,6,9} can be loaded using 2 loads covering [0..3] and [6..9], rather than 3 consecutive loads covering [0..11] = [0..3][4..7][8..11].
================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:1175
@@ +1174,3 @@
+ // i.e. <4 x i32> * <2 x i32> is not possible, so its required
+ // to promote <2 x i32> to <4 x i32> with undefs vector.
+ unsigned TotalSufflePerMemOp = getShuffleRequiredForPromotion(VF, Factor);
----------------
Explain this also above getShuffleRequiredForPromotion().
You can shuffle using a reduction binary tree, which may require fewer shuffles (see the sketch after this comment). In particular, if getRequiredLoadStore() is a power of 2, no additional promoting shuffles are needed.
number of shuffle[s]
can[']t
Give a full proper example, not one using *
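To spell out the binary-tree point, a sketch of the shuffle count it implies (the helper name is mine, hypothetical):

  // Concatenating NumParts loaded subvectors pairwise in a binary tree
  // takes NumParts - 1 shuffles (e.g. 4 parts: a+b, c+d, then ab+cd); if
  // NumParts is a power of 2 the final result is already full width, so
  // no additional promoting shuffles are needed.
  static unsigned TreeConcatShuffles(unsigned NumParts) {
    return NumParts - 1;
  }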
================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:1178
@@ +1177,3 @@
+ // Identify cost of shuffle which required for promotion.
+ // NOTE: This is only applicable for load.
+ if(Opcode == Instruction::Load)
----------------
For store, don't we need to add the cost of expanding a vector of VF elements to be masked-stored stridedly?
================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:1182
@@ +1181,3 @@
+ for (unsigned i = 0; i < TotalSufflePerMemOp; i++)
+ ShuffleCost += getShuffleCost(TTI::SK_InsertSubvector, VecTy, 0, SubVecTy);
+ // Identify cost of shuffle required for multiple memory operation.
----------------
Why not simply multiply ShuffleCost by the number of shuffles instead?
Isn't getShuffle[s]RequiredForPromotion() returning the TotalS[h]uffle[s], rather than the TotalSufflePerMemOp?
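i.e., something like (sketch; TotalShuffles standing for whichever of the two counts is the right one, per the question above):

  ShuffleCost = TotalShuffles *
                getShuffleCost(TTI::SK_InsertSubvector, VecTy, 0, SubVecTy);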
================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:1185
@@ +1184,3 @@
+ for (unsigned i = 0; i < (MemOpRequired - 1); i++)
+ ShuffleCost += getShuffleCost(TTI::SK_InsertSubvector, SubVecTy, 0, SubVecTy);
+ // Update final cost & return.
----------------
How many shuffles total?
Reached this far, to be continued.
Repository:
rL LLVM
http://reviews.llvm.org/D21363