[PATCH] D36130: [SLP] Vectorize jumbled memory loads.

Tue Feb 13 06:47:19 PST 2018

ashahid added a comment.

Hi Alexey,

Thanks for looking into it.I will update it accordingly. 
BTW this patch is failing with its tests after the re-base on top of your patch. Do you foresee any conflicting code?

================
Comment at: lib/Analysis/LoopAccessAnalysis.cpp:1112
+// accesses are entered into the map.
+bool llvm::sortLoadAccesses(ArrayRef<Value *> VL, const DataLayout &DL,
+                            ScalarEvolution &SE,
----------------
ABataev wrote:
> This function can be used for stores also, it is better to make it universal for stores/loads.
I plan to do such improvement in separate patches.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:736-742
+    /// Records optional shuffle mask for the uses of jumbled memory accesses.
+    /// For example, a non-empty ShuffleMask[1] represents the permutation of
+    /// lanes that operand #1 of this vectorized instruction should undergo
+    /// before feeding this vectorized instruction, whereas an empty
+    /// ShuffleMask[0] indicates that the lanes of operand #0 of this vectorized
+    /// instruction need not be permuted at all.
+    SmallVector<SmallVector<unsigned, 4>, 2> ShuffleMask;
----------------
ABataev wrote:
> Why you can't have just one shuffle here for all external uses?
This is for in-tree multi uses of a single vector load where the uses has different masks/permutation.
This section of comment https://reviews.llvm.org/D36130#inline-326711
discussed it earlier. Also there is figure attached.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1661
+
+      if (VL.size() > 2) {
+        bool ShuffledLoads = true;
----------------
ABataev wrote:
> Is it possible at all that `VL` has less than 4 elements here?
I think yes, for example a couple of i64 loads considering minimum register width as 128-bit.

However, this check here was basically meant to indicate jumbled loads of size 2 is essentially a reversed load.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1677-1678
+                            "permutation of loaded lanes.\n");
+            newTreeEntry(Sorted, true, UserTreeIdx,
+                         makeArrayRef(Mask.begin(), Mask.end()), OpdNum);
+            return;
----------------
ABataev wrote:
> Bad decision. It is better to use original `VL` here, rather than `Sorted` and add an additional array of sorted indieces. In this case you don't need all these additional numbers and all that complex logic to find the correct tree entry for the list of values.
In fact earlier design in patch (https://reviews.llvm.org/D26905) was to use original VL, however there was counter argument to that which I don't remember exactly.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:2229-2235
+        for (const SmallVector<unsigned, 4> &Mask :
+             UserTreeEntry->ShuffleMask) {
+          if (!Mask.empty()) {
+            VecLdCost += TTI->getShuffleCost(
+                TargetTransformInfo::SK_PermuteSingleSrc, VecTy, 0);
+          }
+        }
----------------
ABataev wrote:
> You don't need so many shuffles, it is enough just to have just one. 
This is basically for multiple in-tree uses with different masks/permutation.

Repository:
  rL LLVM

https://reviews.llvm.org/D36130