[PATCH] D36130: [SLP] Vectorize jumbled memory loads.

Wed Feb 14 02:38:12 PST 2018

ashahid added a comment.

Hi Alexey,

Thanks for looking into it.I will update it accordingly. 
BTW this patch is failing with its tests after the re-base on top of your patch. Do you foresee any conflicting code?

In https://reviews.llvm.org/D36130#1006238, @ABataev wrote:

> In https://reviews.llvm.org/D36130#1006202, @ashahid wrote:
>
> > Hi Alexey,
> >
> > Thanks for looking into it.I will update it accordingly. 
> >  BTW this patch is failing with its tests after the re-base on top of your patch. Do you foresee any conflicting code?
>
>
> Probably, it is hard to say exactly without looking at the result.

No worry it was a merge issue, its fixed.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:736-742
+    /// Records optional shuffle mask for the uses of jumbled memory accesses.
+    /// For example, a non-empty ShuffleMask[1] represents the permutation of
+    /// lanes that operand #1 of this vectorized instruction should undergo
+    /// before feeding this vectorized instruction, whereas an empty
+    /// ShuffleMask[0] indicates that the lanes of operand #0 of this vectorized
+    /// instruction need not be permuted at all.
+    SmallVector<SmallVector<unsigned, 4>, 2> ShuffleMask;
----------------
ABataev wrote:
> ashahid wrote:
> > ABataev wrote:
> > > Why you can't have just one shuffle here for all external uses?
> > This is for in-tree multi uses of a single vector load where the uses has different masks/permutation.
> > This section of comment https://reviews.llvm.org/D36130#inline-326711
> > discussed it earlier. Also there is figure attached.
> I still don't understand what's the problem here.
> 1. You need to perform the loads in some order. 
> 2. You sort the loads to be in the sequntially direct order and perform the vector load starting from the lowest address.
> 3. You reshuffle the loaded vector value to the original order.
> That's it, you have your loads in the required order. Just one shuffle is required. Why do you need some more? Also, I don't understand why do you need so many changes, why do you need additional indicies etc.
Updated jumbled-load.ll captures this case where instead of gathering the second operand of MUL we can have required shuffle of the same loaded vector

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1661
+
+      if (VL.size() > 2) {
+        bool ShuffledLoads = true;
----------------
ABataev wrote:
> ashahid wrote:
> > ABataev wrote:
> > > Is it possible at all that `VL` has less than 4 elements here?
> > I think yes, for example a couple of i64 loads considering minimum register width as 128-bit.
> > 
> > However, this check here was basically meant to indicate jumbled loads of size 2 is essentially a reversed load.
> It is going to be handled by the reverse loads patch
Yes, this check no more required.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1677-1678
+                            "permutation of loaded lanes.\n");
+            newTreeEntry(Sorted, true, UserTreeIdx,
+                         makeArrayRef(Mask.begin(), Mask.end()), OpdNum);
+            return;
----------------
ABataev wrote:
> ashahid wrote:
> > ABataev wrote:
> > > Bad decision. It is better to use original `VL` here, rather than `Sorted` and add an additional array of sorted indieces. In this case you don't need all these additional numbers and all that complex logic to find the correct tree entry for the list of values.
> > In fact earlier design in patch (https://reviews.llvm.org/D26905) was to use original VL, however there was counter argument to that which I don't remember exactly.
> It is better to use original `VL` here, otherwise it will end with a lot of troubles and will require the whole bunch of changes in the vectorization process to find the perfect match for the vector of vectorized values. I don't think it is a good idea to have a lot of changes accross the whole module to handle jumbled loads.
In the context where we can have multiple user of loaded vector with different shuffle mask, the design is to represent these different shuffle mask for each user corresponding to the user's operand number. Having single sorted indices will not be sufficient for this.
Given the objective of handling multiple out of order uses changes are not that big I feel.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:3063
+        isJumbled = true;
+        LI = cast<LoadInst>(E->Scalars[0]);
+      } else {
----------------
ABataev wrote:
> Is this correct? `E->Scalars[0]` is exactly `VL0`
Ah, both are same.

Repository:
  rL LLVM

https://reviews.llvm.org/D36130