[PATCH] D22554: [SLP] Vectorize loads in horizontal reductions when they happen to be in the reverse order

Thu Jul 21 15:09:33 PDT 2016

mkuper added a comment.

Thanks a lot for the review, Matt!

In https://reviews.llvm.org/D22554#491805, @mssimpso wrote:

> I agree with most of your high-level points. We shouldn't really need to pay the cost of rebuilding the entire tree when we find that the loads are unordered. In fact, except for loads and stores, the order of the instructions shouldn't really matter at all. The fact that we assume an order in some cases is a problem that should be fixable. For example, I think VL[0] is commonly used as a representative of the bundle for things like getting the type, not to imply an ordering. I bet we could eliminate most instances of VL[0] with appropriate additions to TreeEntry. And we keep track of a Lane index for ExternalUsers just so we can extract the correct value from the vectors. But this is already kind of redundant since ExternalUser also holds a pointer to the scalar value and we maintain a list of the scalar values in TreeEntry.


Yes, that seems fairly simple in theory, but I tried to actually implement something like this - without rewriting the whole thing from scratch :-) - and ran into more trouble than I expected.
Then again, it may just be my lack of familiarity with the code.

> So The difficult case seems to be the store-rooted trees with unordered loads. For that, we could use shuffles like you mentioned.


I think It's a bit more complex than that.
I agree the most difficult case is store-rooted trees with unordered loads, but there are at least two other issues that aren't related to the technical problems with getting in-tree sorting to work:

1. Even for non-store-rooted trees, we may have several load leaves in the tree, and they may need different orders.
2. There may be a cost modeling issue - I'm not sure a wide load + shuffles is always better than scalar loads. The decision may depend on the width of the vector and on how expensive the specific shuffle is.


================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:884
@@ -883,3 +883,3 @@
 
   // Number of load-bundles of size 2, which are consecutive loads if reversed.
   int NumLoadsWantToChangeOrder;
----------------
mssimpso wrote:
> Please update the comment since we now track reversed bundles of size greater than 2.
Right, thanks!

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1158
@@ -1158,1 +1157,3 @@
+
+      // Check if the loads are consecutive or reversed.
       for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {
----------------
mssimpso wrote:
> We also check for the isSimple case here as well. It would be nice to make the comment more detailed since you're already changing it.
Sure.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1174
@@ +1173,3 @@
+        if (!isConsecutiveAccess(VL[i], VL[i + 1], *DL, *SE))
+          Consecutive = false;
+
----------------
mssimpso wrote:
> We can break out of the loop when setting Consecutive to false to avoid unneeded calls to isConsecutiveAccess.
> 
> But I'm wondering if there other ways to make this faster. We are looking for the all consecutive or all reverse-consecutive cases. So for example if we find at least one consecutive pair in this first loop, there's no chance the list can be all reverse-consecutive.
Good point. I'll try to make it a bit more efficient.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:3818
@@ -3803,3 +3817,3 @@
     if (allowReorder && R.shouldReorder()) {
       assert(Ops.size() == 2);
       assert(BuildVectorSlice.empty());
----------------
mssimpso wrote:
> This assert is confusing now. I think it only makes sense as-is because allowReorder is only true when coming from tryToVectorizePair. But you've removed the size-equal-2 restriction when checking for the reverse consecutive case. I think we should add a more explanatory comment. What do you think?
Right. At first, I removed the assert, and replaced this code with std::reverse, like the other case, but decided to keep the assert because it made which cases are currently supposed to get here clear.  I'll add a comment explaining this.

================
Comment at: test/Transforms/SLPVectorizer/X86/reduction_loads.ll:4-6
@@ +3,5 @@
+; CHECK-LABEL: @test
+; CHECK: %0 = bitcast i32* %p to <8 x i32>*
+; CHECK: %1 = load <8 x i32>, <8 x i32>* %0, align 4
+; CHECK: %2 = mul <8 x i32> <i32 42, i32 42, i32 42, i32 42, i32 42, i32 42, i32 42, i32 42>, %1
+
----------------
mssimpso wrote:
> Please use regular expressions for the unnamed instructions to avoid future breakage.
Sure.


https://reviews.llvm.org/D22554