[PATCH] D26905: [SLP] Vectorize loads of consecutive memory accesses, accessed in non-consecutive (jumbled) way.

Tue Nov 22 06:35:26 PST 2016

ashahid added a comment.

Hi Simon, Michael

Thanks for the comments. Pls find the response inlined.

Thanks,
Shahid

================
Comment at: lib/Analysis/LoopAccessAnalysis.cpp:1022
+                                        ScalarEvolution &SE) {
+  std::multimap<int, Value *> offValPair;
+  for (unsigned i = 0; i < VL.size(); i++) {
----------------
mkuper wrote:
> Why is this a multimap?
This is because the elements in the multimap follow a certain order, so using this will ensure that the values are sorted accordingly.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:464
+    /// \returns true if the scalars in VL are found in this entry.
+    bool isJumbled(ArrayRef<Value *> VL) const {
+      assert(VL.size() == Scalars.size() && "Invalid size");
----------------
mkuper wrote:
> I think you may want a different name - this doesn't actually check whether the scalars are jumbled, it checks whether they're all present. That is, it'll return true even if they're all in-order.
What about isFoundJumbled()?

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:466
+      assert(VL.size() == Scalars.size() && "Invalid size");
+      for (unsigned i = 0; i < VL.size(); i++) {
+        bool Found = false;
----------------
mkuper wrote:
> RKSimon wrote:
> > This might be simplified with a for range loop and use of llvm:none_of / any_of?
> Would it make sense to pre-sort both arrays, and then check the two sorted arrays are equal? This would make it O(nlogn) instead of O(n^2)
> (I'm not sure sort based on what, though - as well as the actual gain, since I guess VL.size() is small in practice.)
Pre-sorting would require two calls for sort and then compare, IMO, for the given small VL.size it would not make much difference. However I am open to other views.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1197
       // Check if the loads are consecutive, reversed, or neither.
       // TODO: What we really want is to sort the loads, but for now, check
       // the two likely directions.
----------------
mkuper wrote:
> This TODO gets done.
Yes, that's right.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1217

+      if (VL.size() > 2) {
+        SmallVector<Value *, 8> list;
----------------
mkuper wrote:
> mkuper wrote:
> > Do we still need the ReverseConsecutive case at all? That was introduced in r276477 as a patch for the common case of PR28474, where we find the loads in reverse order. But this should completely supersede it, right?
> Why not for VL.size() == 2?
A jumbled VL of VL.size() == 2 is essentially a case of reversed VL. Considering the tradeoff between compile time of extra buildTree() for VL.size==2  vs additional runtime for shufflevector, I opted for extra compile time over extra runtime.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1219
+        SmallVector<Value *, 8> list;
+        auto newVL = (ArrayRef<Value *>)0;
+        bool shuffledLoads = true;
----------------
mkuper wrote:
> This looks rather weird. Can you make it more idiomatic?
Sure

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1228
+          }
+        }
+        if (shuffledLoads) {
----------------
RKSimon wrote:
> Is it worth breaking here once we know that shuffledLoad is false?
> 
> Remove the braces if you can.
Seems yes.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:2582
+            TreeEntry *E = &VectorizableTree[Idx];
+            for (unsigned Lane = 0, LE = VL.size(); Lane != LE; ++Lane) {
+              if (E->Scalars[Lane] == VL[i]) {
----------------
RKSimon wrote:
> for range loop?
Sure

================
Comment at: test/Transforms/SLPVectorizer/X86/jumbled-load.ll:3
+
+
+; CHECK-LABEL: @jumbled-load
----------------
RKSimon wrote:
> Possibly commit this test to trunk with the current output generated by utils/update_test_checks.py?
By "current output" do you mean output generated by utils/update_test_checks.py with this patch by ?

================
Comment at: test/Transforms/SLPVectorizer/X86/jumbled-load.ll:10
+
+define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %out) {
+
----------------
mkuper wrote:
> Please add a test that has several load packets (e.g. multiplies one load sequence by another load sequence).
Sure

================
Comment at: test/Transforms/SLPVectorizer/X86/reduction_loads.ll:8
+; CHECK: mul <8 x i32> <i32 42, i32 42, i32 42, i32 42, i32 42, i32 42, i32 42, i32 42>, [[SHUF]]

 define i32 @test(i32* nocapture readonly %p) {
----------------
mkuper wrote:
> RKSimon wrote:
> > It'd be better if there was more context to this additional shuffle - regenerate + commit the current output with utils/update_test_checks.py ? For an IR loop it's not that large an output.
> Yes, I'd be interested to know if we added a shuffle here, or just moved a shuffle from the store side to the load side (which makes sense). 
By "current output" do you mean output generated by utils/update_test_checks.py with this patch by ? Pls explain.

https://reviews.llvm.org/D26905