[PATCH] D46126: [SLP] Vectorize transposable binary operand bundles

Tue May 1 08:46:37 PDT 2018

Ayal added inline comments.

================
Comment at: lib/Analysis/VectorUtils.cpp:535
+  return ConstantVector::get(Mask);
+}
+
----------------
mssimpso wrote:
> Ayal wrote:
> > Could `createStrideMask` be (re)used?
> Not directly, but we could use `createStrideMask` to create two stride-2 masks and then interleave them.
> 
> ```
> stride_mask_0 = <0, 2,  4, 6>
> stride_mask_1 = <8, 10, 12, 14>
> transpose_mask = <0, 8, 2, 10, 4, 12, 6, 14>
> ```
> 
> What do you think?
I'm a bit confused about the concatenation order which presumably leads to this interleaved strided mask, rather than using the strided <0, 2, 4, ..., 2*(VF-1)> mask directly. See comments below, also examining the VF=4 tests.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1463
+    Opcode2Operands[VLIOp0->getOpcode()].push_back(VLIOp0);
+    Opcode2Operands[VLIOp0->getOpcode()].push_back(VLIOp1);
+  }
----------------
mssimpso wrote:
> Ayal wrote:
> > Both operands are mapped to the opcode of operand 0?
> I've already checked that both operands have the same opcode (starting at line 1423). I also added a comment here, but this is probably still a bit confusing. I will rewrite this to make it more clear.
Ahh, of course...

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1472
+  // Find the size of the smallest isomorphic bundle. We will partition larger
+  // bundles so that each bundle is the same size.
+  unsigned MinSize =
----------------
Making all bundles have the same (smallest) size certainly simplifies things, but potentially misses performance opportunities; at-least in theory. Perhaps deserves a comment.

Suffice to check that `MinSize` is a power of 2, and that all bundle sizes are divisible by MinSize, rather than requiring all bundle sizes to be powers of 2. E.g., having one bundle of size 2 and another of size 6 should be fine, iiuc.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:1484
+  SmallVector<BoUpSLP::ValueList, 4> Bundles;
+  for (auto &Entry : Opcode2Operands) {
+    ArrayRef<Value *> Bundle(Entry.second);
----------------
ABataev wrote:
> `auto &`->`const BoUpSLP::ValueList &` or, probably, you may use `ArrayRef<Value*` here
This is a bit confusing: bundles are traversed according to their insertion order into Opcode2Operands, i.e., according to original operand order; but are aggregated and placed inside `Bundles` according to their opcode?

================
Comment at: test/Transforms/SLPVectorizer/AArch64/transpose.ll:93
   %tmp2.1 = add i32 %tmp1.0, %tmp1.1
   %tmp2.2 = add i32 %tmp0.2, %tmp0.3
   %tmp2.3 = add i32 %tmp1.2, %tmp1.3
----------------
Indeed this calls for shuffling with <0,4,2,6> and <1,5,3,7> masks; but is this pattern more natural to expect than
```
  %tmp2.1 = add i32 %tmp0.2, %tmp0.3
  %tmp2.2 = add i32 %tmp1.0, %tmp1.1
```

Perhaps in some complex numbers context(?)

================
Comment at: test/Transforms/SLPVectorizer/AArch64/transpose.ll:138
+; CHECK-NEXT:    [[TMP7:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> [[TMP6]], <4 x i32> <i32 0, i32 4, i32 2, i32 6>
+; CHECK-NEXT:    [[TMP8:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> [[TMP6]], <4 x i32> <i32 1, i32 5, i32 3, i32 7>
+; CHECK-NEXT:    [[TMP9:%.*]] = add <4 x i32> [[TMP7]], [[TMP8]]
----------------
Could this be done equally well with 'normal' strided masks, i.e.:

```
; CHECK-NEXT:    [[TMP5:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; CHECK-NEXT:    [[TMP6:%.*]] = shufflevector <2 x i32> [[TMP3]], <2 x i32> [[TMP4]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; CHECK-NEXT:    [[TMP7:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> [[TMP6]], <4 x i32> <i32 0, i32 2, i32 4, i32 6>
; CHECK-NEXT:    [[TMP8:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> [[TMP6]], <4 x i32> <i32 1, i32 3, i32 5, i32 7>
```

================
Comment at: test/Transforms/SLPVectorizer/AArch64/transpose.ll:153
   %tmp1.2 = xor i32 %v0.0, %v1.0
   %tmp1.3 = xor i32 %v0.1, %v1.1
   %tmp2.0 = add i32 %tmp0.0, %tmp0.1
----------------
This admittedly refers to the original test: `%tmp1.2 == %tmp0.2` and `%tmp1.3 == %tmp0.3`. Not sure if this was intentional, but  it does raise the issue of exercising bundles of originally different sizes, and potential reuse of same instruction multiple times. Are the four xor's first bundled together, and then broken into two adjacent bundles of MinSize=2?

================
Comment at: test/Transforms/SLPVectorizer/AArch64/transpose.ll:198
   %tmp2.1 = add i32 %tmp1.0, %tmp1.1
   %tmp2.2 = add i32 %tmp0.2, %tmp0.3
   %tmp2.3 = add i32 %tmp1.2, %tmp1.3
----------------
ditto.

Repository:
  rL LLVM

https://reviews.llvm.org/D46126