[PATCH] D52653: [CodeGen, AArch64] Combine Interleaved Loads which are not covered by the Vectorizer

Mon Oct 15 11:12:52 PDT 2018

marels added a comment.

@john.brawn thanks for the input. I commented on some and will upload a new revision shortly

================
Comment at: lib/CodeGen/InterleavedLoadCombinePass.cpp:112
+  /// Dominator Tree Analysis
+  DominatorTree *DT = nullptr;
+
----------------
john.brawn wrote:
> I don't think any of the dominator checks in this file are needed, so this isn't needed.
I think some of the dominator checks can be removed, but at least one needs to remain (see comment in line 1229)

================
Comment at: lib/CodeGen/InterleavedLoadCombinePass.cpp:225-226
+
+  /// Value
+  Value *V;
+
----------------
john.brawn wrote:
> I don't think this is needed. hasB() can check !B.empty(), and in isCompatibleTo() we can assume that the user of this class will be using polynomials of the same variable.
I think it is necessary. The vector B stores the operations executed on V, but V is the value those operations are applied on. Example:

```
define @fn(i64 %a, i64 %b, u8 *%base)
%ptr1 = gep %base, %a 
%ptr2 = gep %base, %b
```

This leads to two incompatible polynomials:
```
1*%a + 0 and 
1*%b + 0```
whereas:
```
Pa = { .B = <{mul, 1}>, V = %a, A = 0}
```
and
```
Pb = { .B = <{mul, 1}>, V = %b, A = 0}
```

Currently `{mul, 1}` is stored in `B` thus `B.empty()` could be used. But this is not necessary but pushing {mul, 1} unnecessarily restricts the computation, and i think I will remove this.

================
Comment at: lib/CodeGen/InterleavedLoadCombinePass.cpp:679
+  /// Information of a Vector Element
+  struct ElementInfo {
+    /// Offset Polynomial.
----------------
john.brawn wrote:
> LI is used only for the InsertionPoint in combine, and you can find that by picking the least element of VectorInfo::LIs e.g.
> 
> ```
>   LoadInst *getFirstLoad() const { return *std::min_element(LIs.begin(), LIs.end()); }
> ```
> so LI isn't needed here, so we don't need ElementInfo at all and can just have an array of Polynomial.
I am not sure if this field can be omitted:

  - The InsertionPoint is not always the first load. It is the load instruction loading the from the memory address where the first element of the interleave is located at. This insertion point is selected because at this position it is guaranteed that the Pointer Value from which that combined load has to load from is available and that this value can be used just by adding a pointer cast.
  - The LI field also guarantees that this load exists. This is not always the case consider for example the wired case:
```
%ptr1 = &a[0];
%ptr2 = &a[9];
%v1 =<12 x float> load %ptr1
%v2 =<12 x float> load %ptr2
%s1 = <4 x float> shufflevector %v1, %v2, <1, 5, 9, 16>
%s2 = <4 x float> shufflevector %v1, %v2, <2, 6, 10, 17>
%s3 = <4 x float> shufflevector %v1, %v2, <3, 7, 11, 18>
%s4 = <4 x float> shufflevector %v1, %v2, <4, 8, 15, 19>
```
This interleave would be detected in the first step, but the transformation would be aborted because the there is no load that loads from &a[1].

================
Comment at: lib/CodeGen/InterleavedLoadCombinePass.cpp:1137
+LoadInst *
+InterleavedLoadCombine::findDominator(const std::set<LoadInst *> &LIs) {
+  for (auto &Cand : LIs) {
----------------
john.brawn wrote:
> All loads are required to be in the same block, so the dominator will just be the first load.
Thats true, I will replace it by some kind of 
```std::find_if(BB.begin(), BB.end(), [] {...});```

Note that LIs is a set of pointer values, thus it is ordered by addresses to guarantee uniqueness. It is not ordered in respected to dominance so `LIs.front()` will not work here.

================
Comment at: lib/CodeGen/InterleavedLoadCombinePass.cpp:1229
+  for (auto &VI : InterleavedLoad) {
+    if (!DT->dominates(InsertionPoint, VI->SVI))
+      return false;
----------------
john.brawn wrote:
> This isn't needed, as if this weren't the case the IR wouldn't be valid in the first place (I'm pretty sure).
I think it is. Again consider a weird but possible case:

```
%ptr2 = &a[2]
%v2 = <14 x float> load %ptr2
%s3 = <4 x float> shufflevector %v2, undef, <0, 4, 8, 12>
%s4 = <4 x float> shufflevector %v2, undef, <1, 5, 9, 13>
%x = some-nonaliasing-use-eg-extract-element-instr of %s3 or %s4 
%ptr2 = &a[0];
%v1 = <14 x float> load %ptr2
%s1 = shufflevector %v1, undef, <0, 4, 8, 12>
%s2 = shufflevector %v1, undef, <1, 5, 9, 13>

``` 

Unless moving and proving that %ptr2 is movable before %v2 the insertion point must be at %v1. However because %v1 does not dominate %x the transformation would generate an invalid IR. This loop prevents this case. Also because the shuffle vector instructions are not necessarily in the same basic block I think that the dominator relation is the way to go.

Also not the extract element instructions are inserted just before the %s1 to %s3 respectively. 

================
Comment at: lib/CodeGen/InterleavedLoadCombinePass.cpp:1292-1294
+        if (isa<ShuffleVectorInst>(&I)) {
+
+          auto SVI = cast<ShuffleVectorInst>(&I);
----------------
john.brawn wrote:
> You can do this all at once with
> 
> ```
> if (auto SVI = dyn_cast<ShuffleVectorInst>(&I))
> ```
yep (old C habit, sorry)

================
Comment at: lib/CodeGen/InterleavedLoadCombinePass.cpp:1295-1296
+          auto SVI = cast<ShuffleVectorInst>(&I);
+          if (DL.getTypeAllocSize(SVI->getType()) % VectorWidth)
+            continue;
+
----------------
john.brawn wrote:
> Having a predetermined width multiple and using that as the basis for when to try this doesn't seem right. It would make more sense to use TTI->getInterleavedMemoryOpCost and check if we expect the cost to be lower than not doing the transform. I tried
> 
> ```
>           VectorType *Ty = SVI->getType();
>           SmallVector<uint32_t, 4> Mask;
>           for (unsigned i = 0; i < Factor; i++)
>              Mask.push_back(i);
> 
>           VectorType *InterleaveTy = VectorType::get(Ty->getScalarType(), Ty->getNumElements() * Factor);
>           int InterleavedCost = TTI->getInterleavedMemoryOpCost(Instruction::Load, InterleaveTy,
>                                                                 Factor, Mask, 16, 0); // TODO: correct alignment, address space
>           int MemoryOpCost = TTI->getMemoryOpCost(Instruction::Load, Ty, 16, 0);
>           int ShuffleCost = TTI->getInstructionCost(SVI, TargetTransformInfo::TCK_Latency);
>           if (InterleavedCost > Factor * (MemoryOpCost + ShuffleCost)) {
>             continue;
>           }
> ```
> and it seemed like it worked, though maybe here is the wrong place to do this check and maybe it should be in InterleavedLoadCombine::combine.
Good point. I think the correct location for this is in  InterlevedLoadCombine::combine which still can bail. Moreover this function has knowledge of all participating instructions that will be replaced and left dead. Thus we can more precisely determine the cost of all instructions. Also not that the Loads do not necessarily have to be symmetrical. Hence summing instead of multiplying will be better.

================
Comment at: lib/CodeGen/InterleavedLoadCombinePass.cpp:1303-1307
+          if (!VI->SVI)
+            continue;
+
+          if (!VI->PV)
+            continue;
----------------
john.brawn wrote:
> SVI and PV should never be null if compute returned true, so these should be asserts that SVI and PV are non-null.
When calling `VectorInfo::compute()` in general this is true for `VI->PV` but for `VI->SVI` only if the instruction is a `ShuffleVectorInst` which it is in this case. I will change this but I will also call `computeFromSVI` directly.  

================
Comment at: tools/opt/opt.cpp:466
   initializeIndirectBrExpandPassPass(Registry);
+  initializeInterleavedLoadCombinePass(Registry);
   initializeInterleavedAccessPass(Registry);
----------------
john.brawn wrote:
> Do we also need a similar call in initializeCodeGen in lib/CodeGen/CodeGen.cpp?
Not sure too. But, I guess, adding it there will not harm.

Repository:
  rL LLVM

https://reviews.llvm.org/D52653