[PATCH] D50665: [LV][LAA] Vectorize loop invariant values stored into loop invariant address

Wed Aug 22 14:24:50 PDT 2018

Ayal added a comment.

In https://reviews.llvm.org/D50665#1208026, @anna wrote:

> In https://reviews.llvm.org/D50665#1206780, @Ayal wrote:
>
> > In https://reviews.llvm.org/D50665#1200509, @anna wrote:
> >
> > >
> >
> >
> > ...
> >
> > > Yes, the stores are scalarized. Identical replicas left as-is. Either passes such as load elimination can remove it, or we can clean it up in LV itself.
> >
> >
> >
> > - - by revisiting LoopVectorizationCostModel::collectLoopUniforms()?  ;-)
>
>
> Right now, I just run instcombine after loop vectorization to clean up those unnecessary stores (and test cases make sure there's only one store left). Looks like there are other places in LV which relies on InstCombine as the clean up pass, so it may not be that bad after all? Thoughts?

Yeah, this is a bit embarrassing, but currently invariant loads also get replicated (and cleaned up later), despite trying to avoid doing so by recording IsUniform in VPReplicateRecipe. In general, if it's simpler and more consistent to generate code in a common template and potentially cleanup later, should be ok provided the cost model accounts for it accurately and cleanup is guaranteed, as checked by tests. BTW, LV already has an internal cse(). But in this case, VPlan should reflect the final outcome better, i.e., with a correct IsUniform. This should be taken care of, possibly by a separate patch.

================
Comment at: include/llvm/Analysis/LoopAccessAnalysis.h:638
+  /// StoreToLoopInvariantAddress.
+  bool NonVectorizableStoreToLoopInvariantAddress;
   /// The diagnostics report generated for the analysis.  E.g. why we
----------------
anna wrote:
> Ayal wrote:
> > Better name it more accurately as, e.g., `VariantStoreToLoopInvariantAddress`?
> Okay, I'll change the name.
> 
> JFI- Today the changed name (VariantStoreToLoopInvariantAddress)  is accurate. However, my plan is to eventually teach the vectorizer about all *safe* uniform stores, not just invariant values stored to invariant address.
> 
> So `VariantStoreToLoopInvariantAddress` can also be vectorized under certain conditions (safe dependence distance calculation for the store versus other memory access). So something like example below can be vectorized [1]:
> ```
>   for (i=0; i<n;i++)
>       for (j=0; j<n; j++) {
>              p[i] = b[j]; 
>              z[j] += b[j];
>       }   
> }
> ```
> 
> However, this cannot be vectorized safely:
> ```
>   for (i=0; i<n;i++)
>       for (j=0; j<n; j++) {
>              z[j] = (++p[i]);  <-- dependence distance for uniform store and load is 1.
>       }   
> }
> ```
> [1] LICM should try to sink the store out of inner loop, but sometimes it cannot do so because it cannot prove dereferencability for the store address or that the store is guaranteed to execute at least once.
Yes, in/variant stores to an invariant address may carry cross-iteration dependencies with other loads/store, which could potentially be checked at runtime similar to 'regular' stores. LV supports reductions/inductions if carried by temporaries only, rather than via memory. Such cases should indeed be LICM'd before vectorization - sinking unconditional stores down to a dominated "middle" block, where it's dereferencable and known to have executed at least once.

================
Comment at: include/llvm/Analysis/LoopAccessAnalysis.h:568
+  /// Checks existence of stores to invariant address inside loop.
+  /// If such stores exist, checks if those are non-vectorizable stores.
+  bool hasVariantStoreToLoopInvariantAddress() const {
----------------
Update above comment as well: "non-vectorizable stores" >> "of variant values"

================
Comment at: lib/Analysis/LoopAccessAnalysis.cpp:1871
+      return false;
+    return TheLoop->hasLoopInvariantOperands(cast<Instruction>(StoreVal));
+  };
----------------
anna wrote:
> Ayal wrote:
> > Again, something LICM may have missed?
> yes, LICM misses this as well - see added test case in `inv_val_store_to_inv_address_conditional_inv`. 
Ahh, but a phi of invariant values is invariant iff the compares that decide which predecessor will reach the phi, are also invariant. In `inv_val_store_to_inv_address_conditional_inv` this holds because there `%cmp` determines which predecessor it'll be, and `%cmp` is invariant. In general Divergence Analysis is designed to provide this answer, as in D50433's `isUniform()`.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:5880
         unsigned Cost = getUniformMemOpCost(&I, VF);
         setWideningDecision(&I, VF, CM_Scalarize, Cost);
         continue;
----------------
anna wrote:
> Ayal wrote:
> > On certain targets, e.g., skx, an invariant store may end up as a scatter, so setting this decision here to avoid that is important; potentially worthy of a note / a test.
> thanks for bringing this up. It exercised the `X86TTIImpl::getMemoryOpCost` which showed the bug in my previous diff for `LoopVectorizationCostModel::getUniformMemOpCost` for uniform store. I was passing in the store's type instead of the store val type. 
> I've also updated it to use the "unified" interface for load/store just like the other cost model calculations - `getGatherScatterCost` etc.
very good

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:5766
+  bool isLoopInvariantValueStored =
+      TheLoop->isLoopInvariant(SI->getValueOperand());
   return TTI.getAddressComputationCost(ValTy) +
----------------
Should be consistent and use the same `isLoopInvariantStoreValue()` noted above.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:5870
+
+      if ((isa<LoadInst>(&I) || isa<StoreInst>(&I)) && Legal->isUniform(Ptr)) {
+        // Load: Scalar load + broadcast
----------------
We expect here that `isa<LoadInst>(&I) || isa<StoreInst>(&I)` (as `memoryInstructionCanBeWidened()` will assert below) having checked `getLoadStorePointerOperand(&I)` above. 

Repository:
  rL LLVM

https://reviews.llvm.org/D50665