[PATCH] D51313: [LV] Fix code gen for conditionally executed uniform loads

Ayal Zaks via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Sep 7 09:35:32 PDT 2018


Ayal added a comment.

(post commit review)



================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:5883
+          // gather/scatters are not considered scalar with predication.
+          !Legal->blockNeedsPredication(I.getParent())) {
         // Scalar load + broadcast
----------------
anna wrote:
> Ayal wrote:
> > anna wrote:
> > > Ayal wrote:
> > > > Indeed scalarizing a conditional load from a uniform address is currently broken, when it can turn into a gather. The culprit is that VPReplicateRecipe, when constructed for such instructions, uses an incorrect IsPredicate = isScalarWithPredication(). Which, as noted, is false for loads and stores of types that can be gathered/scattered. I.e., really means mustBeScalarWithPredication. (That, on top of IsUniform which is also incorrect, causing replication as noted in D50665#1209958.)
> > > > 
> > > > This proposed fix to refrain from scalarizing such loads, should** make sure they always turn into a gather**, rather than rely on the cost-based gather-vs-scalarize decision below.
> > > > 
> > > > The above holds also for conditional loads from **non-uniform** addresses, that can turn into gathers, but possibly also get incorrectly scalarized w/o branches. It's hard to test, as the scalarization cost needs to be better than the gather for this to occur. But best make sure these also always turn into a gather.
> > > > 
> > > > It would also/alternatively be good to fix the scalarization of such loads (by correcting IsPredicate), and enable the cost-based gather-vs-scalarize decision. Presumably gather should always win, unless perhaps the load feeds scalarized users.
> > > It seems to me that we need a new function `LoopVectorizationCostModel::isPredicated`, covering a superset of the instructions covered by `LoopVectorizationCostModel::isScalarWithPredication`? 
> > > So, the 'complete' fix here is to correct the costmodel as done above *and* fix `isPredicate` in `VPRecipeBuilder::handleReplication`:
> > > ```
> > > bool IsPredicated = CM.isPredicated(I);
> > > auto *Recipe = new VPReplicateRecipe(I, IsUniform, IsPredicated);
> > > ```
> > > 
> > > Essentially, whether we use a gather/scatter or we do a plain old branch + load/store (because we don't have gather/scatter support), both are predicated - one is a 'vector predicate', the other is scalarized + predicated.
> > > 
> > As for a 'complete' fix...
> > The ten calls to isScalarWithPredication() don't all ask the same thing; it mostly depends on whether setWideningDecision() already took place or not; i.e., the decision if a load/store will be CM_Scalarize (with or w/o predication) or be a CM_GatherScatter. Those called after setWideningDecision() depend on the widening decision set, which depends on the VF, essentially asking if willBeScalarWithPredication(); i.e., excluding CM_GatherScatter. They are the majority, so best have them continue to use isScalarWithPredication(), passing it VF. Three are called before, essentially asking distinct things - here's a breakdown:
> > 
> >   # memInstCanBeWidened(): calls before the decision, in order to make it. Already knows it's a Load/Store. Needs to check if blockNeedsPredication && isMaskRequired && !isLegalMaskedLoad/Store. Note that isLegalMaskedGather/Scatter are irrelevant as they are not considered "Widened".
> > 
> >   # useEmulatedMaskMemRefHack(): calls once before the decision and again after it. But only to assert(isScalarWithPredication()), so need to carefully assert only what's valid in both contexts.
> > 
> >   # getMemInstScalarizationCost: checks what the cost would be if the Load/Store would be scalarized. Needs to check if blockNeedsPredication && isMaskRequired. Note that both isLegalMaskedLoad/Store and isLegalMaskedGather/Scatter are irrelevant.
> > 
> > Of the calls that take place after the decision, note that: 
> > 
> >   # setCostBasedWideningDecision: bumping NumPredStores++ should best be postponed to after the (CM_Scalarize) decision is reached, at the bottom.
> > 
> >   # handleReplication: should call isScalarWithPredication(I, VF) from within getDecisionAndClampRange(), as done with IsUniform.
> > 
> >   # tryToWiden: should call !isScalarWithPredication(I, VF) from within getDecisionAndClampRange().
> > 
> > 
> > Devising tests would be challenging, e.g., making gather/scatter decisions for some VF's yet scalarization for others.
> Thanks for the info Ayal, but unfortunately I don't have the bandwidth currently for a complete fix of all the problems described here. 
> 
> My small fix here is that uniform accesses should be handled correctly for the load case, irrespective of the VF and VPlan (it will then be extended to the store case as the fix for D50665); but as you pointed out, it only adjusts the cost model. The actual vectorization should also be fixed so that a gather is guaranteed to remain a gather.
> 
> From your explanation and reading the code, my understanding is that the right fix for the uniform accesses is:
> 1. what I've currently got here in `setCostBasedWideningDecision` to avoid CM_Scalarize for predicated blocks
> 2. Fix `memoryInstructionCanBeWidened`'s use of isScalarWithPredication 
> 3. Fix `getMemInstScalarizationCost`'s use of isScalarWithPredication so that we will correctly choose the decision.
> 4. Implement isScalarWithPredication(I, VF), where we retrieve the widening decision for the VF and make sure it's NOT CM_GatherScatter.
> 5. handleReplication calls isScalarWithPredication(I, VF)
> 6. tryToWiden: should call !isScalarWithPredication(I, VF) 

OK, sure, understood (and thanks BTW for helping take care of PR38786 :-). This was in response to your "So, the 'complete' fix here is to ..."

OTOH, it looks like this patch does take care of nearly everything listed above ... except for (1) memInstCanBeWidened(), (2) bumping NumPredStores after we've made the decision - which raises a subtle dependence issue, as useEmulatedMaskMemRefHack() uses NumPredStores - and (3) making sure VF is passed to isScalarWithPredication() wherever it should be.

So out of the ten calls to isScalarWithPredication(), two are redirected here to isPredicated() - the assert in useEmulatedMaskMemRefHack() and the call in getMemInstScalarizationCost() - effectively taking care of (2) and (3) above. Note that both evidently deal with mem loads/stores. As an alternative, collectInstsToScalarize() could first check if I is a load/store before it calls useEmulatedMaskMemRefHack(&I). This way, isPredicated(I) would always be used for loads/stores, and could be folded into an `if (blockNeedsPredication && isMaskRequired)`, as noted above for (3).

> This proposed fix to refrain from scalarizing such loads, should make sure they always turn into a gather, rather than rely on the cost-based gather-vs-scalarize decision below.

So the fix here allows conditional loads from uniform addresses to eventually turn into gathers or be scalarized, according to cost-model decision, right? No enforcement to scalarize appears below. 

Wonder if the above addition of !BlockNeedsPredication is still needed(?)


================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:1424
+  /// If a non-zero VF has been calculated, we check if I will be scalarized
+  /// with predication for that VF.
+  bool isScalarWithPredication(Instruction *I, unsigned VF = 1);
----------------
"non-zero" >> "non one": VF is always non-zero. VF=1 usually stands for the original scalar instructions (e.g., possibly unrolled by UF). Here, iiuc, VF>1 corresponds to asking willBeScalarWithPredication() - after per-VF decisions have been taken, whereas VF=1 corresponds to asking mustBeScalarWithPredication() - before per-VF decisions have been taken; as called by the assert of useEmulatedMaskMemRefHack() in one context. But tryToWiden() and handleReplication() may also pass VF=1, and they are called after decisions have been made.

Best use isScalarWithPredication() always after per-VF decisions have been made. In any case, better have each caller specify the VF explicitly, to make sure it's passed whenever available; e.g., see below.

"Such instructions include" also conditional loads, with this patch.


================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:1428
+  // Returns true if \p I is an instruction that will be predicated either
+  // through scalar predication or masked load/store or masked gather/scatter.
+  // Superset of instructions that return true for isScalarWithPredication.
----------------
Note that this interface is VF agnostic; some VF's may decide to scalarize/replicate where other VF's gather/scatter, but they all predicate, or none do.

As mentioned below, this could be simplified to mean isMaskedMemInst(), w/o needing to call isScalarWithPredication(VF=1).


================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:4407
         !(isLegalMaskedLoad(Ty, Ptr)  || isLegalMaskedGather(Ty))
       : !(isLegalMaskedStore(Ty, Ptr) || isLegalMaskedScatter(Ty));
   }
----------------
When called after having made per-VF decisions, having VF=1 simply says it's scalar, rather than checking mustBeScalar here.


================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:5519
     for (Instruction &I : *BB)
       if (isScalarWithPredication(&I)) {
         ScalarCostsTy ScalarCosts;
----------------
`isScalarWithPredication(&I, VF)`


Repository:
  rL LLVM

https://reviews.llvm.org/D51313
