[PATCH] D51313: [LV] Fix code gen for conditionally executed uniform loads

Wed Aug 29 08:23:35 PDT 2018

anna added inline comments.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:5883
+          // gather/scatters are not considered scalar with predication.
+          !Legal->blockNeedsPredication(I.getParent())) {
         // Scalar load + broadcast
----------------
Ayal wrote:
> anna wrote:
> > Ayal wrote:
> > > Indeed scalarizing a conditional load from a uniform address is currently broken, when it can turn into a gather. The culprit is that VPReplicateRecipe, when constructed for such instructions, uses an incorrect IsPredicate = isScalarWithPredication(). Which, as noted, is false for loads and stores of types that can be gathered/scattered. I.e., really means mustBeScalarWithPredication. (That, on top of IsUniform which is also incorrect, causing replication as noted in D50665#1209958.)
> > > 
> > > This proposed fix to refrain from scalarizing such loads, should** make sure they always turn into a gather**, rather than rely on the cost-based gather-vs-scalarize decision below.
> > > 
> > > The above holds also for conditional loads from **non-uniform** addresses, that can turn into gathers, but possibly also get incorrectly scalarized w/o branches. It's hard to test, as the scalarization cost needs to be better than the gather for this to occur. But best make sure these also always turn into a gather.
> > > 
> > > It would also/alternatively be good to fix the scalarization of such loads (by correcting IsPredicate), and enable cost-based gather-vs-scalarize decision. Presumably gather should always win, unless perhaps if the load feed scalarized users.
> > It seems to me that we need a new function `LoopVectorizationCostModel::isPredicated` which is a superset of instructions contained by `LoopVectorizationCostModel::isScalarWithPredication`? 
> > So, the 'complete' fix here is to correct the costmodel as done above *and* fix `isPredicate` in `VPRecipeBuilder::handleReplication`:
> > ```
> > bool IsPredicated = CM.isPredicated(I);
> > auto *Recipe = new VPReplicateRecipe(I, IsUniform, IsPredicated);
> > ```
> > 
> > Essentially, whether we use a gather/scatter or we do plain old branch + load/store, (because we don't have gather/scatter support), both are predicated - one is a 'vector predicate', other is scalarized + predicated.
> > 
> As for a 'complete' fix...
> The ten calls to isScalarWithPredication() don't all ask the same thing; it mostly depends on whether setWideningDecision() already took place or not; i.e., the decision if a load/store will be CM_Scalarize (with or w/o predication) or be a CM_GatherScatter. Those called after setWideningDecision() depend on the widening decision set, which depends on the VF, essentially asking if willBeScalarWithPredication(); i.e., excluding CM_GatherScatter. They are the majority, so best have them continue to use isScalarWithPredication(), passing it VF. Three are called before, essentially asking distinct things - here's a breakdown:
> 
>   # memInstCanBeWidened(): calls before the decision, in order to make it. Already knows it's a Load/Store. Needs to check if blockNeedsPredication && isMaskRequired && !isLegalMaskedLoad/Store. Note that isLegalMaskedGather/Scatter are irrelevant as they are not considered "Widened".
> 
>   # useEmulatedMaskMemRefHack(): calls once before the decision and again after it. But only to assert(isScalarWithPredication()), so need to carefully assert only what's valid in both contexts.
> 
>   # getMemInstScalarizationCost: checks what the cost would be if the Load/Store would be scalarized. Needs to check if blockNeedsPredication && isMaskRequired. Note that both isLegalMaskedLoad/Store and isLegalMaskedGather/Scatter are irrelevant.
> 
> Of the calls that take place after the decision, note that: 
> 
>   # setCostBasedWideningDecision: bumping NumPredStores++ should best be postponed to after the (CM_Scalarize) decision is reached, at the bottom.
> 
>   # handleReplication: should call isScalarWithPredication(I, VF) from within getDecisionAndClampRange(), as done with IsUniform.
> 
>   # tryToWiden: should call !isScalarWithPredication(I, VF) from within getDecisionAndClampRange().
> 
> 
> Devising tests would be challenging, e.g., making gather/scatter decisions for some VF's yet scalarization for others.
Thanks for the info Ayal, but unfortunately I don't have the bandwidth currently for a complete fix of all the problems described here. 

My small fix here is the uniform accesses should be handled correctly for the load case  irrespective of the VF and VPlan. (and then it will be extended for the store as the fix for D50665), but as you pointed out it only adjusts the cost model. The actual vectorization should also be fixed so that a gather is guaranteed to remain a gather.

>From your explanation and reading the code, my understanding is that the right fix for the uniform accesses is:
1. what I've currently got here in `setCostBasedWideningDecision` to avoid CM_Scalarize for predicated blocks
2. Fix `memoryInstructionCanBeWidened` use of isScalarWithPredication 
3. Fix for `getMemInstScalarizationCost` use of isScalarWithPredication  so that we will correctly choose the decision.
4. Implement isScalarWithPredication(I, VF) where we retrive the widening_decision for the VF and make sure it's NOT CM_gatherscatter.
4. handleReplication calls isScalarWithPredication(I, VF)
5. tryToWiden: should call !isScalarWithPredication(I, VF) 

Repository:
  rL LLVM

https://reviews.llvm.org/D51313