[PATCH] D113973: [LoopVectorize][CostModel] Choose smaller VFs for in-loop reductions with no loads/stores

Florian Hahn via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Nov 22 02:40:18 PST 2021


fhahn added inline comments.


================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:6353
         const RecurrenceDescriptor &RdxDesc = Legal->getReductionVars()[PN];
         if (PreferInLoopReductions || useOrderedReductions(RdxDesc) ||
             TTI.preferInLoopReduction(RdxDesc.getOpcode(),
----------------
dmgreen wrote:
> sdesmalen wrote:
> > dmgreen wrote:
> > > This deliberately excludes InLoopReductions from the maximum width of the register - because the phi remains scalar and it's useful for integer reductions under MVE where they can sum a vecreduce(v16i32 sext(v16i8)) in a single operation. That might not be as useful for float types - and the example loop you showed with no loads/stores in the loop is much less likely for integer - it will already have been simplified.
> > > 
> > > Perhaps we just remove this for ordered (float) reductions? Or does that lead to regressions?
> > > This deliberately excludes InLoopReductions from the maximum width of the register - because the phi remains scalar and it's useful for integer reductions under MVE where they can sum a vecreduce(v16i32 sext(v16i8)) in a single operation.
> > Yes, the same is true for SVE. There is also code in the cost-model to recognise the extend to the wider type.
> > 
> > I think the actual reason for the reduction-code being here is to avoid ending up with vector PHIs that are too wide (out-of-loop reduction). The checks for in-loop reductions were added later because (1) there is no vector PHI and (2) it doesn't limit the VF too early, so the cost-model can still consider the wider VFs for the reason you described.
> > 
> > > Perhaps we just remove this for ordered (float) reductions? Or does that lead to regressions?
> > I don't think we should add specific knowledge to limit the VF for fp-reductions here, because that means adding target-specific knowledge to `collectElementTypesForWidening`, which is something that the cost-model should decide on. Also, it would lead to regressions.
> Yeah, we added it in https://reviews.llvm.org/D93476. I would have no objections to removing this for ordered reductions or float types I don't think. There is no such thing (as far as I understand) as an extending in-order float reduction, and we can rely on the UF for things that will become wider-than-legal vectors.
Would it be possible to use TTI to pick the min/max type width to use there for the reduction, which would encode the target-specific knowledge?



================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/smallest-and-widest-types.ll:47-48
+for.body:
+  %s.09 = phi double [ 0.000000e+00, %entry ], [ %add, %for.body ]
+  %i.08 = phi i64 [ 0, %entry ], [ %inc, %for.body ]
+  %conv = sitofp i64 %i.08 to double
----------------
sdesmalen wrote:
> lebedev.ri wrote:
> > Where is `ElementTypesInLoop` populated?
> > `LoopVectorizationCostModel::collectElementTypesForWidening()` suggests that PHI nodes are also queried.
> The function `collectElementTypesForWidening()` collects types for the following reasons:
> * Types of loads/stores, because these are used in computing a safe dependence distance (and also to have a sensible/natural VF as maximum VF).
> * Types of unordered reductions, in order to avoid generating vector PHI nodes that span multiple registers when the VF is too wide. In-order reductions are not considered, because those PHI nodes remain scalar after vectorization.
> 
> If we consider *more* types in the loop in `collectElementTypesForWidening()`, e.g. for in-order reductions or extends, then this leads to regressions; any type larger than the maximum loaded/stored type will limit the maximum VF. If the maximum VF is limited, then any VF up to the wider maximum will not be considered by the cost-model. So in general, it's better not to limit the VF for those reasons and to leave it up to the cost-model to choose the best VF.
> 
> In the test-case that @RosieSumpter added, there are no loads/stores and there is no vector PHI node because the reduction is ordered, so it considers any VF up to the widest possible VF based on an i8 element type (even if the loop operates on a larger size). The cost-model then only considers the throughput cost, and the per-lane cost is no different between VF=4 and VF=16, so it favours VF=16. It does not consider the additional code-size cost when the operation is scalarized.
> 
> I guess the alternative choices are to:
> * Look through the operands of the ordered reduction operation, as well as any casts/extends on those operands, and only consider these element types in `collectElementTypesForWidening()` if there are no loads/stores in the loop. This heuristic gets quite complicated, I think.
> * Consider the code-size cost for in-order reductions, so that it favours shorter VFs (I'm not sure if this is desirable, as it may regress other cases where code-size isn't relevant).
> 
> Because the case is so niche (i.e. a loop with *only* in-order reductions), @RosieSumpter thought it made sense to fall back to a more sensible default instead.
> 
I think picking the smallest integer type as the max width unfortunately comes with its own problems. By choosing 32 bits as the max width on AArch64, won't we pessimize codegen for loops with fp16 in-loop reductions (by choosing VF 4 instead of VF 8), for example?

It is hard to say whether there are similar impacts on other targets.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D113973/new/

https://reviews.llvm.org/D113973


