[PATCH] D113973: [LoopVectorize][CostModel] Choose smaller VFs for in-loop reductions with no loads/stores

Tue Nov 16 10:02:04 PST 2021

sdesmalen added inline comments.

================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/smallest-and-widest-types.ll:47-48
+for.body:
+  %s.09 = phi double [ 0.000000e+00, %entry ], [ %add, %for.body ]
+  %i.08 = phi i64 [ 0, %entry ], [ %inc, %for.body ]
+  %conv = sitofp i64 %i.08 to double
----------------
lebedev.ri wrote:
> Where is `ElementTypesInLoop` populated?
> `LoopVectorizationCostModel::collectElementTypesForWidening()` suggests that PHI nodes are also queried.
The function `collectElementTypesForWidening()` collects types for the following reasons:
* Types of loads/stores, because these are used in computing a safe dependence distance (and also to have a sensible/natural VF as maximum VF).
* Types of unordered reductions, in order to avoid generating vector PHI nodes that span multiple registers when the VF is too wide. In-order reductions are not considered, because those PHI nodes remain scalar after vectorization.

If we consider *more* types in the loop in `collectElementTypesForWidening()`, e.g. for in-order reductions or extends, then this leads to regressions; any type larger than the maximum loaded/stored type will limit the maximum VF. If the maximum VF is limited, then any VF upto the wider maximum will not be considered by the cost-model. So in general, it's better not to limit the VF for those reasons and leave it up to the cost-model to choose the best VF.

In the test-case that @RosieSumpter added, there are no loads/stores and there is no vector PHI node because the reduction is ordered, so it considers any VF up-to the widest-possible VF based on an i8 element type (even if the loop operates on a larger size). The cost-model then only considers the throughput cost, and the per-lane cost is no different between VF=4 and VF=16, so it favours VF=16. It does not consider the additional code-size cost when the operation is scalarized. 

I guess the alternative choices are to:
* Look through the operands of the ordered reduction operation, as well as any casts/extends on those operands, and only consider these element types in `collectElementTypesForWidening()` if there are no loads/stores in the loop. This heuristic gets quite complicated I think. 
* Consider the code-size cost for in-order reductions, so that it favours shorter VFs (I'm not sure if this is desirable, as it may regress other cases where code-size isn't relevant).

Because the case is so niche (i.e. a loop with *only* in-order reductions), @RosieSumpter thought it made sense to fall back to a more sensible default instead.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D113973/new/

https://reviews.llvm.org/D113973