[PATCH] D71919: [LoopVectorize] Disable single stride access predicates when gather loads are available.

Dave Green via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Jan 2 03:11:34 PST 2020


dmgreen marked an inline comment as done.
dmgreen added a comment.

> OK.
> 
> Longer version:
> 
> Agreed, this is the place that gathers all symbolic strides for which runtime checks are later added by replaceSymbolicStrideSCEV().
> 
> Agreed, a gather or scatter would probably be a better option than a runtime check, certainly if the stride turns out to be other than 1, in which case the runtime check option will execute the original scalar loop.
> Note that if the stride does turn out to be 1, a runtime check may be faster: the cost of a vector load/store is typically less than that of a gather/scatter, disregarding the overhead of the runtime check itself. So having a way to "manually" restore the original performance for such cases may be useful (in addition to EnableMemAccessVersioning). Always preferring a gather or scatter as suggested should be a good step forward, given the expected complexity of devising a cost-based preference.
> 
> Instead of teaching LAI to make such target/cost-based decisions, it would have been better to let this analysis continue to collect *all* symbolic strides potentially subject to runtime checks, and teach the planning/transform to prune/decide which strides to actually specialize; e.g., have LVP::plan() start by calling "CM.setVersioningForStrideDecisions()", analogous to InterleavedAccessInfo::analyzeInterleaving() which collects all potential interleave groups, and CM::setCostBasedWideningDecision() which decides which of the groups to materialize (per VF). However, this requires a fair amount of refactoring; worth a `TODO`?

Thanks for taking a look. I agree, this all feels like a layering violation to me! (I found an existing TODO in LAA saying the same thing). The refactoring does seem like a lot of work though. What can I say, I'm looking forward to a point where VPlan can make some of these decisions more structurally.

We (ARM MVE) don't have gathers/scatters enabled yet; this issue just came up in the second thing I tried (matrix multiply). It may be a little while before we have them turned on by default.
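
To make the trade-off described above concrete, here is a rough C-level sketch of a loop with a symbolic stride; the function and variable names are purely illustrative and not taken from the patch:

  /* A loop whose stride is only known at run time; LAA collects Stride as a
     symbolic stride for which a runtime check could later be added via
     replaceSymbolicStrideSCEV(). */
  void strided_read(int *restrict Dst, const int *restrict Src,
                    long Stride, long N) {
    for (long I = 0; I < N; ++I)
      Dst[I] = Src[I * Stride];
  }

  /* Stride versioning conceptually produces:
       if (Stride == 1)
         vectorized loop with contiguous vector loads from Src
       else
         original scalar loop
     whereas a gather-based vectorization loads Src[I*Stride], ...,
     Src[(I+VF-1)*Stride] with a gather each iteration, handling any Stride
     value but costing more than a plain vector load when Stride happens
     to be 1. */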



================
Comment at: llvm/test/Transforms/LoopVectorize/X86/optsize.ll:151
 ; AUTOVF-LABEL: for.body:
-define void @scev4stride1(i32* noalias nocapture %a, i32* noalias nocapture readonly %b, i32 %k) #2 {
+define void @scev4stride1(i16* noalias nocapture %a, i16* noalias nocapture readonly %b, i32 %k) #2 {
 for.body.preheader:
----------------
Ayal wrote:
> Would indeed be good to have an i16 version retaining the current checks (unvectorized behavior), as skx supports gathers of i32 but not of i16; and also the original i32 version with checks for the vectorized code.
The gather version of the original 32-bit code here seems to be quite large for -Os. There appear to be large constant pool entries that the gather needs to access.
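
For reference, a rough C-level picture of the two test variants being discussed; the loop body is only a guess at the general shape of the IR test, and the signatures are hypothetical:

  /* i32 variant: skx supports gathers of i32, so this version would keep
     checks for the vectorized (gather) code. */
  void scev4stride1_i32(int *restrict a, const int *restrict b, int k, int n) {
    for (int i = 0; i < n; ++i)
      a[i] = b[i * k];
  }

  /* i16 variant: skx has no i16 gathers, so this version should retain the
     current checks (unvectorized behavior). */
  void scev4stride1_i16(short *restrict a, const short *restrict b, int k, int n) {
    for (int i = 0; i < n; ++i)
      a[i] = b[i * k];
  }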


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71919/new/

https://reviews.llvm.org/D71919