[PATCH] D71919: [LoopVectorize] Disable single stride access predicates when gather loads are available.

Ayal Zaks via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Dec 31 03:46:12 PST 2019


Ayal added a comment.

> The LoopVectorizer/LAA has the ability to add runtime checks for memory accesses that look like they may be single stride accesses, in an attempt to still run vectorized code. This can happen in a boring matrix multiply kernel, for example: [snip]
> 
> However, if we have access to efficient vector gather loads, they are a much better option than vectorizing with runtime checks for a stride of 1.
> 
> This adds a check in the place that appears to dictate this, LAA, testing whether a MaskedGather or MaskedScatter would be legal.

OK.

Longer version:

Agreed, this is the place that gathers all symbolic strides for which runtime checks are later added by replaceSymbolicStrideSCEV().

Agreed, a gather or scatter would probably be a better option than a runtime check, certainly if the stride turns out to be other than 1, in which case the runtime check option will execute the original scalar loop.
Note that if the stride does turn out to be 1, a runtime check may be faster: the cost of a vector load/store is typically less than that of a gather/scatter, disregarding the overhead of the runtime check itself. So having a way to "manually" restore the original performance for such cases may be useful (in addition to EnableMemAccessVersioning). Always preferring a gather or scatter as suggested should be a good step forward, given the expected complexity of devising a cost-based preference.
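To make the trade-off concrete, here is a minimal illustrative kernel (not the one snipped from the summary above; names and bounds are made up) where the stride is only known at run time:

  // Hypothetical example: 'k' is a symbolic stride LAA can version on.
  void copy_strided(int *a, const int *b, int k, int n) {
    for (int i = 0; i < n; ++i)
      a[i] = b[i * k]; // contiguous only when k == 1
  }
  // With versioning, the vectorizer conceptually emits:
  //   if (k == 1) { vector loop using contiguous loads of b }
  //   else        { original scalar loop }
  // A gather instead keeps a vector body for any k, but costs more per
  // access when k happens to be 1.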

Instead of teaching LAI to make such target/cost-based decisions, it would have been better to let this analysis continue to collect *all* symbolic strides potentially subject to runtime checks, and teach the planning/transform to prune/decide which strides to actually specialize; e.g., have LVP::plan() start by calling "CM.setVersioningForStrideDecisions()", analogous to InterleavedAccessInfo::analyzeInterleaving() which collects all potential interleave groups, and CM::setCostBasedWideningDecision() which decides which of the groups to materialize (per VF). However, this requires a fair amount of refactoring; worth a `TODO`?
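For concreteness, a very rough sketch of that flow, with hypothetical names and approximated signatures (setVersioningForStrideDecisions() does not exist today):

  // Sketch only: LAA keeps collecting *all* symbolic strides, and the
  // planner decides which of them to actually version on.
  Optional<VectorizationFactor>
  LoopVectorizationPlanner::plan(unsigned UserVF, unsigned UserIC) {
    // Collect-then-decide, analogous to IAI.analyzeInterleaving() followed
    // by CM.setCostBasedWideningDecision().
    CM.setVersioningForStrideDecisions(); // hypothetical: drop strides whose
                                          // accesses are better served by a
                                          // legal gather/scatter.
    // ... rest of plan() unchanged ...
  }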



================
Comment at: llvm/lib/Analysis/LoopAccessAnalysis.cpp:2287
+  if (LoadInst *LI = dyn_cast<LoadInst>(MemAccess)) {
     Ptr = LI->getPointerOperand();
+    if (TTI && TTI->isLegalMaskedGather(LI->getType(),
----------------
Separate the existing `Value *Ptr = getLoadStorePointerOperand(MemAccess);  if (!Ptr) return;`  part from the new gather/scatter consideration?

Would have been nice to reuse LV's `isLegalGatherOrScatter(Value *V)`, or perhaps refactor it into `if (TTI && TTI->isLegalGatherOrScatter(MemAccess)) return;`?

Worth reporting strides filtered out this way with LLVM_DEBUG messages.

(Can check if Ptr is already in SymbolicStrides and exit early; unrelated change.)
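Roughly, putting these points together (sketch only, using the hypothetical `TTI->isLegalGatherOrScatter(MemAccess)` hook mentioned above):

  void LoopAccessInfo::collectStridedAccess(Value *MemAccess) {
    // Keep the existing pointer extraction separate...
    Value *Ptr = getLoadStorePointerOperand(MemAccess);
    if (!Ptr)
      return;

    // ...from the new gather/scatter consideration.
    if (TTI && TTI->isLegalGatherOrScatter(MemAccess)) {
      LLVM_DEBUG(dbgs() << "LAA: not versioning on stride; a legal "
                           "gather/scatter can handle: "
                        << *MemAccess << "\n");
      return;
    }

    // (Unrelated: could also exit early here if Ptr is already in
    // SymbolicStrides.)

    // ... existing symbolic-stride collection ...
  }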


================
Comment at: llvm/test/Transforms/LoopVectorize/X86/optsize.ll:151
 ; AUTOVF-LABEL: for.body:
-define void @scev4stride1(i32* noalias nocapture %a, i32* noalias nocapture readonly %b, i32 %k) #2 {
+define void @scev4stride1(i16* noalias nocapture %a, i16* noalias nocapture readonly %b, i32 %k) #2 {
 for.body.preheader:
----------------
Would indeed be good to have an i16 version retaining the current checks (unvectorized behavior), as SKX supports gathers of i32 but not i16; and also to keep the original i32 version with checks for the vectorized code.
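For reference, the source-level shape of that test is roughly the following (assumed from the test name and the IR above, so details such as the trip count are guesses):

  // Assumed shape of scev4stride1: the stride 'k' is a run-time value.
  void scev4stride1(short *a, const short *b, int k) { // i16 variant
    for (int i = 0; i < 256; ++i)                      // trip count assumed
      a[i] = b[i * k];
  }
  // With i16 elements SKX has no legal gather, so the current unvectorized
  // checks should still apply; the original i32 variant can now use gathers
  // and deserves its own checks for vectorized code.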


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71919/new/

https://reviews.llvm.org/D71919




