[all-commits] [llvm/llvm-project] 87d775: [SLP] Control maximum vectorization factor from TTI

Mon Dec 14 08:50:11 PST 2020

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: 87d7757bbe14fed420092071ded3430072053316
      https://github.com/llvm/llvm-project/commit/87d7757bbe14fed420092071ded3430072053316
  Author: Stanislav Mekhanoshin <Stanislav.Mekhanoshin at amd.com>
  Date:   2020-12-14 (Mon, 14 Dec 2020)

  Changed paths:
    M llvm/include/llvm/Analysis/TargetTransformInfo.h
    M llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
    M llvm/lib/Analysis/TargetTransformInfo.cpp
    M llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
    M llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h
    M llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
    M llvm/test/Transforms/SLPVectorizer/AMDGPU/add_sub_sat.ll
    M llvm/test/Transforms/SLPVectorizer/AMDGPU/round.ll
    M llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll

  Log Message:
  -----------
  [SLP] Control maximum vectorization factor from TTI

D82227 has added a proper check to limit PHI vectorization to the
maximum vector register size. That unfortunately resulted in at
least a couple of regressions on SystemZ and x86.

This change reverts PHI handling from D82227 and replaces it with
a more general check in SLPVectorizerPass::tryToVectorizeList().
Moved to tryToVectorizeList() it allows to restart vectorization
if initial chunk fails.

However, this function is more general and handles not only PHI
but everything which SLP handles. If vectorization factor would
be limited to maximum vector register size it would limit much
more vectorization than before leading to further regressions.
Therefore a new TTI callback getMaximumVF() is added with the
default 0 to preserve current behavior and limit nothing. Then
targets can decide what is better for them.

The callback gets ElementSize just like a similar getMinimumVF()
function and the main opcode of the chain. The latter is to avoid
regressions at least on the AMDGPU. We can have loads and stores
up to 128 bit wide, and <2 x 16> bit vector math on some
subtargets, where the rest shall not be vectorized. I.e. we need
to differentiate based on the element size and operation itself.

Differential Revision: https://reviews.llvm.org/D92059