[PATCH] D150288: [SLP] Simplify `tryToVectorizeSequence()`

Thu May 18 16:56:29 PDT 2023

ABataev added inline comments.

================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:14339
+    auto &Seeds = Pair.second;
+    if (Seeds.empty())
+      continue;
----------------
Seeds.size() <= 1

================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:14342
+    // Try the full Seeds vector first.
+    if (TryToVectorizeHelper(Seeds, /*MaxVFOnly=*/false)) {
       Changed = true;
----------------
What I'm afraid is that it may prevent some full-register optimizations. You will early optimize part register tress and miss some full-register trees (probably, with alternate opcodes)

================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:14361-14365
-      auto GetMinNumElements = [&R](Value *V) {
-        unsigned EltSize = R.getVectorElementSize(V);
-        return std::max(2U, R.getMaxVecRegSize() / EltSize);
-      };
-      if (NumElts < GetMinNumElements(*IncIt) &&
----------------
vporpo wrote:
> ABataev wrote:
> > Does it make sense to drop this check?
> Isn't this already checked within `TryToVectorizeHelper`? If I am not mistaken both `vectorizeStores()` and `tryToVectorizeList()` already check for minimum VF.
It just early skips not profitable cases, saves some compile time, checks that vector can be fed with the elements, otherwise it does not worth trying vectorization, it will be scalarized.

================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:14276
-    function_ref<bool(ArrayRef<T *>, bool)> TryToVectorizeHelper,
-    bool MaxVFOnly, BoUpSLP &R) {
   bool Changed = false;
----------------
aeubanks wrote:
> vporpo wrote:
> > ABataev wrote:
> > > vporpo wrote:
> > > > ABataev wrote:
> > > > > ABataev wrote:
> > > > > > vporpo wrote:
> > > > > > > ABataev wrote:
> > > > > > > > vporpo wrote:
> > > > > > > > > ABataev wrote:
> > > > > > > > > > vporpo wrote:
> > > > > > > > > > > ABataev wrote:
> > > > > > > > > > > > vporpo wrote:
> > > > > > > > > > > > > ABataev wrote:
> > > > > > > > > > > > > > ABataev wrote:
> > > > > > > > > > > > > > > vporpo wrote:
> > > > > > > > > > > > > > > > ABataev wrote:
> > > > > > > > > > > > > > > > > vporpo wrote:
> > > > > > > > > > > > > > > > > > ABataev wrote:
> > > > > > > > > > > > > > > > > > > Removal of this parameter causes regressions in the vectorization, better to restore it.
> > > > > > > > > > > > > > > > > > Do you have some examples where this is needed?
> > > > > > > > > > > > > > > > > No, I don't, just from my experience when I added this parameter. Testing showed regressions in LTO mode with it, IIRC.
> > > > > > > > > > > > > > > > Is this in SPEC?
> > > > > > > > > > > > > > > IIRC, test/Transforms/SLPVectorizer/slp-max-phi-size.ll is one of the tests
> > > > > > > > > > > > > > Yes
> > > > > > > > > > > > > OK, I will run SPEC lto to check. 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > IIRC, test/Transforms/SLPVectorizer/slp-max-phi-size.ll is one of the tests
> > > > > > > > > > > > > Doesn't slp-max-phi-size.ll look OK though? As far as I understand the purpose of this test is to make sure that we don't vectorize wider than specified by the `-slp-max-vf` option.
> > > > > > > > > > > > Probably. If I do remember correctly, it caused regresssions because of too early optimization of phi nodes and some other instructions.
> > > > > > > > > > > I checked the C/C++ benchmarks of CPU2017 and I don't see any regressions. I also checked with our internal testing and there are no issues. So I think it is pretty safe. 
> > > > > > > > > > > In the unlikely case that it causes a regression, we can look into it again and add any missing tests. Wdyt?
> > > > > > > > > > Did you try LTO?
> > > > > > > > > Ah let me check. I guess you mean `-flto` right?
> > > > > > > > Yes. There were regressions because of too early phi nodes vectorization at compile time, which prevents some later optimizations at link time. We could fix this by checking if LTO is enabled and the stage is linking. In this case we can set the flag to true for phis and other nodes. It would be good if you could make a patch for this!
> > > > > > > I checked  an LTO build with and I didn't see any performance difference. It's all within the noise margin. 
> > > > > > > This is the list of test I tried:
> > > > > > > ```
> > > > > > > 500.perlbench_r
> > > > > > > 502.gcc_r
> > > > > > > 505.mcf_r
> > > > > > > 508.namd_r
> > > > > > > 511.povray_r
> > > > > > > 519.lbm_r
> > > > > > > 523.xalancbmk_r
> > > > > > > 525.x264_r
> > > > > > > 531.deepsjeng_r
> > > > > > > 538.imagick_r
> > > > > > > 544.nab_r
> > > > > > > 557.xz_r
> > > > > > > ```
> > > > > > > However, my build configuration may be different than yours. Could you try out the patch with your configuration and point me to the right benchmark/build options?
> > > > > > > 
> > > > > > > Changing the functionality depending on whether we are in the linking stage, just to make a SPEC test run faster in some specific configuration does not sound a very good solution to me. 
> > > > > > The solution is the same, just at link time we can ignore limits and do our best to vectorize even small sequences, which may cause regression at compile time.
> > > > > Also, try to compile everything with -march=native on something like Skylake
> > > > Yeah I used `native` for the builds, and I ran it on a `Xeon(R) Gold 6154`.
> > > > 
> > > > > just at link time we can ignore limits and do our best to vectorize even small sequences
> > > > So do you mean like setting `MaxVFOnly` to true only for the linking stage of LTO ?
> > > > Yeah I used `native` for the builds, and I ran it on a `Xeon(R) Gold 6154`.
> > > > 
> > > 
> > > Then check with the default target also (SSE2). Check D103638, D108740, D92059 for more context about the changes.
> > > 
> > > > > just at link time we can ignore limits and do our best to vectorize even small sequences
> > > > So do you mean like setting `MaxVFOnly` to true only for the linking stage of LTO ?
> > > 
> > > Yes, something like this. Because currently we have to limit some vectorization to avoid regressions with LTO.
> > > 
> > > Yes, something like this. Because currently we have to limit some vectorization to avoid regressions with LTO.
> > 
> > But how am I going to test how well this will work? I am pretty sure this won't be the only phase-ordering issue in LTO builds if SLP is enabled in pre-link. 
> https://reviews.llvm.org/D148010 sounds like the proper solution for this sort of issue
I rather doubt it would be good solution for SLP. I tried something like this before, saw lots of regressions. Better to run some optimizations (at full register, for example) at compile time and the remaining ones at link time, otherwise there might be not enough resources. SLP vectorizer has some internal limits and optimizing at link time only may exhaust it too fast.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150288/new/

https://reviews.llvm.org/D150288