[PATCH] D122148: [SLP] Peek into loads when hitting the RecursionMaxDepth

Mon Apr 4 07:53:54 PDT 2022

ABataev added a comment.

In D122148#3426311 <https://reviews.llvm.org/D122148#3426311>, @dmgreen wrote:

> I've tried SPEC for AArch64; that was one of the places that showed the need for something like this. For AArch64, along with D122145 <https://reviews.llvm.org/D122145>, this helps one of the routines in x264 enough to speed up the whole thing. The exact speedup is quite dependent on the ordering of shuffles chosen, and may need some more work to come out as good as it can. The introduction of select shuffles wasn't great for targets without them. We are talking about something like a 5% improvement.
>
> I've been trying to run X86 too, but I don't have a great setup for it. On what I think is some sort of "Xeon 6148" thing, these were the performance scores I saw from running SPEC 2017 with -march=native (again with this and D122145 <https://reviews.llvm.org/D122145>):
>
>   500.perlbench_r    0.829412281
>   502.gcc_r          0.398122431
>   505.mcf_r         -0.352758469
>   520.omnetpp_r      0.256906516
>   523.xalancbmk_r   -0.789195872
>   525.x264_r         0.198126785
>   531.deepsjeng_r   -0.024245574
>   541.leela_r        0.003397322
>   557.xz_r          -0.722254612
>
> They can be quite noisy though I'm afraid, even if those results are averaged between three runs. I wouldn't be surprised if all those changes were down to machine noise or knock-on alignment changes. We have a better setup for AArch64 than we do for X86, but there is always some noise.
>
> I tried 2006 too on AArch64. It was only the perlbench binary that changed there, according to the file hashes. And the performance was the same.

I believe that this patch and D122145 <https://reviews.llvm.org/D122145> can be landed; they just need a bit of tweaking.

================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:4102-4108
+         all_of(VL,
+                [](const Value *I) {
+                  return match(I, m_ZExt(m_Load(m_Value())));
+                }) ||
+         all_of(VL, [](const Value *I) {
+           return match(I, m_SExt(m_Load(m_Value())));
+         })))) {
----------------
I would also add a check that we have at least 4 values (`VL.size() >= 4`), since for 2 elements it may cause regressions. We should probably also check for single uses, to avoid extra extractelements.
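
A rough sketch of the shape of that combined condition, for illustration only. It uses toy stand-in types rather than the real `Value` and PatternMatch API, so `ToyValue`, `ExtKind` and `isProfitableExtLoadBundle` are all hypothetical names. It also assumes the single-use check applies to each extended value; in the actual patch something like `hasOneUse()` on the relevant values would presumably take the place of the `NumUses` field.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for an IR value; this is NOT llvm::Value.
enum class ExtKind { None, ZExtOfLoad, SExtOfLoad };

struct ToyValue {
  ExtKind Kind;     // structural shape of the value (zext(load), sext(load), other)
  unsigned NumUses; // number of users of the value
};

// Returns true only if the bundle has at least 4 values, every value is the
// same kind of extended load, and every value has exactly one use.
bool isProfitableExtLoadBundle(const std::vector<ToyValue> &VL) {
  // Small bundles (e.g. 2 elements) are rejected to avoid regressions.
  if (VL.size() < 4)
    return false;

  // All values must match the given kind and be single-use, so that no
  // extra extractelements are needed for other users.
  auto AllOfKind = [&VL](ExtKind K) {
    return std::all_of(VL.begin(), VL.end(), [K](const ToyValue &V) {
      return V.Kind == K && V.NumUses == 1;
    });
  };

  return AllOfKind(ExtKind::ZExtOfLoad) || AllOfKind(ExtKind::SExtOfLoad);
}
```

With this model, a bundle of four single-use zext(load) values is accepted, while a two-element bundle, a bundle mixing zext and sext, or a bundle where any value has a second use is rejected.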

https://reviews.llvm.org/D122148