[PATCH] D81416: [LV][SLP] Interleave to expose ILP for small loops with scalar reductions.

Mon Jun 15 13:14:25 PDT 2020

AaronLiu added inline comments.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:5368
+  LLVM_DEBUG(dbgs() << "LV: Loop cost is " << LoopCost << '\n'
+                    << "LV: IC is " << IC << '\n'
+                    << "LV: VF is " << VF << '\n');
----------------
bmahjour wrote:
> IC reported here may be different from the interleave count that is finally returned from this function. It's probably better not to emit it here since it's not finalized. The VF is also available elsewhere in the debug trace, so not sure if it's worth changing this debug output.
Thanks for the review @bmahjour! Correct, IC may differ from the interleave count that is finally returned. The reason for printing IC here is to show its value before and after "Interleaving to expose ILP". For example, if you add "-mllvm -debug-only=loop-vectorize" to the clang/clang++ invocation and compile the provided testcase, you will get something like the following output:
...
LV: Loop cost is 8
LV: IC is 8
LV: VF is 1
LV: Interleaving to expose ILP.
...
LV: Interleave Count is 4
Setting best plan to VF=1, UF=4
...
Only two lines are added here. Compared with the large amount of debug output that the LV cost model emits for every instruction, plus the digraph VPlan debug output, this is very little, and I find that this small amount of information is very useful for knowing what's going on at this point.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:5409
+      LLVM_DEBUG(dbgs() << "LV: Interleaving to expose ILP.\n");
+      return std::max(IC / 2, SmallIC);
+    } else {
----------------
bmahjour wrote:
> What's the significance of the value `2` here?
Taking the output above as an example: the normal IC is 8, and SmallIC is definitely no more than 2 after calculation. SmallIC is too small to benefit SLP, so with it the provided testcase will not be vectorized. The normal IC, on the other hand, is somewhat too large in rare situations where resources are very limited, for example in full-width runs when all CPUs are busy. Dividing by 2 makes the choice less aggressive than the normal IC while still allowing the testcase to be vectorized.
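
To make the arithmetic concrete, here is a minimal standalone sketch of the selection in the hunk above. The function name `halvedInterleaveCount` is hypothetical and is not part of LoopVectorize.cpp; it only mirrors the `std::max(IC / 2, SmallIC)` expression, and the SmallIC value of 2 is an assumed upper bound taken from the explanation above.

```cpp
#include <algorithm>

// Hypothetical helper (not the actual LLVM function): mirrors the
// expression `std::max(IC / 2, SmallIC)` from the patch hunk.
// IC      - the normal interleave count computed earlier.
// SmallIC - the smaller count computed for small loops.
unsigned halvedInterleaveCount(unsigned IC, unsigned SmallIC) {
  // Halve the normal count to be less aggressive, but never go
  // below the small-loop count.
  return std::max(IC / 2, SmallIC);
}
```

With the values from the debug trace (IC = 8) and an assumed SmallIC of 2, this gives max(8 / 2, 2) = 4, which is consistent with the "LV: Interleave Count is 4" line in the output.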

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D81416/new/

https://reviews.llvm.org/D81416