[llvm] [LV] Change loops' interleave count computation (PR #73766)

Wed Nov 29 06:49:30 PST 2023

================
@@ -5737,10 +5741,15 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
   // the InterleaveCount as if vscale is '1', although if some information about
   // the vector is known (e.g. min vector size), we can make a better decision.
   if (BestKnownTC) {
-    MaxInterleaveCount =
-        std::min(*BestKnownTC / VF.getKnownMinValue(), MaxInterleaveCount);
-    // Make sure MaxInterleaveCount is greater than 0.
-    MaxInterleaveCount = std::max(1u, MaxInterleaveCount);
+    if (InterleaveSmallLoopScalarReduction ||
+        (*BestKnownTC % VF.getKnownMinValue() == 0))
+      MaxInterleaveCount =
+          std::min(*BestKnownTC / VF.getKnownMinValue(), MaxInterleaveCount);
+    else
+      MaxInterleaveCount = std::min(*BestKnownTC / (VF.getKnownMinValue() * 2),
+                                    MaxInterleaveCount);
+    // Make sure MaxInterleaveCount is greater than 0 & a power of 2.
+    MaxInterleaveCount = llvm::bit_floor(std::max(1u, MaxInterleaveCount));
----------------
david-arm wrote:

Based purely on your algorithm, it looks like you're trying to minimise the amount of work we will do in the scalar epilogue. Is that right? If so, then I agree it makes sense to rework this code! Let's take a few situations as examples:

==Before==
TC=32,VF=16 -> Choose IC=2. No scalar tail.
TC=31,VF=16 -> Choose IC=1. 15 elements in scalar tail.
TC=33,VF=16 -> Choose IC=2. 1 element in scalar tail.
TC=48,VF=16 -> Choose IC=2. 16 elements in scalar tail.
TC=63,VF=16 -> Choose IC=2. 31 elements in scalar tail.

==After==
TC=32,VF=16 -> Choose IC=2. No scalar tail.
TC=31,VF=16 -> Choose IC=1. 15 elements in scalar tail.
TC=33,VF=16 -> Choose IC=1. 1 element in scalar tail.
TC=48,VF=16 -> Choose IC=2. 16 elements in scalar tail.
TC=63,VF=16 -> Choose IC=1. 15 elements in scalar tail.

So for these limited examples, it looks like you're trying to reduce the number of elements we process in the scalar tail for TC in the range 49-63. However, it's not solving the problem for TC=48,VF=16 where you actually have an opportunity to completely eliminate the tail by choosing IC=1.
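
For reference, the two tables above can be reproduced with a small standalone model. This is only a sketch, not the vectoriser code: the names `floorPow2`, `icBefore`, `icAfter`, `scalarTail` and the `Cap` parameter (standing in for the target's MaxInterleaveCount) are made up for illustration. It assumes InterleaveSmallLoopScalarReduction is false, a fixed-width VF, no required scalar epilogue, and that the old code also ends up with a power-of-2 IC.

```cpp
#include <algorithm>
#include <cassert>

// Largest power of two <= X. Hypothetical stand-in for llvm::bit_floor.
unsigned floorPow2(unsigned X) {
  assert(X != 0 && "expected a non-zero value");
  unsigned P = 1;
  while (P <= X / 2)
    P *= 2;
  return P;
}

// Old rule: min(TC / VF, Cap), clamped to at least 1.
// Rounding down to a power of 2 here is an assumption made so that the
// result matches the "Before" table.
unsigned icBefore(unsigned TC, unsigned VF, unsigned Cap) {
  return floorPow2(std::max(1u, std::min(TC / VF, Cap)));
}

// New rule from the patch: divide by VF when TC is a multiple of VF,
// otherwise by 2 * VF, then clamp to >= 1 and round down to a power of 2.
unsigned icAfter(unsigned TC, unsigned VF, unsigned Cap) {
  unsigned Div = (TC % VF == 0) ? VF : VF * 2;
  return floorPow2(std::max(1u, std::min(TC / Div, Cap)));
}

// Elements left for the scalar tail when the vector loop steps by VF * IC.
unsigned scalarTail(unsigned TC, unsigned VF, unsigned IC) {
  assert(VF != 0 && IC != 0 && "step must be non-zero");
  return TC % (VF * IC);
}
```

With `Cap` set to 8, `icBefore(63, 16, 8)` gives 2 with a tail of 31, while `icAfter(63, 16, 8)` gives 1 with a tail of 15. Both rules give IC=2 and a tail of 16 for TC=48.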

This is just a suggestion, but if what you really care about is reducing work in the scalar epilogue you could just calculate the number of scalar iterations left over for each of these two candidates:

      MaxInterleaveCount =
          std::min(*BestKnownTC / VF.getKnownMinValue(), MaxInterleaveCount);

and

      MaxInterleaveCount =
          std::min(*BestKnownTC / (VF.getKnownMinValue() * 2), MaxInterleaveCount);


then choose the most efficient of the two?

For the TC=33,VF=16 case it's not obvious that IC=2 is any worse than IC=1, since both leave 1 element in the scalar tail. If anything, I'd expect IC=2 to be cheaper, with one fewer compare + branch?
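
To make the "compute both, then choose" suggestion above concrete, here is a rough standalone sketch. It is not a proposed patch: `clampIC`, `pickIC` and `Cap` (standing in for the target's MaxInterleaveCount) are hypothetical names, "most efficient" is read as "smallest scalar tail", and ties going to the larger IC is my own assumption, based on the compare + branch point above.

```cpp
#include <algorithm>
#include <cassert>

// Clamp a candidate to [1, Cap] and round down to a power of 2.
// Hypothetical helper; the loop plays the role of llvm::bit_floor.
unsigned clampIC(unsigned Candidate, unsigned Cap) {
  unsigned V = std::max(1u, std::min(Candidate, Cap));
  unsigned P = 1;
  while (P <= V / 2)
    P *= 2;
  return P;
}

// Evaluate both candidate interleave counts and keep the one that leaves
// fewer elements for the scalar tail. On a tie, prefer the larger IC
// (assumption: fewer vector-loop iterations means fewer compares/branches).
unsigned pickIC(unsigned TC, unsigned VF, unsigned Cap) {
  assert(VF != 0 && "VF must be non-zero");
  unsigned A = clampIC(TC / VF, Cap);
  unsigned B = clampIC(TC / (VF * 2), Cap);
  unsigned TailA = TC % (VF * A);
  unsigned TailB = TC % (VF * B);
  if (TailA != TailB)
    return TailA < TailB ? A : B;
  return std::max(A, B);
}
```

With `Cap` set to 8 and VF=16 this picks IC=1 for TC=48 (no tail), IC=1 for TC=63 (tail of 15) and IC=2 for TC=33 (tail of 1 either way).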

https://github.com/llvm/llvm-project/pull/73766