[PATCH] D101726: [LV] Account for tripcount when calculation vectorization profitability

Tue May 4 10:53:09 PDT 2021

bmahjour added inline comments.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:5894
+    // If we are folding the tail and the trip count is a known (possibly small)
+    // constant, account for that by rounding the count up to the number of
+    // known iterations.
----------------
dmgreen wrote:
> dmgreen wrote:
> > bmahjour wrote:
> > > but the count is the scalar trip count while the width is the vectorization factor....it seems odd that this calculation combines the two concepts together. Also if the trip count is a large constant, wouldn't the ceiling cause the VF to be effectively ignored?
> > > 
> > > Perhaps you can achieve the desired outcome by adjusting the cost elsewhere (say in `expectedCost`) and preferably with a target hook.
> > I'm not sure I understand what you mean. Can you explain in more detail? This doesn't sound like something that should be target dependent.
> > 
> > Let say the trip count is 5. CostA will be the cost of a single iteration with a vector factor of A.Width. If VF is 4 then the total cost of the loop vectorized 4x becomes CostA*ceil(5/4). The rounding up is due to us folding the tail and so executing 2 iterations in this case [*]. If B is scalar then the total scalar cost will be CostB*ceil(5/1) = CostB*5.  You then compare the two final costs to find the smallest, it being expected to be the most profitable.
> > 
> > So long as we are working in int64_t and the tripcount/cost are int32, this generalizes for any known trip count without rounding/overflow, I believe.
> > 
> > ([*] I thought about adding the non-FoldTailByMasking case too, but there the total cost is more like CostVec*floor(TripCount/VF) + CostScalar*(TripCount%VF) + SomeOverhead, which is more difficult to work out correctly. The existing CostVec/VF is at least an approximation of that.)
> Reading this again, perhaps the confusion comes from the old 'ceil' not being some form a fp ceil, but ceil(A/B)?  It wasn't the most descriptive name for a function.
Yes, sorry I saw the lambda `ceil` and didn't notice that it was doing a divide. My main concern was that the costs were not being normalized by the VF for larger trip counts, but now I see that they are, it's just that the trip count is rounded up and multiplied on both sides of the inequality.

Regarding overflow, I believe it's possible because `InstructionCost` uses `int`...so you might want to change it to `int64_t`.

Regarding target dependency, I'm wondering whether the motivating example is concerned with throughput or latency. The CostA and CostB above are mostly modeling throughput, so if the concern is latency, don't we need to take it into account in a target dependent way?

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D101726/new/

https://reviews.llvm.org/D101726