[PATCH] D101726: [LV] Account for tripcount when calculation vectorization profitability

Wed May 5 08:14:15 PDT 2021

bmahjour added inline comments.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:5894
+    // If we are folding the tail and the trip count is a known (possibly small)
+    // constant, account for that by rounding the count up to the number of
+    // known iterations.
----------------
dmgreen wrote:
> bmahjour wrote:
> > dmgreen wrote:
> > > dmgreen wrote:
> > > > bmahjour wrote:
> > > > > but the count is the scalar trip count while the width is the vectorization factor....it seems odd that this calculation combines the two concepts together. Also if the trip count is a large constant, wouldn't the ceiling cause the VF to be effectively ignored?
> > > > > 
> > > > > Perhaps you can achieve the desired outcome by adjusting the cost elsewhere (say in `expectedCost`) and preferably with a target hook.
> > > > I'm not sure I understand what you mean. Can you explain in more detail? This doesn't sound like something that should be target dependent.
> > > > 
> > > > Let say the trip count is 5. CostA will be the cost of a single iteration with a vector factor of A.Width. If VF is 4 then the total cost of the loop vectorized 4x becomes CostA*ceil(5/4). The rounding up is due to us folding the tail and so executing 2 iterations in this case [*]. If B is scalar then the total scalar cost will be CostB*ceil(5/1) = CostB*5.  You then compare the two final costs to find the smallest, it being expected to be the most profitable.
> > > > 
> > > > So long as we are working in int64_t and the tripcount/cost are int32, this generalizes for any known trip count without rounding/overflow, I believe.
> > > > 
> > > > ([*] I thought about adding the non-FoldTailByMasking case too, but there the total cost is more like CostVec*floor(TripCount/VF) + CostScalar*(TripCount%VF) + SomeOverhead, which is more difficult to work out correctly. The existing CostVec/VF is at least an approximation of that.)
> > > Reading this again, perhaps the confusion comes from the old 'ceil' not being some form a fp ceil, but ceil(A/B)?  It wasn't the most descriptive name for a function.
> > Yes, sorry I saw the lambda `ceil` and didn't notice that it was doing a divide. My main concern was that the costs were not being normalized by the VF for larger trip counts, but now I see that they are, it's just that the trip count is rounded up and multiplied on both sides of the inequality.
> > 
> > Regarding overflow, I believe it's possible because `InstructionCost` uses `int`...so you might want to change it to `int64_t`.
> > 
> > Regarding target dependency, I'm wondering whether the motivating example is concerned with throughput or latency. The CostA and CostB above are mostly modeling throughput, so if the concern is latency, don't we need to take it into account in a target dependent way?
> I would expect CostA to be positive and (relatively) small. MaxTripCount can easily be something close to UINTMAX. Width will obviously be a low power of 2. Unsigned felt like the natural type, but changing it to signed sounds fine too.
> 
> My motivating case is... concerned with reciprocal throughput.. or that is at least close enough to it. Costing instructions isn't always simple, even on simple architectures. I don't believe it's any different to the existing code below, even if they are both a bit of an approximation.
Sorry if I wasn't clear, but I wasn't asking to change the type of `RTCostA` and `RTCostB` to signed int64_t. I was asking to change the underlying cost type inside `InstructionCost` from `int` to `int64_t`. Without that change, it's possible to cause an overflow in expressions like `= CostA * ceil(...)`. The other possible fix would be to enforce an upper bound on the value of `MaxTripCount` where this heuristic is being applied, but I think that's less desirable.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D101726/new/

https://reviews.llvm.org/D101726