[PATCH] D132585: [VPlan] Add field to track if intrinsic should be used for call. (NFC)

Wed Aug 31 13:01:41 PDT 2022

Ayal accepted this revision.
Ayal added a comment.
This revision is now accepted and ready to land.

Thanks for addressing the comments. Looks good to me; adding a few last minor nits.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:8329
+        InstructionCost CallCost =
+            CM.getVectorCallCost(CI, VF, NeedToScalarize);
+        InstructionCost IntrinsicCost =
----------------
fhahn wrote:
> Ayal wrote:
> > fhahn wrote:
> > > Ayal wrote:
> > > > Avoid considering CallCost if NeedToScalarize is true?
> > > > 
> > > > Avoid getting decision and clamping Range if !ID, when a vector call can be used, e.g., w/o clamping Range (WillWiden)?
> > > > 
> > > > The compound decision for which (range of) VF's to use an intrinsic vs. call vs. neither should probably be retained instead of decomposing it into two independent clamps? Calls for better test coverage to make sure patch is indeed NFC.
> > > > Avoid considering CallCost if NeedToScalarize is true?
> > > 
> > > I am not sure if we need to handle this explicitly, as the cost comparison should either choose the vector intrinsic (if it is cheaper than the lib call which may get scalarized) or `CanUseVectorCall` will also be false.
> > > 
> > > > Avoid getting decision and clamping Range if !ID, when a vector call can be used, e.g., w/o clamping Range (WillWiden)?
> > > 
> > > Added a check, thanks!
> > > 
> > > > The compound decision for which (range of) VF's to use an intrinsic vs. call vs. neither should probably be retained instead of decomposing it into two independent clamps? Calls for better test coverage to make sure patch is indeed NFC.
> > > 
> > > I think we need to clamp both separately. Before, we could have VPlans where we either use lib functions or intrinsics for the same call for different VFs. Now we need to split them to track whether an intrinsic or libfunc should be used. I added a test case to show this: 005d1a8ff533
> > > 
> > > It should only change the debug output (VPlan printing) but not the generated code, so arguably this can be considered NFC (from the perspective of the generated code) or not.
> > > > The compound decision for which (range of) VF's to use an intrinsic vs. call vs. neither should probably be retained instead of decomposing it into two independent clamps? Calls for better test coverage to make sure patch is indeed NFC.
> > 
> > > I think we need to clamp both separately. Before, we could have VPlans where we either use lib functions or intrinsics for the same call for different VFs. Now we need to split them to track whether an intrinsic or libfunc should be used. I added a test case to show this: 005d1a8ff533
> > 
> > Hmm, getDecisionAndClampRange() works with boolean decisions rather than 3-way ones. May result in excessive clamping, which is ok albeit potentially conservative. E.g., say first VF=2 of range can make a vector call but next VF=4 cannot, where both can more efficiently make an intrinsic call, range would clamp after VF=2 needlessly.
> > 
> > One way to optimize the clamping is to figure out the compound decision for first VF of range and then getDecisionAndClampRange() accordingly - worth the hassle?
> > 
> > 
> > ```
> >       bool ScalarBetterThanVectorAtStart;
> >       InstructionCost CallCostAtStart =
> >                 CM.getVectorCallCost(CI, Range.Start, ScalarBetterThanVectorAtStart);
> >       bool IntrinsicBestAtStart = ID && CM.getVectorIntrinsicCost(CI, Range.Start) < CallCostAtStart;
> > 
> >       LoopVectorizationPlanner::getDecisionAndClampRange(
> >           [&](ElementCount VF) -> bool {
> >             bool ScalarBetterThanVectorAtVF;
> >             // Is it beneficial to perform intrinsic call compared to lib call?
> >             InstructionCost CallCostAtVF =
> >                 CM.getVectorCallCost(CI, VF, ScalarBetterThanVectorAtVF);
> >             bool IntrinsicBestAtVF = ID && CM.getVectorIntrinsicCost(CI, VF) < CallCostAtVF;
> >             return (IntrinsicBestAtStart == IntrinsicBestAtVF) &&
> >                        (IntrinsicBestAtStart || ScalarBetterThanVectorAtStart == ScalarBetterThanVectorAtVF);
> >           },
> >           Range);
> > ```
> > 
> > CM.getVectorCallCost() already compares vector call cost with scalar call cost, returning the cheaper along with an indicator which is it.
> > Perhaps worth extending this API to compare the three alternatives, returning the cheapest along with an indicator(s) which is it(?)
> > 
> > > It should only change the debug output (VPlan printing) but not the generated code, so arguably this can be considered NFC (from the perspective of the generated code) or not.
> > 
> Hm, I tried to restructure the code to make things a bit clearer.
> 
> If we can use an intrinsic call, clamp the decision to the range of intrinsic calls and return the recipe. If the intrinsic call is profitable at the start, we clamp the range until it becomes unprofitable. If it is not profitable at the beginning, we should clamp the range until it becomes profitable.
> 
> If it is not profitable to use an intrinsic call at the start, it must be profitable to use a lib call. Now clamp to the range until lib calls are not profitable.
> 
> I *think* that should avoid excessive clamping in most cases in practice and the code seems easier to follow. WDYT?
> Hm, I tried to restructure the code to make things a bit clearer.
>
> If we can use an intrinsic call, clamp the decision to the range of intrinsic calls and return the recipe. If the intrinsic call is profitable at the start, we clamp the range until it becomes unprofitable. If it is not profitable at the beginning, we should clamp the range until it becomes profitable.

Agreed! "profitable" here means "most profitable/best", i.e., better than scalarizing and better than calling a vector library function.

> If it is not profitable to use an intrinsic call at the start, it must be profitable to use a lib call. Now clamp to the range until lib calls are not profitable.

It is also possible that scalarizing is most profitable at start. In any case it's indeed fine to now clamp based on the better between scalarizing and using a lib call (which is best, i.e., also better than using an intrinsic), as done below.
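
To make the three-way choice concrete, here is a minimal standalone sketch of picking the best of scalarizing, a vector library call, and a vector intrinsic. It is not the patch's code: `CallLowering`, `CallCosts`, and `pickCheapest` are hypothetical names, costs are plain integers rather than `InstructionCost`, and breaking ties in favor of the earlier alternative is an arbitrary assumption of the sketch.

```cpp
#include <cassert>

// Hypothetical three-way outcome for widening a call at a given VF.
enum class CallLowering { Scalarize, VectorLibCall, VectorIntrinsic };

// Hypothetical per-VF costs; the flags say whether an alternative exists.
struct CallCosts {
  unsigned Scalarized;
  unsigned VectorLibCall;
  unsigned VectorIntrinsic;
  bool HasVectorLibCall;
  bool HasVectorIntrinsic;
};

// Returns the cheapest available alternative. Scalarizing is always
// available; on ties the earlier alternative wins (sketch assumption).
CallLowering pickCheapest(const CallCosts &C) {
  CallLowering Best = CallLowering::Scalarize;
  unsigned BestCost = C.Scalarized;
  if (C.HasVectorLibCall && C.VectorLibCall < BestCost) {
    Best = CallLowering::VectorLibCall;
    BestCost = C.VectorLibCall;
  }
  if (C.HasVectorIntrinsic && C.VectorIntrinsic < BestCost) {
    Best = CallLowering::VectorIntrinsic;
    BestCost = C.VectorIntrinsic;
  }
  return Best;
}
```

With costs of 10 (scalarized), 8 (lib call) and 4 (intrinsic), the sketch picks the intrinsic; if no intrinsic is available it falls back to the lib call, and if scalarizing is cheapest it picks scalarizing.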

> I *think* that should avoid excessive clamping in most cases in practice and the code seems easier to follow. WDYT?

Agreed, excessive clamping is avoided and code is clearer, LGTM!
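
For reference, a simplified standalone model of the clamping concern discussed above. It assumes plain `unsigned` power-of-two VFs instead of `ElementCount`; `VFRangeModel`, `decideAndClamp`, and the two predicates are hypothetical stand-ins, not the LLVM API. The predicates encode the earlier example: an intrinsic is best at every VF, but a vector library call is only possible at VF=2.

```cpp
#include <cassert>
#include <functional>

// Simplified stand-in for a VF range [Start, End) over power-of-two VFs.
struct VFRangeModel {
  unsigned Start;
  unsigned End; // exclusive
};

// Models boolean clamping: evaluate Predicate at Range.Start, then shrink
// Range.End to the first VF whose answer differs from the one at Start.
bool decideAndClamp(const std::function<bool(unsigned)> &Predicate,
                    VFRangeModel &Range) {
  assert(Range.Start < Range.End && "empty VF range");
  bool AtStart = Predicate(Range.Start);
  for (unsigned VF = Range.Start * 2; VF < Range.End; VF *= 2) {
    if (Predicate(VF) != AtStart) {
      Range.End = VF;
      break;
    }
  }
  return AtStart;
}

// Hypothetical per-VF facts for the example.
bool intrinsicIsBest(unsigned /*VF*/) { return true; }
bool canUseVectorCall(unsigned VF) { return VF == 2; }

// Two independent clamps: the lib-call clamp cuts the range at VF=4 even
// though the intrinsic would be used for the whole range.
unsigned endAfterIndependentClamps() {
  VFRangeModel Range{2, 8};
  decideAndClamp(canUseVectorCall, Range);
  decideAndClamp(intrinsicIsBest, Range);
  return Range.End;
}

// Intrinsic decision first, returning early when it applies: the lib-call
// predicate is never consulted, so the range is not clamped needlessly.
unsigned endAfterIntrinsicFirst() {
  VFRangeModel Range{2, 8};
  if (decideAndClamp(intrinsicIsBest, Range))
    return Range.End;
  decideAndClamp(canUseVectorCall, Range);
  return Range.End;
}
```

In this model the independent clamps shrink the range {2, 4} to just VF=2, while deciding on the intrinsic first keeps both VFs in one range.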

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:8325
+                Range);
+  if (ShouldUseVectorIntrinsic) {
+    return new VPWidenCallRecipe(*CI, make_range(Ops.begin(), Ops.end()), ID);
----------------
nits: can drop the `Should` prefix from the variable name, and the braces `{}` around the single-statement body.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:8333
+        // The following case may be scalarized depending on the VF.
+        // The flag shows whether we can use a usual Call for vectorized
+        // version of the instruction.
----------------
Maybe the following:
```
// The flag shows whether it is better to scalarize the call than to call a vectorized version of the function.
```
is a bit more accurate?

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:8340
+      Range);
+  if (ShouldUseVectorCall) {
+    return new VPWidenCallRecipe(*CI, make_range(Ops.begin(), Ops.end()),
----------------
nits: can drop the `Should` prefix from the variable name, and the braces `{}` around the single-statement body.

================
Comment at: llvm/lib/Transforms/Vectorize/VPlan.h:949
 class VPWidenCallRecipe : public VPRecipeBase, public VPValue {
+  Intrinsic::ID VectorIntrinsicID;

----------------
nit: comment that not_intrinsic/false indicates that a library call is used instead of an intrinsic.
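
For illustration only, such a comment could look like the one in the standalone sketch below. `IntrinsicID`, `NotIntrinsic`, and `WidenCallRecipeModel` are hypothetical stand-ins for `Intrinsic::ID`, `Intrinsic::not_intrinsic`, and `VPWidenCallRecipe`; the accessor `usesIntrinsic()` is also an assumption of the sketch, not part of the patch.

```cpp
#include <cassert>

// Stand-in for llvm::Intrinsic::ID; 0 plays the role of
// Intrinsic::not_intrinsic in this sketch.
using IntrinsicID = unsigned;
constexpr IntrinsicID NotIntrinsic = 0;

// Minimal model of a widened-call recipe that records how the call is widened.
class WidenCallRecipeModel {
  /// ID of the vector intrinsic to call when widening the call. If set to
  /// NotIntrinsic, a vector library function is called instead.
  IntrinsicID VectorIntrinsicID;

public:
  explicit WidenCallRecipeModel(IntrinsicID ID = NotIntrinsic)
      : VectorIntrinsicID(ID) {}

  // True if the recipe widens the call using an intrinsic rather than a
  // library function (hypothetical helper for the sketch).
  bool usesIntrinsic() const { return VectorIntrinsicID != NotIntrinsic; }
};
```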

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D132585/new/

https://reviews.llvm.org/D132585