[PATCH] D57598: [VPLAN] Determine Vector Width programmatically.

Wed Mar 13 13:24:43 PDT 2019

fpetrogalli marked 16 inline comments as done.
fpetrogalli added a comment.

Hi all,

I finally addressed your comments and update the patch. Let me know what you think.

Francesco

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:1001
   InstWidening getWideningDecision(Instruction *I, unsigned VF) {
-    assert(VF >= 2 && "Expected VF >=2");
-
----------------
fpetrogalli wrote:
> fpetrogalli wrote:
> > npanchen wrote:
> > > fhahn wrote:
> > > > Is this related to the patch? I suppose guessVectorizationFactor could set UserVF to 1, which could reach this code. It would be better to not vectorize with VF == 1.
> > > For VF = 1 CM is not called, thus this change looks as unrelated to this commit. If it does, what is a reason and why other places in CM are not modified ?
> > Yes, this is related to this patch. By guessing the number of lanes with the algorithm i proposed, all the  target independent invocations of `opt` that use VPLAN end up reporting vector registers of size 32-bit (`TTI->getRegisterBitWidth(true /* Vector*/)`), which means the vectorization performed in some of the tests are done with a vectorization factor of 1. I am not saying this is the right thing to do, I was actually hacking something that worked to initiate the discussion.
> This assertion was firing when testing `test/Transforms/LoopVectorize/explicit_outer_detection.ll`. Because the test is run with no explicit vector with (this patch removes such need) and without specifying a target for `opt`, the vectorizer was generating a vector loop with one lane, for the same reason explained in my previsous comment (no target implies `TTI->getRegisterBitWidth(true /* Vector*/)` returning 32).
I restored this assertion.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:3787
     // set at the end of vector code generation.
-    Type *VecTy =
-        (VF == 1) ? PN->getType() : VectorType::get(PN->getType(), VF);
+    Type *VecTy = VectorType::get(PN->getType(), VF);
     Value *VecPhi = Builder.CreatePHI(VecTy, PN->getNumOperands(), "vec.phi");
----------------
fpetrogalli wrote:
> fhahn wrote:
> > Related to the patch?
> Yes. Again, this was needed because of an assertion firing when building a binary operator in the process of vectorizing `case2` in `test/Transforms/LoopVectorize/explicit_outer_detection.ll`. The loop in `case2` get's vectorized with this patch, with a vectorization factor of 1, but the code generation in the inner loop vectorizer is trying to build a binary operator between a scalar phi and a vector consisting of one lane, which is of course not possible. This change makes sure that the phi is generated not as a scalar but a one lane vector to prevent the failure when building the binary operator.
I restored this code.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:7098

+unsigned guessVPlanVF(Loop &L, unsigned WidestVectorRegBits) {
+  unsigned Max = 1;
----------------
fpetrogalli wrote:
> npanchen wrote:
> > hsaito wrote:
> > > This is in some sense reinventing the wheel, and I kind of imagine this can quickly get out of control if different people start trying to add different constraints of their interest.
> > > 
> > > Let's start talking about how we can get to the point of being able to use unified computeMaxVF(). For this patch, however, I'd be happy if we get to the point of being able to use unified getSmallestAndWidestTypes().
> > Agree with Hideki. Ideally, this function has to return a range [MinVF, MaxVF], which can be used in buildVPlans() to minimize number of built VPlans.
> > 
> > Please, remember that target can have different vector registers.
> Responding to both comments from @hsaito and @npanchen here:
> 
>  > This is in some sense reinventing the wheel
> 
> I fully agree :).  I needed a starting point to be able to discuss this with you.
> 
> > Let's start talking about how we can get to the point of being able to use unified computeMaxVF(). For this patch, however, I'd be happy if we get to the point of being able to use unified getSmallestAndWidestTypes().
> 
> OK, I will look into getting access to getSmallestAndWidestTypes here.
> 
> > Please, remember that target can have different vector registers.
> 
> Yes, I understand that using the widest ones is not ideal, as you miss for example 2-lane vectorization on floating point data on machines that support both 64-bit and 128-bit vector registers (for example `aarch64`).
> 
> Just to make sure we are on the same page here. This could be solved by considering all possible power of two vector widths up to the maximum one. For example, if the loop is processing 32-bit data, and the target has 128, 256, and 512-bit wide registers, we should ask  VPlan to consider 4, 8 and 16 lanes vectorizations. Is my interpretation correct here?
I have a version that does  [MinVF, MaxVF] computation, but this triggers the assertion in this method:

```
 void LoopVectorizationPlanner::setBestPlan(unsigned VF, unsigned UF) {
   LLVM_DEBUG(dbgs() << "Setting best plan to VF=" << VF << ", UF=" << UF
                     << '\n');
   BestVF = VF;
   BestUF = UF;

   erase_if(VPlans, [VF](const VPlanPtr &Plan) {
     return !Plan->hasVF(VF);
   });
   assert(VPlans.size() == 1 && "Best VF has not a single VPlan.");
 }```

Am I correct thinking that to be able to choose among different plans, we need to provide VPLAN with a cost model that is able to evaluate all the options? If that's the case, I think that generating Max and Min VF goes beyond the scope of this patch.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:7103
+    for (auto I = B->begin(), E=B->end(); I!=E; ++I ) {
+      if (!isa<LoadInst>(*I) && !isa<StoreInst>(*I))
+        continue;
----------------
npanchen wrote:
> fpetrogalli wrote:
> > npanchen wrote:
> > > For a cases when load and store instructions were hoisted out of the loop, this function will return WidestVectorRegBits, which could be >= 128.
> > Sorry I don't understand this comment, could you please explain it with and example? Also, given that my understanding is that you want me to use `getSmallestAndWidestTypes`, is the comment still valid? Because I think that by using `getSmallestAndWidestTypes` I essentially will remove my custom code.
> The example I was thinking about is pretty simple:
> 
> 
> ```
> #pragma clang loop vectorize(enable)
> for (i = 0; i < n; ++i) {
>   red += i;
> }
> 
> ```
> After some optimization the loop body can look like:
> 
> 
> ```
> loop_body:
> %1 = phi i64     %i_init,              %i
> %2 = phi i32     %red_init,          %red
> %red = %2 + %1
> %i = %1 + 1
> cmp %i, %n
> ```
> 
> In this case guessVPlanVF() will not find any LD/ST instruction, thus Max will be equal to 1 and WidestVectorRegBits will be returned. 
> For example, in case of SKX WidestVectorRegBits = 512.
The new version doesn't use my custom code but `getSmallestAndWidestTypes`. I could add a test that does the hoisting on outer loops, but I am not sure how to construct this example. Can you provide a C example to start with? Or are you happy for me to skip this check?

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:7106
+
+      Type *ScalarDataTy = getMemInstValueType(&(*I));
+      unsigned Size = ScalarDataTy->getPrimitiveSizeInBits();
----------------
npanchen wrote:
> getMemInstValueType(I)
This code is not used anymore in the last version.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:7141
+  if (!UserVF)
+    UserVF = guessVPlanVF(*L, TTI->getRegisterBitWidth(true /* Vector*/));
+
----------------
dcaballe wrote:
> Regarding the issue with VF = 1, we are using VF = 1 to denote no vectorization and I think we should preserve that behavior. This basically means we shouldn't try to generate vector code (shouldn't invoke `getWideningDecision` et al.) with VF = 1.
> 
> I think the problem happens because we are setting `UserVF` here, instead of `VF`, and probably `1` is an unexpected value for `UserVF` (just guessing, I haven't checked it out). Maybe we should leave `UserVF` to values actually coming from the user and set `VF` instead? If you move this code to `planInVPlanNativePath`, as suggested, I think the VF = 1 problem would be fixed.
The VF = 1 case is not generated anymore after the last change set, marking this comment as done.

================
Comment at: test/Transforms/LoopVectorize/outer_loop_test1_no_explicit_vect_width.ll:90
+  %0 = trunc i64 %indvars.iv21 to i32
+  store i32 %0, i32* %arrayidx, align 4
+  %1 = trunc i64 %indvars.iv21 to i32
----------------
fhahn wrote:
> Looks like we are just using a single store width in this test. Maybe it would be worth adding loads/stores to a different type as well?
The patch doesn't rely anymore on the load/store types, but uses `getSmallestAndWidestTypes`. Do you still want me to add such test, or can we trust `getSmallestAndWidestTypes` of doing the right job here?

Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D57598/new/

https://reviews.llvm.org/D57598