[PATCH] D57598: [VPLAN] Determine Vector Width programmatically.

Mon Feb 4 08:38:55 PST 2019

fpetrogalli marked 5 inline comments as done.
fpetrogalli added a comment.

@hsaito , @fhahn , @npanchen ,

Thank you for looking into this.

I have responded to all comments about the design. I will work on a new patch an submit it according to your suggestions. For now, I have ignored those comments on trivial changes (for example, @fhahn  comment on adding more test coverage). I will fix such comments and update them along the way with the re-implementation of the patch.

Thank you,

Francesco

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:1001
   InstWidening getWideningDecision(Instruction *I, unsigned VF) {
-    assert(VF >= 2 && "Expected VF >=2");
-
----------------
npanchen wrote:
> fhahn wrote:
> > Is this related to the patch? I suppose guessVectorizationFactor could set UserVF to 1, which could reach this code. It would be better to not vectorize with VF == 1.
> For VF = 1 CM is not called, thus this change looks as unrelated to this commit. If it does, what is a reason and why other places in CM are not modified ?
Yes, this is related to this patch. By guessing the number of lanes with the algorithm i proposed, all the  target independent invocations of `opt` that use VPLAN end up reporting vector registers of size 32-bit (`TTI->getRegisterBitWidth(true /* Vector*/)`), which means the vectorization performed in some of the tests are done with a vectorization factor of 1. I am not saying this is the right thing to do, I was actually hacking something that worked to initiate the discussion.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:1001
   InstWidening getWideningDecision(Instruction *I, unsigned VF) {
-    assert(VF >= 2 && "Expected VF >=2");
-
----------------
fpetrogalli wrote:
> npanchen wrote:
> > fhahn wrote:
> > > Is this related to the patch? I suppose guessVectorizationFactor could set UserVF to 1, which could reach this code. It would be better to not vectorize with VF == 1.
> > For VF = 1 CM is not called, thus this change looks as unrelated to this commit. If it does, what is a reason and why other places in CM are not modified ?
> Yes, this is related to this patch. By guessing the number of lanes with the algorithm i proposed, all the  target independent invocations of `opt` that use VPLAN end up reporting vector registers of size 32-bit (`TTI->getRegisterBitWidth(true /* Vector*/)`), which means the vectorization performed in some of the tests are done with a vectorization factor of 1. I am not saying this is the right thing to do, I was actually hacking something that worked to initiate the discussion.
This assertion was firing when testing `test/Transforms/LoopVectorize/explicit_outer_detection.ll`. Because the test is run with no explicit vector with (this patch removes such need) and without specifying a target for `opt`, the vectorizer was generating a vector loop with one lane, for the same reason explained in my previsous comment (no target implies `TTI->getRegisterBitWidth(true /* Vector*/)` returning 32).

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:3787
     // set at the end of vector code generation.
-    Type *VecTy =
-        (VF == 1) ? PN->getType() : VectorType::get(PN->getType(), VF);
+    Type *VecTy = VectorType::get(PN->getType(), VF);
     Value *VecPhi = Builder.CreatePHI(VecTy, PN->getNumOperands(), "vec.phi");
----------------
fhahn wrote:
> Related to the patch?
Yes. Again, this was needed because of an assertion firing when building a binary operator in the process of vectorizing `case2` in `test/Transforms/LoopVectorize/explicit_outer_detection.ll`. The loop in `case2` get's vectorized with this patch, with a vectorization factor of 1, but the code generation in the inner loop vectorizer is trying to build a binary operator between a scalar phi and a vector consisting of one lane, which is of course not possible. This change makes sure that the phi is generated not as a scalar but a one lane vector to prevent the failure when building the binary operator.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:7098

+unsigned guessVPlanVF(Loop &L, unsigned WidestVectorRegBits) {
+  unsigned Max = 1;
----------------
npanchen wrote:
> hsaito wrote:
> > This is in some sense reinventing the wheel, and I kind of imagine this can quickly get out of control if different people start trying to add different constraints of their interest.
> > 
> > Let's start talking about how we can get to the point of being able to use unified computeMaxVF(). For this patch, however, I'd be happy if we get to the point of being able to use unified getSmallestAndWidestTypes().
> Agree with Hideki. Ideally, this function has to return a range [MinVF, MaxVF], which can be used in buildVPlans() to minimize number of built VPlans.
> 
> Please, remember that target can have different vector registers.
Responding to both comments from @hsaito and @npanchen here:

 > This is in some sense reinventing the wheel

I fully agree :).  I needed a starting point to be able to discuss this with you.

> Let's start talking about how we can get to the point of being able to use unified computeMaxVF(). For this patch, however, I'd be happy if we get to the point of being able to use unified getSmallestAndWidestTypes().

OK, I will look into getting access to getSmallestAndWidestTypes here.

> Please, remember that target can have different vector registers.

Yes, I understand that using the widest ones is not ideal, as you miss for example 2-lane vectorization on floating point data on machines that support both 64-bit and 128-bit vector registers (for example `aarch64`).

Just to make sure we are on the same page here. This could be solved by considering all possible power of two vector widths up to the maximum one. For example, if the loop is processing 32-bit data, and the target has 128, 256, and 512-bit wide registers, we should ask  VPlan to consider 4, 8 and 16 lanes vectorizations. Is my interpretation correct here?

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:7103
+    for (auto I = B->begin(), E=B->end(); I!=E; ++I ) {
+      if (!isa<LoadInst>(*I) && !isa<StoreInst>(*I))
+        continue;
----------------
npanchen wrote:
> For a cases when load and store instructions were hoisted out of the loop, this function will return WidestVectorRegBits, which could be >= 128.
Sorry I don't understand this comment, could you please explain it with and example? Also, given that my understanding is that you want me to use `getSmallestAndWidestTypes`, is the comment still valid? Because I think that by using `getSmallestAndWidestTypes` I essentially will remove my custom code.

Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D57598/new/

https://reviews.llvm.org/D57598