[PATCH] D42447: [LV][VPlan] Detect outer loops for explicit vectorization.

Fri Mar 9 01:44:00 PST 2018

rengolin added inline comments.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:263
+static cl::opt<bool> EnableVPlanNativePath(
+    "enable-vplan-native-path", cl::init(false), cl::Hidden,
+    cl::desc("Enable VPlan-native vectorization path with "
----------------
Right now, this is just enabling outer-loop. Are you planning on adding more functionality to the native part of VPlan before merging the inner loop vectoriser into it? I wouldn't recommend, as we really don't want two paths in parallel for too long.

I'd recommend this to just be called "vectorize-outerloop" or something.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:2319
+// LoopVectorizeHints). This will be fixed in the future when the native IR
+// representation for pragma 'omp simd' is introduced.
+static bool isExplicitVecOuterLoop(Loop *OuterLp,
----------------
dcaballe wrote:
> rengolin wrote:
> > Sorry, this is my fault, as both were done separately. We discussed adding one more metadata info to mean {"forced"/"hint"} but ended up never doing it. It should be simple to fix this, I think, just need to make sure we change all the tests correctly to what they're supposed to mean.
> No problem. As I mention, there are going to be changes regarding the representation of #pragma omp simd. I think that's the right time to address the problem.
Yup.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:2363
       V.push_back(&L);
-    return;
+      // TODO: Collect inner loops inside marked outer loops in case
+      // vectorization fails for the outer loop. Do not invoke
----------------
dcaballe wrote:
> rengolin wrote:
> > What's the complexity of this analysis? We'll be adding a lot of those, repeatedly, no?
> > 
> > If I understand correctly, `containsIrreducibleCFG` is not that simple, in addition to the traversal. Not calling it unnecessarily would be a nice thing to have up front.
> > 
> > How complex would it be to create the inherited attribute?
> > What's the complexity of this analysis? We'll be adding a lot of those, repeatedly, no? 
> 
> Sorry, I'm not sure I understand the question. If you mean the complexity of collecting nested loops given an outer loop, it would be linear on the number of nested loops. We wouldn't add the same loop multiple times. Each loop in the loop nest would be added and processed just once.
> 
> > If I understand correctly, containsIrreducibleCFG is not that simple, in addition to the traversal. Not calling it unnecessarily would be a nice thing to have up front. How complex would it be to create the inherited attribute?
> 
> Adding the attribute wouldn't be complicated. We would have a recursive call that processes loops from outer to inner. Once we know that the CFG of the outer loop is reducible, we can pass a flag through the recursive call to mark that nested loops are reducible and skip `containsIrreducibleCFG` for them. Does this make sense to you?
> 
> I had it implemented but there is a subtle detail that would make this patch non-NFC. If you look at line 8901, collected loops are processed in reverse order. This basically means that if we have:
> 
> ```
> #pragma clang loop vectorize
> for i
>   for j
>   
>   for k
> }
> ```
> 
> the current code would process loops k and j first. If one of them is vectorized, we couldn't vectorize 'i', the one marked for vectorization. I see two potential solutions:
> 1) Reverse the order in which loops are processed in line 8901. This is non-NFC and some existing LIT tests would have to be updated accordingly.
> 2) Collect loops at different loop nest levels in post-order and loops at the same level in pre-order. This would be the collection order for the previous example: j, k, i. This would be NFC for inner loops but I find it particularly weird.
> 
> IMO, option 1 seems the right approach but it's non-NFC and I wouldn't include it as part of this patch.
> What do you think?
> 
> 
> 
> 
> it would be linear on the number of nested loops. We wouldn't add the same loop multiple times. Each loop in the loop nest would be added and processed just once.

That's what I wanted to know, thanks. :)

> IMO, option 1 seems the right approach but it's non-NFC and I wouldn't include it as part of this patch.

Agreed.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:7650
+    // TODO: VPlan-based Predication.
+    // TODO: VPlan-based Cost Modeling.
+
----------------
dcaballe wrote:
> rengolin wrote:
> > is this really the place to do predication and cost modelling?
> This function returns the VF to be used during code generation so we would need to evaluate the cost here to return the selected VF. Cost modeling shouldn't be part of the VPlan building process. The same approach is followed in the code below. Regarding predication, it must happen before the cost evaluation. We can discuss if it should belong here or not when we introduce the actual code. If the comment is confusing, I can remove it. We decided to introduce the TODOs to give a better picture of the subsequent patches but if this is not helpful or annoying I can just get rid of all of them. Please, let me know what you think.
I don't like the idea of outer-loops being in a special branch of the code, but I understand the current prototype nature of it. 

I believe it's still not the time to define what goes where that hasn't been implemented yet, so better to remove the TODOs for now, in case they lead us astray in the future. Same for the debug messages, etc.

What you should do is shortly explain why outer-loop needs "special handling", and that can be a one/two line comment in the beginning of the block.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:8227
+    // VPlanHCFGBuilder HCFGBuilder(OrigLoop, LI, *Plan.get());
+    // HCFGBuilder.buildHierarchicalCFG();
+    DEBUG(dbgs() << "LV: TODO: Build hierarchical CFG for outer loop.\n");
----------------
dcaballe wrote:
> rengolin wrote:
> > Isn't `containsIrreducibleCFG` depending on this to work?
> They are different things. `containsIrreducibleCFG` is used at the very beginning of the pass to collect potential loop candidates (without irreducible CFG) to be vectorized. `VPlanHCFGBuilder` will build a CFG out of the input IR, using the VPlan infrastructure (VPBlockBases). This VPlan CFG will be modified during the vectorization process without actually modifying the CFG of the input IR. Changes in the VPlan CFG will be materialized once the best profitable VPlan is chosen. Is it clearer now?
> 
> The VPlanHCFGBuilder is going to be introduced in the next patch.
Right, as above, don't leave commented out code hanging. Feel free to add a two-line comment in the begining of the block explaining the expectation.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:8642
+
+     // TODO: It will be set to true when we introduce codegen support.
+     return false;
----------------
hsaito wrote:
> dcaballe wrote:
> > rengolin wrote:
> > > I'm really uncomfortable with all these temporary code blocks that don't do anything...
> > > 
> > > They're really just hijacking the existing infrastructure without implementing as a VPlan.
> > > 
> > > I really thought the whole point of VPlans was that you wouldn't need to hack-it-up like we used to in the old vectoriser...
> > This is the entrance to the VPlan-native vectorization path. It's not doing anything yet because we are trying to follow an incremental approach by releasing relatively small patches that are easy to digest. This code will be functional (generating vector code) soon.
> > 
> > The code block is temporary as long as both vectorization paths co-exist but the final goal is to converge into a single one. This approach will allow us to incrementally and easily extend all the current inner loop vectorization functionality to support outer loops and, most importantly, doing so without destabilizing inner loop vectorization. We are really concerned about the latter and we think that this approach is a reasonable trade-off between safety and temporary code blocks.
> > 
> > If you want to discuss this further, I would recommend to move the discussion to the RFC thread so that everybody is aware of it: http://lists.llvm.org/pipermail/llvm-dev/2017-December/119523.html
> > 
> I'm working on the "converge into a singe one" side. At this point, I'm taking care of the ground work of moving the right things to the right places such that I don't have to include those "almost NFC" things as part of "expand VPlan's participation into innermost loop vectorization". Thank you for helping me do that with your reviews. We need to be able to build VPlan for the innermost loop vectorization right after Legal, for example, before we can remove the diverged code path at the beginning. In the meantime, the outer loop vectorization patch series will help people realize how much common things are there between innermost loop vectorization and outer loop vectorization, and more importantly, help people think how to write code that can work in both ways. 
> That's as much as I want to write about the approach we are taking, within this patch review. The rest of the discussions should happen on the above mentioned RFC. Thanks.
Ok, as above, just remove the comments and add a two-line comment summarising it.

https://reviews.llvm.org/D42447