[PATCH] D19431: [LoopDist] Add llvm.loop.distribute.enable loop metadata

Adam Nemet via llvm-commits llvm-commits at lists.llvm.org
Tue Apr 26 15:01:44 PDT 2016


anemet added a comment.

In http://reviews.llvm.org/D19431#412744, @hfinkel wrote:

> In http://reviews.llvm.org/D19431#412519, @anemet wrote:
>
> > In http://reviews.llvm.org/D19431#412282, @hfinkel wrote:
> >
> > > I have no problem with adding metadata to control this, just as we do with the vectorizer. However:
> > >
> > > 1. What is preventing us from enabling this pass by default?
> >
> >
> > The big piece is profitability.  We need to prove that no ILP or MLP is impeded by distribution.  Without this we can actually regress performance even if we vectorize some of the resulting loops (in my DevMeeting talk, the hmmer loop in SPECint without LoopLoadElimination is an example of the ILP case).  I have some ideas for MLP (essentially, to prove that the loop is HW-prefetcher friendly, so misses should occur the same before and after distribution).
>
>
> Can you describe this in greater detail? There are definitely cases where we want to split loops based on available hardware prefetcher resource exhaustion, but I suspect that's a separate matter. Are you concerned here with finding likely high-latency loads and making sure we can hide them (to the extent possible) with other work?


Yes, that, or overlapping misses from the different loads.  If you split the missing loads into separate loops, they are now all exposed, whereas originally some were executed in the shadow of other misses.

Are you also asking about the ILP case?  I can quickly describe it here if you haven't seen the slides or the talk.

> > ILP is trickier because I would have to estimate the length of the critical path and see if we still have enough independent instructions in the resulting loop to avoid exposing the critical path by much.   This may again be easier for the simple cases, but for hmmer, for example, this needs to be fairly accurate.  Obviously we can start with the simple model initially.

> This all makes sense. I suspect that we should really be doing the same thing for the vectorizer's interleaving.

> > We also need to ensure that the loop without the unsafe deps will be vectorized at the end.  I.e. factor out the legality and profitability checks from the vectorizer.  (There are some other efforts in this area, so this may happen independently.)

> Makes sense.

> > And of course independently from all this the user may still want to force distribution because it's known to be profitable.  So even if the pass is on by default this feature in some form is still necessary.

> I agree.

> > > 2. Do we actually need to key it this way? Given that this metadata does not force anything, and is intended to enable vectorization, why not just key this off of !{!"llvm.loop.vectorize.enable", i1 1}. That way, should the user explicitly request vectorization, and we detect that this won't be possible, we try extra hard to provide it? That is, is there a use case where a user might want #pragma clang loop distribute(enable) without also adding vectorize(enable)?

> > That's an interesting idea, but I think in the end we also want to give the user full control to know what transformations are taking place.  Imagine we had many transformations that could turn a loop from non-vectorizable into vectorizable (e.g. peeling, data-layout modifications, etc.); the user may want to choose the particular transformation rather than just "vectorize at any cost".

> > Also, as I mentioned under your first point, we don't actually know whether we are going to vectorize the resulting loop, so tying this to vectorize.enable can be misleading.  I think at this point I am more comfortable going with the more low-level control for distribution.  What do you think?

> For the vectorizer, we have a much higher limit on runtime checks when the user explicitly requests vectorization. Should we do the same here? [this is the only question here actually pertinent to this change; the other discussion can be moved elsewhere as desired].


Currently there is no limit on the number of memchecks we emit for loop distribution.  That said, I certainly agree that when we add the threshold, we should drastically increase it with the pragma, just like we do for the vectorizer.


http://reviews.llvm.org/D19431
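For reference, the IR shape of the new metadata presumably mirrors the existing llvm.loop.vectorize.enable convention; this is a sketch based on that convention, not copied from the patch diff:

```llvm
; Loop latch branch carrying the loop-ID metadata:
br i1 %exitcond, label %for.end, label %for.body, !llvm.loop !0

!0 = distinct !{!0, !1}
!1 = !{!"llvm.loop.distribute.enable", i1 true}
```

On the source side this would be requested with #pragma clang loop distribute(enable) on the loop, analogous to vectorize(enable).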





More information about the llvm-commits mailing list