[PATCH] D30732: LoopVectorizer: let target limit memory intensive loops

Hal Finkel via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Mar 9 07:22:38 PST 2017


hfinkel added a comment.

In https://reviews.llvm.org/D30732#696561, @uweigand wrote:

> In https://reviews.llvm.org/D30732#696479, @hfinkel wrote:
>
> > > Please be very clear about which situation you're actually in. Let me explain:
> >
> > 1. Imagine a processor that has a special cache for loop instructions, and this loop cache can only hold N stores. In this case, it is really important that the loop have no more than N store instructions.
> > 2. Imagine a processor that cannot sustain one store/cycle because of limited bandwidth or other resources. Over N cycles, only M (< N) stores can be processed. This might or might not be related to the size of the store.
> >
> > >   I think you're in situation (2), where the size of the store does not matter (since you described some kind of tag resource). In that case, please don't count stores in the loop. You'll want to get the anticipated loop cost (which is the closest thing we have to a cycle count) for each VF and go from there. You might also limit the VF based on the number of stores that would appear in a row, because a sufficiently long run of back-to-back stores will always exceed your limit and stall something.
> >
> >   If you're in (1), then your limit probably makes sense only for loops that will fit in the relevant cache.
> >
> > >   In short, vectorized loops can be quite large and, whatever you're doing, just counting stores sounds like an insufficiently precise model.
>
>
> The specific issue is neither quite (1) nor (2) as described above, but rather the following: while stores are "in flight" during out-of-order execution, each store needs to be tagged with a store tag, which is used to make sure the out-of-order execution doesn't violate ordering constraints implied by the original instruction order.  Only a limited number of those store tags is available, and if this limit is exceeded, the processor may have to flush the pipeline.
>
> To avoid this, we want to ensure that at any point the number of stores "in flight" simultaneously does not exceed this limit.  The problem is that it is hard to predict *precisely* when each store is actually in flight; this depends on decisions of the out-of-order engine (e.g. when each instruction is issued) that are in general impossible to predict in the compiler.  So we have to use some heuristic that achieves the effect we want in most cases.
>
> When we originally looked into this issue, it was in the context of loop unrolling.  We wanted to unroll loops, in particular small loops, in order to increase the overall execution rate, e.g. by reducing the total number of branches (and thereby the pressure on the branch prediction subsystem).  However, in some cases we noted that the unrolled loop performed worse than the original one.  Investigation showed that, due to the higher overall execution rate, the number of stores simultaneously in flight also increased and now exceeded the limit of store tags, causing pipeline flushes.
>
> To fix this, we wanted to stop loop unrolling before that happened, and tried to design a heuristic to detect that case.  A heuristic modeled after your situation (2) didn't work well, since the ratio of stores to total instructions is mostly unaffected by unrolling; such a metric doesn't track the secondary effects that were actually causing the difference we saw (e.g. the prediction rate of the loop's backward branch).  So we ended up with a simple heuristic modeled after your situation (1): simply don't unroll a loop (any further) if the total number of stores in it exceeds a threshold.
>
> Now of course in a very large loop, the presence of more stores than this threshold would be unlikely to cause any problems, since those stores are spread over many more cycles; in that sense the heuristic is very conservative.  However, for the purpose of loop unrolling this was acceptable, since for large loops it generally doesn't matter anyway whether they're unrolled or not -- only for small loops is it really critical.
>
> However, I can see your point that for a *vectorization* decision, this last statement is no longer true; we certainly do want to vectorize large loops!  Jonas, I think this means we may indeed have to do something better than just counting the total number of stores in the loop ...   (As always with these heuristics, this probably means tuning the parameters via actual measurements.)


This makes a lot of sense to me, thanks for explaining!
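
To illustrate what I meant by starting from the anticipated loop cost: here's a minimal standalone sketch of how a target could pick the widest VF that keeps store pressure under its tag budget. This is not actual LLVM code; the candidate numbers, the store-tag limit, and the density threshold are all made up. It combines a hard cap on stores per iteration (your situation (1) style threshold) with a store-density check against the anticipated cost (situation (2) style):

  // Standalone sketch, not actual LLVM code; every name and constant
  // here is hypothetical.
  #include <cstdio>
  #include <vector>

  struct VFCandidate {
    unsigned VF;            // vectorization factor under consideration
    unsigned CostPerIter;   // anticipated loop cost, a rough cycles/iteration proxy
    unsigned StoresPerIter; // store instructions in the vectorized loop body
  };

  // Hypothetical target limit on stores "in flight" before the
  // store-tag resource is exhausted.
  constexpr unsigned StoreTagLimit = 8;

  unsigned pickVF(const std::vector<VFCandidate> &Candidates) {
    unsigned Best = 1;
    for (const VFCandidate &C : Candidates) {
      // A run of back-to-back stores longer than the tag budget will
      // always stall, so reject such candidates outright.
      if (C.StoresPerIter > StoreTagLimit)
        continue;
      // Crude density check (made-up threshold): reject candidates that
      // would average more than one store every two cycles.
      if (2 * C.StoresPerIter > C.CostPerIter)
        continue;
      if (C.VF > Best)
        Best = C.VF;
    }
    return Best;
  }

  int main() {
    // Wider VFs pack more stores into fewer cycles per element.
    std::vector<VFCandidate> Candidates = {
        {1, 16, 2}, {2, 18, 4}, {4, 20, 8}, {8, 24, 16}};
    std::printf("chosen VF = %u\n", pickVF(Candidates)); // prints 4 here
    return 0;
  }

The point is only that both the per-iteration store count and the anticipated cost should feed the decision, rather than counting stores alone; whether a density check like this or a hard cap on consecutive stores is the right fit would, as you say, have to be tuned against actual measurements.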


https://reviews.llvm.org/D30732