[llvm-dev] [Proposal][RFC] Epilog loop vectorization

Wed Mar 15 03:55:06 PDT 2017

From: Zaks, Ayal [mailto:ayal.zaks at intel.com]
Sent: Wednesday, March 15, 2017 4:39 AM
To: Nema, Ashutosh <Ashutosh.Nema at amd.com>; anemet at apple.com; Hal Finkel <hfinkel at anl.gov>; Renato Golin <renato.golin at linaro.org>; mkuper at google.com; Mehdi Amini <mehdi.amini at apple.com>; Daniel Berlin <dberlin at dberlin.org>
Cc: llvm-dev <llvm-dev at lists.llvm.org>
Subject: RE: [llvm-dev] [Proposal][RFC] Epilog loop vectorization

From: Nema, Ashutosh [mailto:Ashutosh.Nema at amd.com]
Summarizing the discussion on the implementation approaches.

Two approaches were discussed: the first runs ‘InnerLoopVectorizer’ again on the epilog loop immediately after vectorizing the original loop, within the same vectorization pass; the second re-runs the vectorization pass and limits the vectorization factor of the epilog loop via metadata.

<Approach-2>
Challenges with re-running the vectorizer pass:

1) Reusing alias check result:

When the vectorizer pass runs again, it sees the epilog loop as a new loop and may generate a new alias check; this new check may cancel out the gains of epilog vectorization.

We should reuse the already computed alias check result instead of recomputing it.
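To make the reuse concrete, here is a minimal source-level sketch, not the vectorizer's actual output. It assumes VF = 8 for the main loop and VF = 4 for the epilog; `rangesOverlap` and `doubleInto` are hypothetical names, and the inner lane loops only stand in for single vector instructions. The point is that the epilog vector loop sits on the same "no alias" path as the main vector loop, so one check guards both.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical runtime overlap test on two int ranges of length n.
static bool rangesOverlap(const int *a, const int *b, std::size_t n) {
  const auto aLo = reinterpret_cast<std::uintptr_t>(a);
  const auto bLo = reinterpret_cast<std::uintptr_t>(b);
  const std::uintptr_t bytes = n * sizeof(int);
  return aLo < bLo + bytes && bLo < aLo + bytes;
}

// Computes dst[i] = 2 * src[i] and returns how many alias checks were executed.
int doubleInto(const int *src, int *dst, std::size_t n) {
  constexpr std::size_t VF = 8, EpilogVF = 4;  // assumed widths
  int checksExecuted = 0;
  std::size_t i = 0;

  ++checksExecuted;  // the single alias check, computed once
  const bool noAlias = !rangesOverlap(src, dst, n);
  if (noAlias) {
    // Main "vector" loop: each inner loop models one VF-wide operation.
    for (; i + VF <= n; i += VF)
      for (std::size_t lane = 0; lane < VF; ++lane)
        dst[i + lane] = src[i + lane] * 2;
    // Epilog "vector" loop: reached only on the no-alias path,
    // so it reuses the result above instead of re-checking.
    for (; i + EpilogVF <= n; i += EpilogVF)
      for (std::size_t lane = 0; lane < EpilogVF; ++lane)
        dst[i + lane] = src[i + lane] * 2;
  }
  // Scalar loop: handles the remainder, or everything if the ranges overlap.
  for (; i < n; ++i)
    dst[i] = src[i] * 2;
  return checksExecuted;
}
```

If the epilog were instead treated as an independent loop by a second pass, it would need its own guard, which is the extra cost described above.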

Right, can this challenge be addressed – can we record the “simple” fact that the epilog loop is vectorizable with trip count at most VF*UF when reached from the vectorized loop? This is akin to passing similar information from the front-end when supplied by, e.g., OpenMP pragmas, with the additional path-sensitive context attached.

I did not fully get this point. Yes, we can record the maximum width for epilog vectorization, but what do you mean by “path-sensitive context attached”?
Please elaborate on this and on how it helps in reusing the alias check result.

Agreed, if each loop is handled independently, the multiple minimum-trip-count tests should be revisited to optimize for smallest trip-count first.

If the main loop was vectorized by VF and unrolled by UF>1, it may be reasonable to vectorize the remainder loop with the same VF (w/o unrolling).

And then possibly vectorize the remainder of that with a smaller, say, VF/2. In addition, situations having small types and large vectors may result in large VF, again leaving room for possibly repeated epilog vectorizations with decreasing VFs. At some point it would be good to try the alternative of a (final) masked vector epilog.

Each vector version incurs extra cost by adding extra checks; considering this, I have limited the patch to generate only one epilog vector version.

We can generate multiple epilog versions, but we have to understand the tradeoff of generating them. Once we have proper costing of the checks, we can make more precise decisions. I would like to defer this to later enhancements.

Masked instructions are available in AVX512, and of course they are a better solution than this. But for architectures that do not support masked instructions, an epilog vector version is one technique for vectorizing the epilog iterations.

Ayal.
