[llvm-dev] [Proposal][RFC] Epilog loop vectorization

Mon Feb 27 07:27:35 PST 2017

On 02/27/2017 06:29 AM, Nema, Ashutosh wrote:
>
> Thanks for looking into this.
>
> 1) Issues with re running vectorizer:
>
> Vectorizer might generate redundant alias checks while vectorizing 
> epilog loop.
>
> Redundant alias checks are expensive, we like to reuse the results of 
> already computed alias checks.
>
> With metadata we can limit the width of epilog loop, but not sure 
> about reusing alias check result.
>
> Any thoughts on rerunning vectorizer with reusing the alias check result ?
>

One way of looking at this is: Reusing the alias-check result is really 
just a conditional propagation problem; if we don't already have an 
optimization that can combine these after the fact, then we should.

  -Hal

> 2) Best & worst case for epilog vectorization.
>
> Epilog vectorization incurs additional cost of checks which decides to 
> execute either epilog vector loop or scalar loop.
>
> These checks are:
>
> a)Min trip count check for epilog vector loop.
>
> b)Alias result check (only if alias check is generated for first 
> vector version)
>
> Where the epilog vector trip count is large epilog vectorization with 
> profitable width likely to give gain. Worst case probably after these 
> additional checks executing the scalar loop.
>
> 3) Benchmarking & results:
>
> We have observed 4% improvement in one of our internal benchmark.
>
> Tried CPU2006 and didn’t found any degrade & improvements.
>
> On code size increase I have not yet verified yet will check and get back.
>
> 4)  What should be the minimum first vector loop width to enable 
> epilog vectorization.
>
> This count should not be small as it may degrade the performance, with 
> my limited tests I have observed 16 is a point it shows gains with one 
> of our internal benchmark. This require more experiments & testing to 
> decide what should be the minimum width.
>
> 5) Unrolling issues:
>
> As Ayal mentioned with large unroll factor the next profitable 
> EpilogVF could be equal to VF.
>
> With the same reason the current patch enforces UF=1, as unrolling can 
> minimize the possibility of executing epilog vector loop.
>
> Example to understand the new layout:
>
> void foo (char *A, char *B, char *C, int len) {
>
>   int i = 0;
>
>   for (i=0 ; i< len; i++)
>
>     A[i] = B[i] + C[i];
>
> }
>
> This loop get vectorize with width 32 and epilog loop get vectorized 
> with width 8.
>
> a)Please check attached IR(test.ll)
>
> b)To understand how alias check result got used, check temporary 
> “acr.val” usage.
>
> Regards,
>
> Ashutosh
>
> *From:*Zaks, Ayal [mailto:ayal.zaks at intel.com]
> *Sent:* Sunday, February 26, 2017 10:53 PM
> *To:* Hal Finkel <hfinkel at anl.gov>; Adam Nemet <anemet at apple.com>; 
> Nema, Ashutosh <Ashutosh.Nema at amd.com>
> *Cc:* llvm-dev <llvm-dev at lists.llvm.org>
> *Subject:* RE: [llvm-dev] [Proposal][RFC] Epilog loop vectorization
>
> +1 for “just rerun the vectorizer” on the scalar remainder loop, as 
> the proposed decision process is broken down into “first determine 
> best VF for main loop, then determine best next EpilogVF for remainder 
> loop” anyhow:
>
>    const LoopVectorizationCostModel::VectorizationFactor EpilogVF =
>
>        CM.identifyNextProfitableVF(VF.Width);
>
> Raising some aspects:
>
> o The unroll factor may also affect the best EpilogVF. For instance, 
> if UF=1 then EpilogVF < VF, as the patch currently enforces. But if UF 
> is larger the next profitable EpilogVF could be equal to VF.
>
> o The scalar loop serves two purposes, as determined by its two 
> pre-headers: either as a default in case runtime dependence checks 
> fail, or as a remainder loop in which case it is known to be 
> vectorizable with trip count less-than VF*UF (or equal-to it*). Would 
> be good to keep this in mind when rerunning.
>
> (*) Note that if original loop requiresScalarEpilogue(), the trip 
> count of the remainder loop may be equal to VF*UF, and/or the 
> vectorized remainder loop may too require a scalar epilogue.
>
> o It should be interesting to see how a single masked iteration that 
> uses the original VF, say for UF=1, works. LV should hopefully already 
> support most of what’s needed.
>
> o The original Trip Count clearly affects the profitability of 
> vectorizing the remainder loop. Would be good to leverage any 
> information that can be derived about TC either statically or based on 
> profiling, when determining EpilogVF. Potential speedups and 
> overheads/slowdowns could possibly be demonstrated using extreme 
> cases; what would the best and worst cases be? Perhaps 
> TinyTripCountVectorThreshold is also affected?
>
> o Finally, VPlan is currently modeling how to vectorize the loop body 
> according to the potentially profitable VF’s. It’s modelling could be 
> used to generate vector code for both body and remainder, and to 
> consider their combined, overall cost.
>
> Ayal.
>
> *From:*llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] *On Behalf Of 
> *Hal Finkel via llvm-dev
> *Sent:* Friday, February 24, 2017 00:14
> *To:* Adam Nemet <anemet at apple.com <mailto:anemet at apple.com>>; Nema, 
> Ashutosh <Ashutosh.Nema at amd.com <mailto:Ashutosh.Nema at amd.com>>
> *Cc:* llvm-dev <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>>
> *Subject:* Re: [llvm-dev] [Proposal][RFC] Epilog loop vectorization
>
> On 02/22/2017 11:52 AM, Adam Nemet via llvm-dev wrote:
>
>     Hi Ashutosh,
>
>         On Feb 22, 2017, at 1:57 AM, Nema, Ashutosh via llvm-dev
>         <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
>
>         Hi,
>
>         This is a proposal about epilog loop vectorization.
>
>         Currently Loop Vectorizer inserts an epilogue loop for
>         handling loops that don’t have known iteration counts.
>
>         The Loop Vectorizer supports loops with an unknown trip count,
>         unknown trip count may not be a multiple of the vector width,
>         and the vectorizer has to execute the last few iterations as
>         scalar code. It keeps a scalar copy of the loop for the
>         remaining iterations.
>
>         Loop with the large width has a high possibility of executing
>         many scalar iterations.
>
>         i.e. i8 data type with 256bits target register can vectorize
>         with vector width 32, with that maximum trip count possibility
>         for scalar(epilog) loop is 31, which is significant & worth
>         vectorizing.
>
>         Large vector factor has following challenges:
>
>         1)Possibility of remainder iteration is substantial.
>
>         2)Actual trip count at runtime is substantial but not meeting
>         minimum trip count to execute vector loop.
>
>         These challenges can be addressed by mask instructions, but
>         these instructions are limited and may not be available to all
>         targets.
>
>         By epilog vectorization our aim to vectorize epilog loop where
>         original loop is vectorized with large vector factor and has a
>         high possibility of executing scalar iterations.
>
>         This require following changes:
>
>         1)Costing: Preserve all profitable vector factor.
>
>         2)Transform: Create an additional vector loop with next
>         profitable vector factor.
>
>     Is this something that you propose to be on by default for wide
>     VPU architectures without masking support? I.e. how widely is this
>     applicable? If not then perhaps a better strategy would be to just
>     annotate the remainder loop with some metadata to limit the
>     vectorization factor and just rerun the vectorizer.
>
>
> Why would this solution (annotating the remainder loop to limit 
> vectorization and rerunning the vectorization process) not be 
> preferred regardless?
>
> One issue that might be relevant here are runtime aliasing checks, 
> which are probably going to be redundant, or partially redundant, 
> between the different vectorized versions. Will we be able to do any 
> necessary cleanup after the fact? Moreover, we often don't hoist 
> (unswitch) these checks out of inner loops (perhaps because they're 
> inside the trip-count checks?); I wonder if the proposed block 
> structure will make this situation better or worse (or have no overall 
> effect).
>
> Thanks again,
> Hal
>
>     Adam
>
>         Please refer attached file (BlockLayout.png) for the details
>         about transformed block layout.
>
>         Patch is available at:https://reviews.llvm.org/D30247
>
>         Regards,
>
>         Ashutosh
>
>         <BlockLayout.png>_______________________________________________
>         LLVM Developers mailing list
>         llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
>         http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
>
>
>     _______________________________________________
>
>     LLVM Developers mailing list
>
>     llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
>
>     http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
> -- 
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
> ---------------------------------------------------------------------
> Intel Israel (74) Limited
>
> This e-mail and any attachments may contain confidential material for
> the sole use of the intended recipient(s). Any review or distribution
> by others is strictly prohibited. If you are not the intended
> recipient, please contact the sender and delete all copies.
>

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170227/ecbbe11f/attachment.html>