[llvm-dev] [Proposal][RFC] Epilog loop vectorization
Hal Finkel via llvm-dev
llvm-dev at lists.llvm.org
Mon Feb 27 07:27:35 PST 2017
On 02/27/2017 06:29 AM, Nema, Ashutosh wrote:
>
> Thanks for looking into this.
>
> 1) Issues with re running vectorizer:
>
> The vectorizer might generate redundant alias checks while vectorizing
> the epilog loop.
>
> Redundant alias checks are expensive, so we would like to reuse the
> results of the already-computed alias checks.
>
> With metadata we can limit the width of the epilog loop, but I am not
> sure about reusing the alias-check result.
>
> Any thoughts on rerunning the vectorizer while reusing the alias-check result?
>
One way of looking at this is: Reusing the alias-check result is really
just a conditional propagation problem; if we don't already have an
optimization that can combine these after the fact, then we should.
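
As a purely hypothetical illustration (the shapes below are invented for
this example, not taken from the patch), the redundancy such a cleanup
would target looks roughly like:

  #include <cstddef>

  // Schematic: the epilog loop re-emits the same runtime alias check that
  // already guards the main vector loop.
  void shape(char *A, char *B, std::size_t N) {
    bool MainNoAlias = (A + N <= B) || (B + N <= A);   // check for the main vector loop
    if (MainNoAlias) {
      // main vector loop, VF = 32
    }
    bool EpilogNoAlias = (A + N <= B) || (B + N <= A); // identical check for the epilog loop
    if (EpilogNoAlias) {
      // epilog vector loop, VF = 8
    }
  }

GVN can rewrite EpilogNoAlias in terms of MainNoAlias, and if the epilog
guard is only reachable when MainNoAlias held, condition propagation can
fold the second branch away entirely.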
-Hal
> 2) Best & worst case for epilog vectorization.
>
> Epilog vectorization incurs the additional cost of the checks that
> decide whether to execute the epilog vector loop or the scalar loop.
>
> These checks are:
>
> a) Minimum trip-count check for the epilog vector loop.
>
> b) Alias-result check (only if an alias check was generated for the
> first vector version).
>
> When the epilog trip count is large, epilog vectorization with a
> profitable width is likely to give a gain. The worst case is probably
> paying for these additional checks and then executing the scalar loop
> anyway.
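>
> As a purely illustrative calculation (assuming VF = 32 and EpilogVF = 8,
> as in the example further below):
>
>   constexpr unsigned VF = 32, EpilogVF = 8;
>   // Good case: trip count 127.
>   constexpr unsigned GoodRem    = 127 % VF;           // 31 leftover iterations become
>   constexpr unsigned GoodEpilog = GoodRem / EpilogVF; // 3 epilog vector iterations (24 elements)
>   constexpr unsigned GoodScalar = GoodRem % EpilogVF; // plus only 7 scalar iterations.
>   // Worst case: trip count 101.
>   constexpr unsigned BadRem     = 101 % VF;           // 5 leftover iterations, below EpilogVF:
>                                                       // we pay the min trip-count check (and the
>                                                       // alias-result check) and still run all 5
>                                                       // iterations in the scalar loop.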
>
> 3) Benchmarking & results:
>
> We have observed a 4% improvement in one of our internal benchmarks.
>
> We tried CPU2006 and did not see any degradations or improvements.
>
> I have not yet measured the code-size increase; I will check and get back.
>
> 4) What should the minimum first-vector-loop width be to enable
> epilog vectorization?
>
> This width should not be too small, as that may degrade performance.
> In my limited tests, 16 is the point at which it shows gains in one
> of our internal benchmarks. More experiments and testing are required
> to decide what the minimum width should be.
>
> 5) Unrolling issues:
>
> As Ayal mentioned, with a large unroll factor the next profitable
> EpilogVF could be equal to VF.
>
> For the same reason the current patch enforces UF=1, since unrolling
> reduces the likelihood of executing the epilog vector loop.
>
> Example to understand the new layout:
>
> void foo(char *A, char *B, char *C, int len) {
>   int i = 0;
>   for (i = 0; i < len; i++)
>     A[i] = B[i] + C[i];
> }
>
> This loop gets vectorized with width 32 and the epilog loop gets
> vectorized with width 8.
>
> a) Please check the attached IR (test.ll).
>
> b) To understand how the alias-check result is reused, check the
> usage of the temporary “acr.val”.
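>
> In rough textual form the new layout is (schematic, paraphrased from the
> description in this thread; “acr.val” follows the attached test.ll, and
> the exact CFG is in BlockLayout.png):
>
>   // if (len >= 32) {                            // min trip count, main vector loop
>   //   acr.val = runtime alias checks on A, B, C // computed once
>   //   if (acr.val)
>   //     main vector loop, VF = 32
>   // }
>   // if (remaining iterations >= 8 && acr.val)   // epilog guard reuses acr.val
>   //   epilog vector loop, VF = 8
>   // scalar remainder loop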
>
> Regards,
>
> Ashutosh
>
> *From:*Zaks, Ayal [mailto:ayal.zaks at intel.com]
> *Sent:* Sunday, February 26, 2017 10:53 PM
> *To:* Hal Finkel <hfinkel at anl.gov>; Adam Nemet <anemet at apple.com>;
> Nema, Ashutosh <Ashutosh.Nema at amd.com>
> *Cc:* llvm-dev <llvm-dev at lists.llvm.org>
> *Subject:* RE: [llvm-dev] [Proposal][RFC] Epilog loop vectorization
>
> +1 for “just rerun the vectorizer” on the scalar remainder loop, as
> the proposed decision process is broken down into “first determine
> best VF for main loop, then determine best next EpilogVF for remainder
> loop” anyhow:
>
> const LoopVectorizationCostModel::VectorizationFactor EpilogVF =
>     CM.identifyNextProfitableVF(VF.Width);
>
> Raising some aspects:
>
> o The unroll factor may also affect the best EpilogVF. For instance,
> if UF=1 then EpilogVF < VF, as the patch currently enforces. But if UF
> is larger the next profitable EpilogVF could be equal to VF.
>
> o The scalar loop serves two purposes, as determined by its two
> pre-headers: either as a default in case runtime dependence checks
> fail, or as a remainder loop, in which case it is known to be
> vectorizable with a trip count less than VF*UF (or equal to it*). It
> would be good to keep this in mind when rerunning.
>
> (*) Note that if the original loop requiresScalarEpilogue(), the trip
> count of the remainder loop may be equal to VF*UF, and/or the
> vectorized remainder loop may itself require a scalar epilogue.
>
> o It should be interesting to see how a single masked iteration that
> uses the original VF, say for UF=1, works. LV should hopefully already
> support most of what’s needed.
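>
> As a scalar emulation of what one such masked iteration would do, using
> the foo() loop shown above and assuming VF = 32, UF = 1 (illustrative
> only, not generated code):
>
>   void foo_masked_remainder(char *A, char *B, char *C, int len) {
>     int done = len & ~31;                  // elements already covered by the VF=32 main loop
>     for (int lane = 0; lane < 32; ++lane)  // one vector iteration, predicated per lane
>       if (done + lane < len)               // lane mask: active only for leftover elements
>         A[done + lane] = B[done + lane] + C[done + lane];
>   }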
>
> o The original Trip Count clearly affects the profitability of
> vectorizing the remainder loop. Would be good to leverage any
> information that can be derived about TC either statically or based on
> profiling, when determining EpilogVF. Potential speedups and
> overheads/slowdowns could possibly be demonstrated using extreme
> cases; what would the best and worst cases be? Perhaps
> TinyTripCountVectorThreshold is also affected?
>
> o Finally, VPlan currently models how to vectorize the loop body
> according to the potentially profitable VF’s. Its modeling could be
> used to generate vector code for both body and remainder, and to
> consider their combined, overall cost.
>
> Ayal.
>
> *From:*llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] *On Behalf Of
> *Hal Finkel via llvm-dev
> *Sent:* Friday, February 24, 2017 00:14
> *To:* Adam Nemet <anemet at apple.com <mailto:anemet at apple.com>>; Nema,
> Ashutosh <Ashutosh.Nema at amd.com <mailto:Ashutosh.Nema at amd.com>>
> *Cc:* llvm-dev <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>>
> *Subject:* Re: [llvm-dev] [Proposal][RFC] Epilog loop vectorization
>
> On 02/22/2017 11:52 AM, Adam Nemet via llvm-dev wrote:
>
> Hi Ashutosh,
>
> On Feb 22, 2017, at 1:57 AM, Nema, Ashutosh via llvm-dev
> <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
>
> Hi,
>
> This is a proposal about epilog loop vectorization.
>
> Currently the Loop Vectorizer inserts an epilogue loop to handle
> loops that don’t have a known iteration count.
>
> The Loop Vectorizer supports loops with an unknown trip count; the
> trip count may not be a multiple of the vector width, so the
> vectorizer has to execute the last few iterations as scalar code. It
> keeps a scalar copy of the loop for these remaining iterations.
>
> A loop vectorized with a large width has a high possibility of
> executing many scalar iterations.
>
> For example, an i8 data type with a 256-bit target register can be
> vectorized with vector width 32; the maximum possible trip count of
> the scalar (epilog) loop is then 31, which is significant and worth
> vectorizing.
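>
> A skeleton of what the vectorizer emits today in that situation
> (bodies elided, purely illustrative):
>
>   void epilog_shape(int n) {
>     int i = 0;
>     for (; i + 32 <= n; i += 32) {
>       // vector body: processes 32 i8 elements per iteration
>     }
>     for (; i < n; i++) {
>       // scalar epilog: up to 31 leftover iterations -- the loop this
>       // proposal aims to vectorize again at a smaller width
>     }
>   }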
>
> A large vector factor has the following challenges:
>
> 1) The possibility of remainder iterations is substantial.
>
> 2) The actual trip count at runtime is substantial but does not meet
> the minimum trip count required to execute the vector loop.
>
> These challenges can be addressed with masked instructions, but such
> instructions are limited and may not be available on all targets.
>
> With epilog vectorization, our aim is to vectorize the epilog loop
> when the original loop is vectorized with a large vector factor and
> has a high possibility of executing scalar iterations.
>
> This requires the following changes:
>
> 1) Costing: preserve all profitable vector factors (a rough sketch
> follows below).
>
> 2) Transform: create an additional vector loop with the next
> profitable vector factor.
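>
> A rough, hypothetical sketch of the costing side (names and data
> structures are illustrative, not the actual patch, apart from the
> identifyNextProfitableVF() interface Ayal quotes from it): keep every
> profitable VF instead of only the single best one, so the transform can
> later ask for the next factor to use on the epilog loop.
>
>   #include <vector>
>
>   struct VFCandidate { unsigned Width; float Cost; };
>
>   // Profitable factors recorded by the cost model, sorted by decreasing
>   // width, e.g. {32, 16, 8} for the i8 example above.
>   static std::vector<VFCandidate> ProfitableVFs;
>
>   static unsigned identifyNextProfitableVF(unsigned MainVF) {
>     for (const VFCandidate &C : ProfitableVFs)
>       if (C.Width < MainVF)
>         return C.Width;  // e.g. 8 as the epilog width when the main loop uses 32
>     return 1;            // nothing narrower is profitable: scalar remainder only
>   }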
>
> Is this something that you propose to be on by default for wide
> VPU architectures without masking support? I.e., how widely is this
> applicable? If not, then perhaps a better strategy would be to just
> annotate the remainder loop with some metadata to limit the
> vectorization factor and rerun the vectorizer.
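>
> (Analogy only: such metadata would request the same thing as the
> source-level hint below does for a hypothetical stand-alone remainder
> loop; the actual mechanism would be IR loop metadata attached by the
> vectorizer, not a pragma.)
>
>   void foo_remainder(char *A, char *B, char *C, int begin, int len) {
>   #pragma clang loop vectorize_width(8)
>     for (int i = begin; i < len; i++)
>       A[i] = B[i] + C[i];
>   }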
>
>
> Why would this solution (annotating the remainder loop to limit
> vectorization and rerunning the vectorization process) not be
> preferred regardless?
>
> One issue that might be relevant here is runtime aliasing checks,
> which are probably going to be redundant, or partially redundant,
> between the different vectorized versions. Will we be able to do any
> necessary cleanup after the fact? Moreover, we often don't hoist
> (unswitch) these checks out of inner loops (perhaps because they're
> inside the trip-count checks?); I wonder if the proposed block
> structure will make this situation better or worse (or have no overall
> effect).
>
> Thanks again,
> Hal
>
> Adam
>
> Please refer to the attached file (BlockLayout.png) for details
> about the transformed block layout.
>
> The patch is available at: https://reviews.llvm.org/D30247
>
> Regards,
>
> Ashutosh
>
> <BlockLayout.png>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory