[llvm-dev] [Proposal][RFC] Epilog loop vectorization
Hal Finkel via llvm-dev
llvm-dev at lists.llvm.org
Tue Mar 14 12:44:29 PDT 2017
On 03/14/2017 02:37 PM, Michael Kuperstein wrote:
>
>
> On Tue, Mar 14, 2017 at 11:40 AM, Hal Finkel <hfinkel at anl.gov
> <mailto:hfinkel at anl.gov>> wrote:
>
>
>
> On 03/14/2017 11:58 AM, Michael Kuperstein wrote:
>> I'm still not sure about this, for a few reasons:
>>
>> 1) I'd like to try to treat epilogue loops the same way
>> regardless of whether the main loop was vectorized by hand or
>> automatically. So if someone hand-wrote an avx-512 16-wide loop,
>> with alias checks, and we decide it's profitable to vectorize the
>> epilogue loop by 4 and re-use the checks, it ought to be done the
>> same way. I realize this may be a pipe-dream, though.
>
> I agree that sounds ideal. Identifying the effective vectorization
> factor of the hand-vectorized loop seems fragile. However, if
> someone is hand vectorizing then it seems like a small price to
> add a pragma to the scalar loop restricting the vectorization
> factor (and/or specifying that it is safe to execute). As a
> result, I'm not sure how much effort we should make here.
>
>
> I semi-agree with you, which is why I don't think it's a blocker. :-)
>
>>
>> 2) I'm still somewhat worried about "tiny loops". As I wrote
>> before, we explicitly refuse to vectorize loops we know have a
>> trip-count less than 16, because our profitability heuristic for
>> such loops is probably bad. IIUC the only reason we don't bail
>> due to the threshold is because we use the same loop for "failed
>> min iters check" and "failed alias check". So, because it's
>> reachable through the alias-check path, the max trip count isn't
>> actually known, even though the typical trip count is probably
>> small.
>> It's true that you currently don't try to vectorize the epilogue
>> if the original VF is below 16, but this is a somewhat different
>> condition.
>
> I think that the reason we have that limit, however, is that we
> don't model the costs of the checks. Not that the cost model is
> otherwise too inaccurate for low-trip-count loops. If we modeled
> the costs of the checks, then I don't think this would be a problem.
>
>
> I don't think it's just the alias checks. There's also the
> min-iteration check,
Yes, we'd need to model the cost of all of the checks (trip count,
aliasing, overflow, etc.).
> broadcasts that get hoisted out of the loop, etc.
Can you elaborate? Do you mean that we're hoisting broadcasts outside of
the trip-count check (i.e. speculatively executing them) and these are
actually expensive operations? Is this is a cost-model problem being
exposed by SimplifyCFG (which is hoisting these things outside of the
trip-count check)?
-Hal
>>
>> 3) Technically speaking, constructing a new InnerLoopVectorizer
>> to vectorize this one loop sounds weird. We already have a
>> worklist in the vectorizer that's currently running.
>
> I agree, although we do want to reuse the cost and legality
> analysis (which I think is a worthwhile engineering decision
> because that analysis involves SCEV, AA, and TTI, all of which can
> get expensive). If we can do that and also just add the new loop
> to the work queue, that certainly might be cleaner.
>
> -Hal
>
>
>>
>> I don't think (1) is a blocker, and (3) should be easy to fix,
>> but I'm not sure whether the way this is going to handle (2) is
>> sufficient. If I'm the only one that this bothers, I won't stand
>> in the way, but I'd like to at least make sure we've fully
>> considered this.
>>
>> On Mar 14, 2017 06:00, "Nema, Ashutosh" <Ashutosh.Nema at amd.com
>> <mailto:Ashutosh.Nema at amd.com>> wrote:
>>
>> Summarizing the discussion on the implementation approaches.
>>
>> Discussed about two approaches, first running
>> ‘InnerLoopVectorizer’ again on the epilog loop immediately
>> after vectorizing the original loop within the same
>> vectorization pass, the second approach where re-running
>> vectorization pass and limiting vectorization factor of
>> epilog loop by metadata.
>>
>> <Approach-2>
>>
>> Challenges with re-running the vectorizer pass:
>>
>> 1)Reusing alias check result:
>>
>> When vectorizer pass runs again it finds the epilog loop as a
>> new loop and it may generates alias check, this new alias
>> check may overkill the gains of epilog vectorization.
>>
>> We should use the already computed alias check result instead
>> of re computing again.
>>
>> 2)Rerun the vectorizer and hoist the new alias check:
>>
>> It’s not possible to hoist alias checks as its not fully
>> redundant (not dominated by other checks), it’s not getting
>> execute in all paths.
>>
>> NOTE: We cannot prepone alias check as its expensive compared
>> to other checks.
>>
>> <Approach-1>
>>
>> 1)Current patch depends on the existing functionality of
>> LoopVectorizer, it uses ‘InnerLoopVectorizer’ again to
>> vectorize the epilog loop, as it happens in the same
>> vectorization pass we have flexibility to reuse already
>> computed alias result check & limit vectorization factor for
>> the epilog loop.
>>
>> 2)It does not generate the blocks for new block layout
>> explicitly, rather it depends on
>> ‘InnerLoopVectorizer::createEmptyLoop’ to generate new block
>> layout. The new block layout get automatically generated by
>> calling the ‘InnerLoopVectorizer:: vectorize’ again.
>>
>> 3)Block layout description with epilog loop vectorization is
>> available at
>>
>> https://reviews.llvm.org/file/data/fxg5vx3capyj257rrn5j/PHID-FILE-x6thnbf6ub55ep5yhalu/LayoutDescription.png
>> <https://reviews.llvm.org/file/data/fxg5vx3capyj257rrn5j/PHID-FILE-x6thnbf6ub55ep5yhalu/LayoutDescription.png>
>>
>> Approach-1 looks feasible, please comment if any objections.
>>
>> Regards,
>>
>> Ashutosh
>>
>> *From:*Nema, Ashutosh
>> *Sent:* Wednesday, March 1, 2017 10:42 AM
>> *To:* 'Daniel Berlin' <dberlin at dberlin.org
>> <mailto:dberlin at dberlin.org>>
>> *Cc:* anemet at apple.com <mailto:anemet at apple.com>; Hal Finkel
>> <hfinkel at anl.gov <mailto:hfinkel at anl.gov>>; Zaks, Ayal
>> <ayal.zaks at intel.com <mailto:ayal.zaks at intel.com>>; Renato
>> Golin <renato.golin at linaro.org
>> <mailto:renato.golin at linaro.org>>; mkuper at google.com
>> <mailto:mkuper at google.com>; Mehdi Amini
>> <mehdi.amini at apple.com <mailto:mehdi.amini at apple.com>>;
>> llvm-dev <llvm-dev at lists.llvm.org
>> <mailto:llvm-dev at lists.llvm.org>>
>> *Subject:* RE: [llvm-dev] [Proposal][RFC] Epilog loop
>> vectorization
>>
>> Sorry I misunderstood, gvn/newgvn/gvnhoist cannot help here
>> as these checks are not dominated by all paths.
>>
>> Regards,
>>
>> Ashutosh
>>
>> *From:*Daniel Berlin [mailto:dberlin at dberlin.org]
>> *Sent:* Tuesday, February 28, 2017 6:58 PM
>> *To:* Nema, Ashutosh <Ashutosh.Nema at amd.com
>> <mailto:Ashutosh.Nema at amd.com>>
>> *Cc:* anemet at apple.com <mailto:anemet at apple.com>; Hal Finkel
>> <hfinkel at anl.gov <mailto:hfinkel at anl.gov>>; Zaks, Ayal
>> <ayal.zaks at intel.com <mailto:ayal.zaks at intel.com>>; Renato
>> Golin <renato.golin at linaro.org
>> <mailto:renato.golin at linaro.org>>; mkuper at google.com
>> <mailto:mkuper at google.com>; Mehdi Amini
>> <mehdi.amini at apple.com <mailto:mehdi.amini at apple.com>>;
>> llvm-dev <llvm-dev at lists.llvm.org
>> <mailto:llvm-dev at lists.llvm.org>>
>> *Subject:* Re: [llvm-dev] [Proposal][RFC] Epilog loop
>> vectorization
>>
>> Hoisting or removing?
>> Neither pass does hoisting, you'd need gvnhoist for that.
>>
>> Even then:
>> Staring at your example, none of the checks are fully
>> redundant (IE they are not dominated by other checks) or
>> execute on all paths.
>>
>> Thus, hoisting them would be purely speculative code motion,
>> which none of our passes do.
>>
>> If you would like these sets of checks to be removed, you
>> would need to place them in a place that they execute
>> unconditionally.
>>
>> Otherwise, this is not a standard code hoisting/removal
>> transform.
>>
>> The only redundancy i can see here at all is the repeated
>> getelementptr computation.
>>
>> If you move it to the preheader, it will be eliminated.
>>
>> Otherwise, none of the checks are redundant.
>>
>>
>> What would you hope to happen in this case?
>>
>> On Tue, Feb 28, 2017 at 5:09 AM, Nema, Ashutosh
>> <Ashutosh.Nema at amd.com <mailto:Ashutosh.Nema at amd.com>> wrote:
>>
>> I have tried running both gvn and newgvn but it did not
>> helped in hoisting the alias checks:
>>
>> Please check, maybe I have missed something.
>>
>> <TestCase>
>>
>> void foo (char *A, char *B, char *C, int len) {
>>
>> int i = 0;
>>
>> for (i=0 ; i< len; i++)
>>
>> A[i] = B[i] + C[i];
>>
>> }
>>
>> <Command>
>>
>> $ opt –O3 –gvn test.ll –o test.opt.ll
>>
>> $ opt –O3 –newgvn test.ll –o test.opt.ll
>>
>> “test.ll” is attached, it got already vectorized by the
>> approach running vectorizer twice by annotate the
>> remainder loop with metadata to limit the vectorization
>> factor for epilog vector loop.
>>
>> Regards,
>>
>> Ashutosh
>>
>> *From:*anemet at apple.com <mailto:anemet at apple.com>
>> [mailto:anemet at apple.com <mailto:anemet at apple.com>]
>> *Sent:* Tuesday, February 28, 2017 1:33 AM
>> *To:* Hal Finkel <hfinkel at anl.gov <mailto:hfinkel at anl.gov>>
>> *Cc:* Daniel Berlin <dberlin at dberlin.org
>> <mailto:dberlin at dberlin.org>>; Nema, Ashutosh
>> <Ashutosh.Nema at amd.com <mailto:Ashutosh.Nema at amd.com>>;
>> Zaks, Ayal <ayal.zaks at intel.com
>> <mailto:ayal.zaks at intel.com>>; Renato Golin
>> <renato.golin at linaro.org
>> <mailto:renato.golin at linaro.org>>; mkuper at google.com
>> <mailto:mkuper at google.com>; Mehdi Amini
>> <mehdi.amini at apple.com <mailto:mehdi.amini at apple.com>>;
>> llvm-dev <llvm-dev at lists.llvm.org
>> <mailto:llvm-dev at lists.llvm.org>>
>> *Subject:* Re: [llvm-dev] [Proposal][RFC] Epilog loop
>> vectorization
>>
>> On Feb 27, 2017, at 12:01 PM, Hal Finkel
>> <hfinkel at anl.gov <mailto:hfinkel at anl.gov>> wrote:
>>
>> On 02/27/2017 01:47 PM, Daniel Berlin wrote:
>>
>> On Mon, Feb 27, 2017 at 11:29 AM, Adam Nemet
>> <anemet at apple.com <mailto:anemet at apple.com>> wrote:
>>
>> On Feb 27, 2017, at 10:11 AM, Hal Finkel
>> <hfinkel at anl.gov
>> <mailto:hfinkel at anl.gov>> wrote:
>>
>> On 02/27/2017 11:47 AM, Adam Nemet wrote:
>>
>> On Feb 27, 2017, at 9:39 AM,
>> Daniel Berlin
>> <dberlin at dberlin.org
>> <mailto:dberlin at dberlin.org>> wrote:
>>
>> On Mon, Feb 27, 2017 at 9:29 AM,
>> Adam Nemet <anemet at apple.com
>> <mailto:anemet at apple.com>> wrote:
>>
>> On Feb 27, 2017, at 7:27
>> AM, Hal Finkel
>> <hfinkel at anl.gov
>> <mailto:hfinkel at anl.gov>>
>> wrote:
>>
>>
>> On 02/27/2017 06:29 AM,
>> Nema, Ashutosh wrote:
>>
>> Thanks for looking
>> into this.
>>
>> 1) Issues with re
>> running vectorizer:
>>
>> Vectorizer might
>> generate redundant
>> alias checks while
>> vectorizing epilog loop.
>>
>> Redundant alias
>> checks are expensive,
>> we like to reuse the
>> results of already
>> computed alias checks.
>>
>> With metadata we can
>> limit the width of
>> epilog loop, but not
>> sure about reusing
>> alias check result.
>>
>> Any thoughts on
>> rerunning vectorizer
>> with reusing the
>> alias check result ?
>>
>>
>> One way of looking at
>> this is: Reusing the
>> alias-check result is
>> really just a conditional
>> propagation problem; if
>> we don't already have an
>> optimization that can
>> combine these after the
>> fact, then we should.
>>
>> +Danny
>>
>> Isn’t Extended SSA supposed
>> to help with this?
>>
>> Yes, it will solve this with no
>> issue already. GVN probably does
>> already too.
>>
>> even if if you have
>>
>> if (a == b)
>>
>> if (a == c)
>>
>> if (a == d)
>>
>> if (a == e)
>>
>> if (a == g)
>>
>> and we can prove a ... g
>> equivalent, newgvn will eliminate
>> them all and set all the branches
>> true.
>>
>> If you need a simpler clean up
>> pass, we could run it on sub-graphs.
>>
>> Yes we probably don’t want to run a
>> full GVN after the “loop-scheduling”
>> passes.
>>
>>
>> FWIW, we could, just without the
>> memory-dependence analysis enabled (i.e.
>> set the NoLoads constructor parameter to
>> true). GVN is pretty fast in that mode.
>>
>> OK. Another data point is that I’ve seen
>> cases in the past where the alias checks
>> required for the loop passes could enable GVN
>> to remove redundant loads/stores. Currently
>> we can only pick these up with LTO when GVN
>> is rerun.
>>
>> This is just GVN brokenness, newgvn should not
>> have this problem.
>>
>> If it does, i'd love to see it.
>>
>>
>> I thought that the problem is that we just don't run
>> GVN after that point in the pipeline.
>>
>> Yeah, that is the problem but I think Danny misunderstood
>> what I was trying to say.
>>
>> This was a datapoint to possibly rerun GVN with
>> memory-awareness.
>>
>>
>> -Hal
>>
>> (I'm working on the last few parts of turning it
>> on by default, but it requires a new
>> getModRefInfo interface to be able to get the
>> last few testcases)
>>
>> --
>>
>> Hal Finkel
>>
>> Lead, Compiler Technology and Programming Languages
>>
>> Leadership Computing Facility
>>
>> Argonne National Laboratory
>>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
>
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170314/daf47c89/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/jpeg
Size: 18444 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170314/daf47c89/attachment-0001.jpe>
More information about the llvm-dev
mailing list