[llvm-dev] [Proposal][RFC] Epilog loop vectorization

Tue Mar 14 12:37:59 PDT 2017

On Tue, Mar 14, 2017 at 11:40 AM, Hal Finkel <hfinkel at anl.gov> wrote:

>
>
> On 03/14/2017 11:58 AM, Michael Kuperstein wrote:
>
> I'm still not sure about this, for a few reasons:
>
> 1) I'd like to try to treat epilogue loops the same way regardless of
> whether the main loop was vectorized by hand or automatically. So if
> someone hand-wrote an avx-512 16-wide loop, with alias checks, and we
> decide it's profitable to vectorize the epilogue loop by 4 and re-use the
> checks, it ought to be done the same way. I realize this may be a
> pipe-dream, though.
>
>
> I agree that sounds ideal. Identifying the effective vectorization factor
> of the hand-vectorized loop seems fragile. However, if someone is hand
> vectorizing then it seems like a small price to add a pragma to the scalar
> loop restricting the vectorization factor (and/or specifying that it is
> safe to execute). As a result, I'm not sure how much effort we should make
> here.
>
>
I semi-agree with you, which is why I don't think it's a blocker. :-)

>
> 2) I'm still somewhat worried about "tiny loops". As I wrote before, we
> explicitly refuse to vectorize loops we know have a trip-count less than
> 16, because our profitability heuristic for such loops is probably bad.
> IIUC the only reason we don't bail due to the threshold is because we use
> the same loop for "failed min iters check" and "failed alias check". So,
> because it's reachable through the alias-check path, the max trip count
> isn't actually known, even though the typical trip count is probably small.
> It's true that you currently don't try to vectorize the epilogue if the
> original VF is below 16, but this is a somewhat different condition.
>
>
> I think that the reason we have that limit, however, is that we don't
> model the costs of the checks. Not that the cost model is otherwise too
> inaccurate for low-trip-count loops. If we modeled the costs of the checks,
> then I don't think this would be a problem.
>
>
I don't think it's just the alias checks. There's also the min-iteration
check, broadcasts that get hoisted out of the loop, etc.

>
> 3) Technically speaking, constructing a new InnerLoopVectorizer to
> vectorize this one loop sounds weird. We already have a worklist in the
> vectorizer that's currently running.
>
>
> I agree, although we do want to reuse the cost and legality analysis
> (which I think is a worthwhile engineering decision because that analysis
> involves SCEV, AA, and TTI, all of which can get expensive). If we can do
> that and also just add the new loop to the work queue, that certainly might
> be cleaner.
>
>  -Hal
>
>
>
> I don't think (1) is a blocker, and (3) should be easy to fix, but I'm not
> sure whether the way this is going to handle (2) is sufficient.  If I'm the
> only one that this bothers, I won't stand in the way, but I'd like to at
> least make sure we've fully considered this.
>
> On Mar 14, 2017 06:00, "Nema, Ashutosh" <Ashutosh.Nema at amd.com> wrote:
>
>> Summarizing the discussion on the implementation approaches.
>>
>>
>>
>> Discussed about two approaches, first running ‘InnerLoopVectorizer’ again
>> on the epilog loop immediately after vectorizing the original loop within
>> the same vectorization pass, the second approach where re-running
>> vectorization pass and limiting vectorization factor of epilog loop by
>> metadata.
>>
>>
>>
>> <Approach-2>
>>
>> Challenges with re-running the vectorizer pass:
>>
>> 1)      Reusing alias check result:
>>
>> When vectorizer pass runs again it finds the epilog loop as a new loop
>> and it may generates alias check, this new alias check may overkill the
>> gains of epilog vectorization.
>>
>> We should use the already computed alias check result instead of re
>> computing again.
>>
>> 2)      Rerun the vectorizer and hoist the new alias check:
>>
>> It’s not possible to hoist alias checks as its not fully redundant (not
>> dominated by other checks), it’s not getting execute in all paths.
>>
>>
>>
>>
>>
>> NOTE: We cannot prepone alias check as its expensive compared to other
>> checks.
>>
>>
>>
>> <Approach-1>
>>
>> 1)      Current patch depends on the existing functionality of
>> LoopVectorizer, it uses ‘InnerLoopVectorizer’ again to vectorize the epilog
>> loop, as it happens in the same vectorization pass we have flexibility to
>> reuse already computed alias result check & limit vectorization factor for
>> the epilog loop.
>>
>> 2)      It does not generate the blocks for new block layout explicitly,
>> rather it depends on ‘InnerLoopVectorizer::createEmptyLoop’ to generate
>> new block layout. The new block layout get automatically generated by
>> calling the ‘InnerLoopVectorizer:: vectorize’ again.
>>
>> 3)      Block layout description with epilog loop vectorization is
>> available at
>>
>> https://reviews.llvm.org/file/data/fxg5vx3capyj257rrn5j/PHID
>> -FILE-x6thnbf6ub55ep5yhalu/LayoutDescription.png
>>
>>
>>
>> Approach-1 looks feasible, please comment if any objections.
>>
>>
>>
>> Regards,
>>
>> Ashutosh
>>
>>
>>
>>
>>
>> *From:* Nema, Ashutosh
>> *Sent:* Wednesday, March 1, 2017 10:42 AM
>> *To:* 'Daniel Berlin' <dberlin at dberlin.org>
>> *Cc:* anemet at apple.com; Hal Finkel <hfinkel at anl.gov>; Zaks, Ayal <
>> ayal.zaks at intel.com>; Renato Golin <renato.golin at linaro.org>;
>> mkuper at google.com; Mehdi Amini <mehdi.amini at apple.com>; llvm-dev <
>> llvm-dev at lists.llvm.org>
>> *Subject:* RE: [llvm-dev] [Proposal][RFC] Epilog loop vectorization
>>
>>
>>
>> Sorry I misunderstood, gvn/newgvn/gvnhoist cannot help here as these
>> checks are not dominated by all paths.
>>
>>
>>
>> Regards,
>>
>> Ashutosh
>>
>>
>>
>> *From:* Daniel Berlin [mailto:dberlin at dberlin.org <dberlin at dberlin.org>]
>> *Sent:* Tuesday, February 28, 2017 6:58 PM
>> *To:* Nema, Ashutosh <Ashutosh.Nema at amd.com>
>> *Cc:* anemet at apple.com; Hal Finkel <hfinkel at anl.gov>; Zaks, Ayal <
>> ayal.zaks at intel.com>; Renato Golin <renato.golin at linaro.org>;
>> mkuper at google.com; Mehdi Amini <mehdi.amini at apple.com>; llvm-dev <
>> llvm-dev at lists.llvm.org>
>> *Subject:* Re: [llvm-dev] [Proposal][RFC] Epilog loop vectorization
>>
>>
>>
>> Hoisting or removing?
>> Neither pass does hoisting, you'd need gvnhoist for that.
>>
>>
>>
>> Even then:
>> Staring at your example, none of the checks are fully redundant  (IE they
>> are not dominated by other checks) or execute on all paths.
>>
>>
>>
>> Thus, hoisting them  would be purely speculative code motion, which none
>> of our passes do.
>>
>>
>>
>> If you would like these sets of checks to be removed, you would need to
>> place them in a place that they execute unconditionally.
>>
>>
>>
>> Otherwise, this is not a standard code hoisting/removal transform.
>>
>>
>>
>> The only redundancy i can see here at all is the repeated getelementptr
>> computation.
>>
>> If you move it to the preheader, it will be eliminated.
>>
>> Otherwise, none of the checks are redundant.
>>
>>
>> What would you hope to happen in this case?
>>
>>
>>
>> On Tue, Feb 28, 2017 at 5:09 AM, Nema, Ashutosh <Ashutosh.Nema at amd.com>
>> wrote:
>>
>> I have tried running both gvn and newgvn but it did not helped in
>> hoisting the alias checks:
>>
>>
>>
>> Please check, maybe I have missed something.
>>
>>
>>
>> <TestCase>
>>
>> void foo (char *A, char *B, char *C, int len) {
>>
>>   int i = 0;
>>
>>   for (i=0 ; i< len; i++)
>>
>>     A[i] = B[i] + C[i];
>>
>> }
>>
>>
>>
>> <Command>
>>
>>   $ opt –O3 –gvn test.ll –o test.opt.ll
>>
>>   $ opt –O3 –newgvn test.ll –o test.opt.ll
>>
>>
>>
>> “test.ll” is attached, it got already vectorized by the approach running
>> vectorizer twice by annotate the remainder loop with metadata to limit the
>> vectorization factor for epilog vector loop.
>>
>>
>>
>> Regards,
>>
>> Ashutosh
>>
>>
>>
>> *From:* anemet at apple.com [mailto:anemet at apple.com]
>> *Sent:* Tuesday, February 28, 2017 1:33 AM
>> *To:* Hal Finkel <hfinkel at anl.gov>
>> *Cc:* Daniel Berlin <dberlin at dberlin.org>; Nema, Ashutosh <
>> Ashutosh.Nema at amd.com>; Zaks, Ayal <ayal.zaks at intel.com>; Renato Golin <
>> renato.golin at linaro.org>; mkuper at google.com; Mehdi Amini <
>> mehdi.amini at apple.com>; llvm-dev <llvm-dev at lists.llvm.org>
>> *Subject:* Re: [llvm-dev] [Proposal][RFC] Epilog loop vectorization
>>
>>
>>
>>
>>
>> On Feb 27, 2017, at 12:01 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>
>>
>>
>>
>>
>> On 02/27/2017 01:47 PM, Daniel Berlin wrote:
>>
>>
>>
>>
>>
>> On Mon, Feb 27, 2017 at 11:29 AM, Adam Nemet <anemet at apple.com> wrote:
>>
>>
>>
>> On Feb 27, 2017, at 10:11 AM, Hal Finkel <hfinkel at anl.gov> wrote:
>>
>>
>>
>>
>>
>> On 02/27/2017 11:47 AM, Adam Nemet wrote:
>>
>>
>>
>> On Feb 27, 2017, at 9:39 AM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>
>>
>>
>>
>>
>>
>>
>> On Mon, Feb 27, 2017 at 9:29 AM, Adam Nemet <anemet at apple.com> wrote:
>>
>>
>>
>> On Feb 27, 2017, at 7:27 AM, Hal Finkel <hfinkel at anl.gov> wrote:
>>
>>
>>
>>
>> On 02/27/2017 06:29 AM, Nema, Ashutosh wrote:
>>
>> Thanks for looking into this.
>>
>>
>>
>> 1) Issues with re running vectorizer:
>>
>> Vectorizer might generate redundant alias checks while vectorizing epilog
>> loop.
>>
>> Redundant alias checks are expensive, we like to reuse the results of
>> already computed alias checks.
>>
>> With metadata we can limit the width of epilog loop, but not sure about
>> reusing alias check result.
>>
>> Any thoughts on rerunning vectorizer with reusing the alias check result ?
>>
>>
>> One way of looking at this is: Reusing the alias-check result is really
>> just a conditional propagation problem; if we don't already have an
>> optimization that can combine these after the fact, then we should.
>>
>>
>>
>> +Danny
>>
>>
>>
>> Isn’t Extended SSA supposed to help with this?
>>
>>
>>
>> Yes, it will solve this with no issue already.  GVN probably does already
>> too.
>>
>>
>>
>> even if if you have
>>
>>
>>
>> if (a == b)
>>
>> if (a == c)
>>
>>  if (a == d)
>>
>>  if (a == e)
>>
>>  if (a == g)
>>
>>
>>
>>
>>
>> and  we can prove a ... g equivalent, newgvn will eliminate them all and
>> set all the branches true.
>>
>>
>>
>> If you need a simpler clean up pass, we could run it on sub-graphs.
>>
>>
>>
>> Yes we probably don’t want to run a full GVN after the “loop-scheduling”
>> passes.
>>
>>
>> FWIW, we could, just without the memory-dependence analysis enabled (i.e.
>> set the NoLoads constructor parameter to true). GVN is pretty fast in that
>> mode.
>>
>>
>>
>> OK.  Another data point is that I’ve seen cases in the past where the
>> alias checks required for the loop passes could enable GVN to remove
>> redundant loads/stores.  Currently we can only pick these up with LTO when
>> GVN is rerun.
>>
>>
>>
>> This is just GVN brokenness, newgvn should not have this problem.
>>
>> If it does, i'd love to see it.
>>
>>
>> I thought that the problem is that we just don't run GVN after that point
>> in the pipeline.
>>
>>
>>
>> Yeah, that is the problem but I think Danny misunderstood what I was
>> trying to say.
>>
>>
>>
>> This was a datapoint to possibly rerun GVN with memory-awareness.
>>
>>
>>
>>
>>  -Hal
>>
>>
>>
>> (I'm working on the last few parts of turning it on by default, but it
>> requires a new getModRefInfo interface to be able to get the last few
>> testcases)
>>
>>
>>
>>
>>
>> --
>>
>> Hal Finkel
>>
>> Lead, Compiler Technology and Programming Languages
>>
>> Leadership Computing Facility
>>
>> Argonne National Laboratory
>>
>>
>>
>>
>>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170314/058ffccc/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/jpeg
Size: 18444 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170314/058ffccc/attachment-0001.jpe>