[llvm-dev] [Proposal][RFC] Epilog loop vectorization
Hal Finkel via llvm-dev
llvm-dev at lists.llvm.org
Wed Mar 15 14:14:42 PDT 2017
On 03/15/2017 02:03 PM, Adam Nemet wrote:
>
>> On Mar 15, 2017, at 9:46 AM, Hal Finkel <hfinkel at anl.gov
>> <mailto:hfinkel at anl.gov>> wrote:
>>
>>
>> On 03/14/2017 07:50 PM, Adam Nemet wrote:
>>>
>>>> On Mar 14, 2017, at 11:33 AM, Hal Finkel <hfinkel at anl.gov
>>>> <mailto:hfinkel at anl.gov>> wrote:
>>>>
>>>>
>>>>
>>>> On 03/14/2017 12:11 PM, Adam Nemet wrote:
>>>>>
>>>>>> On Mar 14, 2017, at 9:49 AM, Hal Finkel <hfinkel at anl.gov
>>>>>> <mailto:hfinkel at anl.gov>> wrote:
>>>>>>
>>>>>>
>>>>>> On 03/14/2017 11:21 AM, Adam Nemet wrote:
>>>>>>>
>>>>>>>> On Mar 14, 2017, at 6:00 AM, Nema, Ashutosh
>>>>>>>> <Ashutosh.Nema at amd.com <mailto:Ashutosh.Nema at amd.com>> wrote:
>>>>>>>>
>>>>>>>> Summarizing the discussion on the implementation approaches.
>>>>>>>> Discussed about two approaches, first running
>>>>>>>> ‘InnerLoopVectorizer’ again on the epilog loop immediately
>>>>>>>> after vectorizing the original loop within the same
>>>>>>>> vectorization pass, the second approach where re-running
>>>>>>>> vectorization pass and limiting vectorization factor of epilog
>>>>>>>> loop by metadata.
>>>>>>>> <Approach-2>
>>>>>>>> Challenges with re-running the vectorizer pass:
>>>>>>>> 1)Reusing alias check result:
>>>>>>>> When vectorizer pass runs again it finds the epilog loop as a
>>>>>>>> new loop and it may generates alias check, this new alias check
>>>>>>>> may overkill the gains of epilog vectorization.
>>>>>>>> We should use the already computed alias check result instead
>>>>>>>> of re computing again.
>>>>>>>> 2)Rerun the vectorizer and hoist the new alias check:
>>>>>>>> It’s not possible to hoist alias checks as its not fully
>>>>>>>> redundant (not dominated by other checks), it’s not getting
>>>>>>>> execute in all paths.
>>>>>>>> <Mail Attachment.jpeg>
>>>>>>>
>>>>>>>
>>>>>>> I don’t understand. Looks like you have the same alias checks
>>>>>>> for the epilog loop too. How is this CFG different from the
>>>>>>> re-vectorization of the scalar loop?
>>>>>>
>>>>>> You're looking at the wrong thing. This *is* the image from
>>>>>> re-vectorization. The other image (linked below in step (3))
>>>>>> shows the other option.
>>>>>
>>>>> Ah ok, the numbering confused me here.
>>>>>
>>>>>>
>>>>>>> Would be good to have both CFGs here and highlighting the
>>>>>>> difference.
>>>>>>>
>>>>>>> I thought that the whole point was that *if* you reached the
>>>>>>> epilog vector loop via the first vector loop, you want to bypass
>>>>>>> the alias checks before the epilog vector.
>>>>>>
>>>>>> Yes, but, that's not quite true now. You can also reach the
>>>>>> epilogue loop if you fail the min-trip-count check, and so you
>>>>>> don't know anything about the aliasing checks.
>>>>>
>>>>> OK, so we want this loops to be handled specially. We effectively
>>>>> say that we only vectorize this loop if it does not require any
>>>>> alias checks or if the alias checks can be predicate-forwarded to
>>>>> this loop from existing checks.
>>>>>
>>>>> This still seems like an orthogonal issue that may be interesting
>>>>> to solve independently. In other words this could be a nice
>>>>> feature in the vectorizer anyway: the loop is estimated to be
>>>>> low-trip count so feel free to predicate the new vector loop so
>>>>> that the alias check result could be reused from some other block.
>>>>> We obviously don’t have this capability today but it’s something
>>>>> that could be nice aside from the vectorizer.
>>>>
>>>> That sounds great. I'm not sure, however, exactly what this means.
>>>> This "predicate forwarding" sounds like a non-local restructuring
>>>> of the surrounding code (because the predicates aren't known to be
>>>> true at that point, we need to search different predecessors to
>>>> find one in which the conditions might be true and insert the
>>>> vector loop across that predecessor edge instead). Maybe we could
>>>> do this, for example, by calling SE->isKnownPredicate enhanced with
>>>> some additional context sensitivity because we currently check
>>>> dominating conditional branches for things that are AddRecs in some
>>>> loop? Moreover, we then have the problem of restructuring the order
>>>> of the trip-count checks (because we need to fast fail to the
>>>> scalar loop for the smallest trip counts). Maybe we can do this the
>>>> same way? This means finding a dominating check on the trip count
>>>> that implies the check we're about to insert, change that check to
>>>> the check we want, keeping the original only on the true side of
>>>> the trip-count check (i.e. if the trip count is larger than the
>>>> small threshold, then check the large threshold).
>>>>
>>>> If we're going to do all of that, then I'd lean toward saying that
>>>> this does not belong in the vectorizer at all. Rather, this seems
>>>> like something we'd want in some general transformation (this seems
>>>> somewhat akin to what JumpThreading does). The general
>>>> transformation seems something like this; the basic vectorized loop
>>>> looks like this:
>>>>
>>>> int start = 0;
>>>> if (n >= vf) {
>>>> if (check) {
>>>> for (...; start += vf)
>>>> ...
>>>> }
>>>> }
>>>>
>>>> for (i = start; i < n; ++i) {
>>>> ...
>>>> }
>>>>
>>>> and after we vectorize the epilogue loop we end up with this:
>>>>
>>>> int start = 0;
>>>> if (n >= vf) {
>>>> if (check) {
>>>> for (...; start += vf)
>>>> ...
>>>> }
>>>> }
>>>>
>>>> if (n >= vf2) {
>>>
>>> This is (n - start >= vf2) which I think makes this a bit more
>>> complicated since now “check” is not always redundant if n >= vf2 so
>>> hoisting becomes a cost decision.
>>>
>>>> if (check) {
>>>
>>> But we can turn this into a predicate we already have: if (n >= vf
>>> && check). BTW, this is I think how Ashutosh’ approach 1 works too;
>>> if the first vector loop does not iterate once the alias result will
>>> be false in the epilog vector loop.
>>>
>>>> for (...; start += vf2)
>>>> ...
>>>> }
>>>> }
>>>>
>>>> for (i = start; i < n; ++i) {
>>>> ...
>>>> }
>>>>
>>>> and we need to end up with this:
>>>>
>>>> int start = 0;
>>>> if (n >= vf2) {
>>>> if (check) {
>>>> if (n >= vf) {
>>>> for (...; start += vf)
>>>> ...
>>>> }
>>>>
>>>> for (...; start += vf2)
>>>> ...
>>>> }
>>>> }
>>>>
>>>> for (i = start; i < n; ++i) {
>>>> ...
>>>> }
>>>>
>>>> where we've recognized here that 'check' is the same in both cases,
>>>> and that because vf2 < vf, the one trip-count check implies the
>>>> other. This latter part seems like the part that our existing
>>>> passes might not know what to do with currently. Thoughts?
>>>
>>>
>>> And then we get:
>>>
>>> int start = 0;
>>> *bool check_result = false;
>>> *if (n >= vf) {
>>> if (check) {
>>> *check_result = true;*
>>> for (...; start += vf)
>>> ...
>>> }
>>> }
>>>
>>> if (n - start >= vf2) {
>>> if (*check_result*) {
>>> for (...; start += vf2)
>>> ...
>>> }
>>> }
>>>
>>> for (i = start; i < n; ++i) {
>>> …
>>> }
>>>
>>> What do you think?
>>
>> I think this seems about right. We shouldn't re-compare the trip
>> counts in the epilogue loop if the alias check is false anyway:
>>
>> int start = 0;*
>> *if (n >= vf) {
>> if (check) {
>> for (...; start += vf)
>> ...
>> } else { goto scalar; }
>> } else { goto scalar; }
>>
>> if (n - start >= vf2) {
>> for (...; start += vf2)
>> ...
>> }
>>
>> scalar:
>> for (i = start; i < n; ++i) {
>> …
>> }
>
> So to summarize, as long as we guard the epilogue loop with 'n >= vf
> && check’, we should be able to jump-thread/predicate-foward to the
> vectorized epilog loop and remove the checks.
>
> Is this the direction then (rerun vectorizer + enhance/add the above
> general-purpose optimizations) or we’re still circling back to
> handling this with a single run of the vectorizer?
>
> I think that the general difficulty of trying to get the optimal
> version of code that has been loop-versioned multiple times is that we
> are interested in transformation that don’t necessarily preserve
> semantics. I.e. it’s OK to for example to fall back to the original
> loop in more cases since we happen to know that the original loop is
> equivalent to the versioned loop.
I think we've circled back to handling this in a single run of the
vectorizer for exactly this reason. Doing this after the fact would not
be semantics preserving.
-Hal
>
> Adam
>
...
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170315/3352c51a/attachment-0001.html>
More information about the llvm-dev
mailing list