[llvm-dev] [Proposal][RFC] Epilog loop vectorization

Wed Mar 15 14:14:42 PDT 2017

On 03/15/2017 02:03 PM, Adam Nemet wrote:
>
>> On Mar 15, 2017, at 9:46 AM, Hal Finkel <hfinkel at anl.gov 
>> <mailto:hfinkel at anl.gov>> wrote:
>>
>>
>> On 03/14/2017 07:50 PM, Adam Nemet wrote:
>>>
>>>> On Mar 14, 2017, at 11:33 AM, Hal Finkel <hfinkel at anl.gov 
>>>> <mailto:hfinkel at anl.gov>> wrote:
>>>>
>>>>
>>>>
>>>> On 03/14/2017 12:11 PM, Adam Nemet wrote:
>>>>>
>>>>>> On Mar 14, 2017, at 9:49 AM, Hal Finkel <hfinkel at anl.gov 
>>>>>> <mailto:hfinkel at anl.gov>> wrote:
>>>>>>
>>>>>>
>>>>>> On 03/14/2017 11:21 AM, Adam Nemet wrote:
>>>>>>>
>>>>>>>> On Mar 14, 2017, at 6:00 AM, Nema, Ashutosh 
>>>>>>>> <Ashutosh.Nema at amd.com <mailto:Ashutosh.Nema at amd.com>> wrote:
>>>>>>>>
>>>>>>>> Summarizing the discussion on the implementation approaches.
>>>>>>>> Discussed about two approaches, first running 
>>>>>>>> ‘InnerLoopVectorizer’ again on the epilog loop immediately 
>>>>>>>> after vectorizing the original loop within the same 
>>>>>>>> vectorization pass, the second approach where re-running 
>>>>>>>> vectorization pass and limiting vectorization factor of epilog 
>>>>>>>> loop by metadata.
>>>>>>>> <Approach-2>
>>>>>>>> Challenges with re-running the vectorizer pass:
>>>>>>>> 1)Reusing alias check result:
>>>>>>>> When vectorizer pass runs again it finds the epilog loop as a 
>>>>>>>> new loop and it may generates alias check, this new alias check 
>>>>>>>> may overkill the gains of epilog vectorization.
>>>>>>>> We should use the already computed alias check result instead 
>>>>>>>> of re computing again.
>>>>>>>> 2)Rerun the vectorizer and hoist the new alias check:
>>>>>>>> It’s not possible to hoist alias checks as its not fully 
>>>>>>>> redundant (not dominated by other checks), it’s not getting 
>>>>>>>> execute in all paths.
>>>>>>>> <Mail Attachment.jpeg>
>>>>>>>
>>>>>>>
>>>>>>> I don’t understand. Looks like you have the same alias checks 
>>>>>>> for the epilog loop too.  How is this CFG different from the 
>>>>>>> re-vectorization of the scalar loop?
>>>>>>
>>>>>> You're looking at the wrong thing. This *is* the image from 
>>>>>> re-vectorization. The other image (linked below in step (3)) 
>>>>>> shows the other option.
>>>>>
>>>>> Ah ok, the numbering confused me here.
>>>>>
>>>>>>
>>>>>>>  Would be good to have both CFGs here and highlighting the 
>>>>>>> difference.
>>>>>>>
>>>>>>> I thought that the whole point was that *if* you reached the 
>>>>>>> epilog vector loop via the first vector loop, you want to bypass 
>>>>>>> the alias checks before the epilog vector.
>>>>>>
>>>>>> Yes, but, that's not quite true now. You can also reach the 
>>>>>> epilogue loop if you fail the min-trip-count check, and so you 
>>>>>> don't know anything about the aliasing checks.
>>>>>
>>>>> OK, so we want this loops to be handled specially.  We effectively 
>>>>> say that we only vectorize this loop if it does not require any 
>>>>> alias checks or if the alias checks can be predicate-forwarded to 
>>>>> this loop from existing checks.
>>>>>
>>>>> This still seems like an orthogonal issue that may be interesting 
>>>>> to solve independently.  In other words this could be a nice 
>>>>> feature in the vectorizer anyway: the loop is estimated to be 
>>>>> low-trip count so feel free to predicate the new vector loop so 
>>>>> that the alias check result could be reused from some other block. 
>>>>>  We obviously don’t have this capability today but it’s something 
>>>>> that could be nice aside from the vectorizer.
>>>>
>>>> That sounds great. I'm not sure, however, exactly what this means. 
>>>> This "predicate forwarding" sounds like a non-local restructuring 
>>>> of the surrounding code (because the predicates aren't known to be 
>>>> true at that point, we need to search different predecessors to 
>>>> find one in which the conditions might be true and insert the 
>>>> vector loop across that predecessor edge instead). Maybe we could 
>>>> do this, for example, by calling SE->isKnownPredicate enhanced with 
>>>> some additional context sensitivity because we currently check 
>>>> dominating conditional branches for things that are AddRecs in some 
>>>> loop? Moreover, we then have the problem of restructuring the order 
>>>> of the trip-count checks (because we need to fast fail to the 
>>>> scalar loop for the smallest trip counts). Maybe we can do this the 
>>>> same way? This means finding a dominating check on the trip count 
>>>> that implies the check we're about to insert, change that check to 
>>>> the check we want, keeping the original only on the true side of 
>>>> the trip-count check (i.e. if the trip count is larger than the 
>>>> small threshold, then check the large threshold).
>>>>
>>>> If we're going to do all of that, then I'd lean toward saying that 
>>>> this does not belong in the vectorizer at all. Rather, this seems 
>>>> like something we'd want in some general transformation (this seems 
>>>> somewhat akin to what JumpThreading does). The general 
>>>> transformation seems something like this; the basic vectorized loop 
>>>> looks like this:
>>>>
>>>> int start = 0;
>>>> if (n >= vf) {
>>>>   if (check) {
>>>>     for (...; start += vf)
>>>>       ...
>>>>   }
>>>> }
>>>>
>>>> for (i = start; i < n; ++i) {
>>>>   ...
>>>> }
>>>>
>>>> and after we vectorize the epilogue loop we end up with this:
>>>>
>>>> int start = 0;
>>>> if (n >= vf) {
>>>>   if (check) {
>>>>     for (...; start += vf)
>>>>       ...
>>>>   }
>>>> }
>>>>
>>>> if (n >= vf2) {
>>>
>>> This is (n - start >= vf2) which I think makes this a bit more 
>>> complicated since now “check” is not always redundant if n >= vf2 so 
>>> hoisting becomes a cost decision.
>>>
>>>>   if (check) {
>>>
>>> But we can turn this into a predicate we already have: if (n >= vf 
>>> && check).  BTW, this is I think how Ashutosh’ approach 1 works too; 
>>> if the first vector loop does not iterate once the alias result will 
>>> be false in the epilog vector loop.
>>>
>>>>     for (...; start += vf2)
>>>>       ...
>>>>   }
>>>> }
>>>>
>>>> for (i = start; i < n; ++i) {
>>>>   ...
>>>> }
>>>>
>>>> and we need to end up with this:
>>>>
>>>> int start = 0;
>>>> if (n >= vf2) {
>>>>   if (check) {
>>>>     if (n >= vf) {
>>>>       for (...; start += vf)
>>>>         ...
>>>>     }
>>>>
>>>>     for (...; start += vf2)
>>>>       ...
>>>>   }
>>>> }
>>>>
>>>> for (i = start; i < n; ++i) {
>>>>   ...
>>>> }
>>>>
>>>> where we've recognized here that 'check' is the same in both cases, 
>>>> and that because vf2 < vf, the one trip-count check implies the 
>>>> other. This latter part seems like the part that our existing 
>>>> passes might not know what to do with currently. Thoughts?
>>>
>>>
>>> And then we get:
>>>
>>> int start = 0;
>>> *bool check_result = false;
>>> *if (n >= vf) {
>>>   if (check) {
>>> *check_result = true;*
>>>     for (...; start += vf)
>>>       ...
>>>   }
>>> }
>>>
>>> if (n - start >= vf2) {
>>>   if (*check_result*) {
>>>     for (...; start += vf2)
>>>       ...
>>>   }
>>> }
>>>
>>> for (i = start; i < n; ++i) {
>>>   …
>>> }
>>>
>>> What do you think?
>>
>> I think this seems about right. We shouldn't re-compare the trip 
>> counts in the epilogue loop if the alias check is false anyway:
>>
>> int start = 0;*
>> *if (n >= vf) {
>>   if (check) {
>>     for (...; start += vf)
>>       ...
>>   } else { goto scalar; }
>> } else { goto scalar; }
>>
>> if (n - start >= vf2) {
>>   for (...; start += vf2)
>>   ...
>> }
>>
>> scalar:
>> for (i = start; i < n; ++i) {
>>   …
>> }
>
> So to summarize, as long as we guard the epilogue loop with 'n >= vf 
> && check’, we should be able to jump-thread/predicate-foward to the 
> vectorized epilog loop and remove the checks.
>
> Is this the direction then (rerun vectorizer + enhance/add the above 
> general-purpose optimizations) or we’re still circling back to 
> handling this with a single run of the vectorizer?
>
> I think that the general difficulty of trying to get the optimal 
> version of code that has been loop-versioned multiple times is that we 
> are interested in transformation that don’t necessarily preserve 
> semantics.  I.e. it’s OK to for example to fall back to the original 
> loop in more cases since we happen to know that the original loop is 
> equivalent to the versioned loop.

I think we've circled back to handling this in a single run of the 
vectorizer for exactly this reason. Doing this after the fact would not 
be semantics preserving.

  -Hal

>
> Adam
>
...

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170315/3352c51a/attachment-0001.html>