[llvm-dev] [Proposal][RFC] Epilog loop vectorization

Wed Mar 15 09:46:18 PDT 2017

On 03/14/2017 07:50 PM, Adam Nemet wrote:
>
>> On Mar 14, 2017, at 11:33 AM, Hal Finkel <hfinkel at anl.gov 
>> <mailto:hfinkel at anl.gov>> wrote:
>>
>>
>>
>> On 03/14/2017 12:11 PM, Adam Nemet wrote:
>>>
>>>> On Mar 14, 2017, at 9:49 AM, Hal Finkel <hfinkel at anl.gov 
>>>> <mailto:hfinkel at anl.gov>> wrote:
>>>>
>>>>
>>>> On 03/14/2017 11:21 AM, Adam Nemet wrote:
>>>>>
>>>>>> On Mar 14, 2017, at 6:00 AM, Nema, Ashutosh 
>>>>>> <Ashutosh.Nema at amd.com <mailto:Ashutosh.Nema at amd.com>> wrote:
>>>>>>
>>>>>> Summarizing the discussion on the implementation approaches.
>>>>>> Discussed about two approaches, first running 
>>>>>> ‘InnerLoopVectorizer’ again on the epilog loop immediately after 
>>>>>> vectorizing the original loop within the same vectorization pass, 
>>>>>> the second approach where re-running vectorization pass and 
>>>>>> limiting vectorization factor of epilog loop by metadata.
>>>>>> <Approach-2>
>>>>>> Challenges with re-running the vectorizer pass:
>>>>>> 1)Reusing alias check result:
>>>>>> When vectorizer pass runs again it finds the epilog loop as a new 
>>>>>> loop and it may generates alias check, this new alias check may 
>>>>>> overkill the gains of epilog vectorization.
>>>>>> We should use the already computed alias check result instead of 
>>>>>> re computing again.
>>>>>> 2)Rerun the vectorizer and hoist the new alias check:
>>>>>> It’s not possible to hoist alias checks as its not fully 
>>>>>> redundant (not dominated by other checks), it’s not getting 
>>>>>> execute in all paths.
>>>>>> <Mail Attachment.jpeg>
>>>>>
>>>>>
>>>>> I don’t understand. Looks like you have the same alias checks for 
>>>>> the epilog loop too.  How is this CFG different from the 
>>>>> re-vectorization of the scalar loop?
>>>>
>>>> You're looking at the wrong thing. This *is* the image from 
>>>> re-vectorization. The other image (linked below in step (3)) shows 
>>>> the other option.
>>>
>>> Ah ok, the numbering confused me here.
>>>
>>>>
>>>>>  Would be good to have both CFGs here and highlighting the difference.
>>>>>
>>>>> I thought that the whole point was that *if* you reached the 
>>>>> epilog vector loop via the first vector loop, you want to bypass 
>>>>> the alias checks before the epilog vector.
>>>>
>>>> Yes, but, that's not quite true now. You can also reach the 
>>>> epilogue loop if you fail the min-trip-count check, and so you 
>>>> don't know anything about the aliasing checks.
>>>
>>> OK, so we want this loops to be handled specially.  We effectively 
>>> say that we only vectorize this loop if it does not require any 
>>> alias checks or if the alias checks can be predicate-forwarded to 
>>> this loop from existing checks.
>>>
>>> This still seems like an orthogonal issue that may be interesting to 
>>> solve independently.  In other words this could be a nice feature in 
>>> the vectorizer anyway: the loop is estimated to be low-trip count so 
>>> feel free to predicate the new vector loop so that the alias check 
>>> result could be reused from some other block.  We obviously don’t 
>>> have this capability today but it’s something that could be nice 
>>> aside from the vectorizer.
>>
>> That sounds great. I'm not sure, however, exactly what this means. 
>> This "predicate forwarding" sounds like a non-local restructuring of 
>> the surrounding code (because the predicates aren't known to be true 
>> at that point, we need to search different predecessors to find one 
>> in which the conditions might be true and insert the vector loop 
>> across that predecessor edge instead). Maybe we could do this, for 
>> example, by calling SE->isKnownPredicate enhanced with some 
>> additional context sensitivity because we currently check dominating 
>> conditional branches for things that are AddRecs in some loop? 
>> Moreover, we then have the problem of restructuring the order of the 
>> trip-count checks (because we need to fast fail to the scalar loop 
>> for the smallest trip counts). Maybe we can do this the same way? 
>> This means finding a dominating check on the trip count that implies 
>> the check we're about to insert, change that check to the check we 
>> want, keeping the original only on the true side of the trip-count 
>> check (i.e. if the trip count is larger than the small threshold, 
>> then check the large threshold).
>>
>> If we're going to do all of that, then I'd lean toward saying that 
>> this does not belong in the vectorizer at all. Rather, this seems 
>> like something we'd want in some general transformation (this seems 
>> somewhat akin to what JumpThreading does). The general transformation 
>> seems something like this; the basic vectorized loop looks like this:
>>
>> int start = 0;
>> if (n >= vf) {
>>   if (check) {
>>     for (...; start += vf)
>>       ...
>>   }
>> }
>>
>> for (i = start; i < n; ++i) {
>>   ...
>> }
>>
>> and after we vectorize the epilogue loop we end up with this:
>>
>> int start = 0;
>> if (n >= vf) {
>>   if (check) {
>>     for (...; start += vf)
>>       ...
>>   }
>> }
>>
>> if (n >= vf2) {
>
> This is (n - start >= vf2) which I think makes this a bit more 
> complicated since now “check” is not always redundant if n >= vf2 so 
> hoisting becomes a cost decision.
>
>>   if (check) {
>
> But we can turn this into a predicate we already have: if (n >= vf && 
> check).  BTW, this is I think how Ashutosh’ approach 1 works too; if 
> the first vector loop does not iterate once the alias result will be 
> false in the epilog vector loop.
>
>>     for (...; start += vf2)
>>       ...
>>   }
>> }
>>
>> for (i = start; i < n; ++i) {
>>   ...
>> }
>>
>> and we need to end up with this:
>>
>> int start = 0;
>> if (n >= vf2) {
>>   if (check) {
>>     if (n >= vf) {
>>       for (...; start += vf)
>>         ...
>>     }
>>
>>     for (...; start += vf2)
>>       ...
>>   }
>> }
>>
>> for (i = start; i < n; ++i) {
>>   ...
>> }
>>
>> where we've recognized here that 'check' is the same in both cases, 
>> and that because vf2 < vf, the one trip-count check implies the 
>> other. This latter part seems like the part that our existing passes 
>> might not know what to do with currently. Thoughts?
>
>
> And then we get:
>
> int start = 0;
> *bool check_result = false;
> *if (n >= vf) {
>   if (check) {
> *check_result = true;*
>     for (...; start += vf)
>       ...
>   }
> }
>
> if (n - start >= vf2) {
>   if (*check_result*) {
>     for (...; start += vf2)
>       ...
>   }
> }
>
> for (i = start; i < n; ++i) {
>   …
> }
>
> What do you think?

I think this seems about right. We shouldn't re-compare the trip counts 
in the epilogue loop if the alias check is false anyway:

int start = 0;*
*if (n >= vf) {
   if (check) {
     for (...; start += vf)
       ...
   } else { goto scalar; }
} else { goto scalar; }

if (n - start >= vf2) {
   for (...; start += vf2)
   ...
}

scalar:
for (i = start; i < n; ++i) {
   …
}

  -Hal

>
> Adam
>
>>
>>  -Hal
>>
>>>
>>> Adam
>>>
>>>>
>>>>  -Hal
>>>>
>>>>>
>>>>> I still don’t understand why that’s not possible with some 
>>>>> sophisticated predicate propagation independent from the 
>>>>> vectorizer.  I am not saying it’s already possible but it should be.
>>>>>
>>>>> Adam
>>>>>
>>>>>
>>>>>> NOTE: We cannot prepone alias check as its expensive compared to 
>>>>>> other checks.
>>>>>> <Approach-1>
>>>>>> 1)Current patch depends on the existing functionality of 
>>>>>> LoopVectorizer, it uses ‘InnerLoopVectorizer’ again to vectorize 
>>>>>> the epilog loop, as it happens in the same vectorization pass we 
>>>>>> have flexibility to reuse already computed alias result check & 
>>>>>> limit vectorization factor for the epilog loop.
>>>>>> 2)It does not generate the blocks for new block layout 
>>>>>> explicitly, rather it depends on 
>>>>>> ‘InnerLoopVectorizer::createEmptyLoop’ to generate new block 
>>>>>> layout. The new block layout get automatically generated by 
>>>>>> calling the ‘InnerLoopVectorizer:: vectorize’ again.
>>>>>> 3)Block layout description with epilog loop vectorization is 
>>>>>> available at
>>>>>> https://reviews.llvm.org/file/data/fxg5vx3capyj257rrn5j/PHID-FILE-x6thnbf6ub55ep5yhalu/LayoutDescription.png
>>>>>> Approach-1 looks feasible, please comment if any objections.
>>>>>> Regards,
>>>>>> Ashutosh
>>>>>> ...
>>
>> -- 
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170315/a610dec3/attachment-0001.html>