[PATCH] D31910: [LV] Fix logic to extract correct element in first order recurrences

Tue Apr 11 09:22:38 PDT 2017

anna added a comment.

Hi Mathew,

In https://reviews.llvm.org/D31910#723721, @mssimpso wrote:

> Hi Anna,
>
> I think we can probably solve this without putting a limitation on the trip count. But I'm wondering if it would be better for now to just avoid vectorizing this case, at least until we can work out a more robust code generation strategy for the first-order recurrences. Why not change RecurrenceDescriptor::isFirstOrderRecurrence to ensure the phi has no users outside the loop? I think that's the underlying issue, right? We're currently assuming that only the phi update can be used externally. But in your test case, it's the phi that's used externally. Just to make sure I understand things correctly, after vectorization we end up with something like the following for the external use:
>
>   %r.lcssa = phi i32 [ %scalar.recur, %for.body ], [ %vector.recur.extract, %middle.block ]
>
>
> So if the scalar loop is executed, %r.lcssa is the phi, but if the trip count is evenly divided and the scalar loop is not executed, %r.lcssa is the phi's last update. And we currently assume we'll only encounter something like:
>
>   %r.lcssa = phi i32 [ %update, %for.body ], [ %vector.recur.extract, %middle.block ]
>
>
> where it's the update that's used externally.

Yes, that's right. The problem only occurs when we need the phi and not it's update at the end of loop. But there's 2 checks if we want to vectorize all cases except for the incorrect code gen cases:

1. a phi node is used externally, and
2. that phi node is not the loop termination condition.

If the phi node is the loop terminating condition, we would get the correct value for the phi even if we vectorize the loop completely (Trip count% step= 0).
This is the case (for VF = 4):

  for.body:
    %inc.phi = phi i32 [ 0, %entry ], [ %inc, %for.body ]
    %val.phi = phi i32 [ 0, %entry ], [ %addx, %for.body ]
    %inc = add nsw i32 %inc.phi, 1
    %bc = zext i32 %inc.phi to i64.
    %addx = add nuw nsw i32 %inc.phi, %x
    %cmp = icmp eq i32 %inc.phi, 95. 
    br i1 %cmp, label %for.end, label %for.body

  for.end:
    ret i32 %inc.phi

> I think the solution would eventually be something like:
> 
>   middle.block:
>     %vector.recur.extract = extractelement <4 x i32> %vector.update, i32 3
>     %new.extract = extractelement <4 x i32> %vector.recur, i32 3
>   scalar.ph:
>     %scalar.recur.init = phi i32 [ %vector.recur.extract, %middle.block ], ...
>   for.end:
>     %r.lcssa = phi i32 [ %scalar.recur, %for.body ], [ %new.extract, %middle.block ]
> 
> 
> where we add an additional extract for the last element of the vector phi, if the phi is used externally. What do you think? We could disable the external use for now to fix the correctness issue and then re-enable it once we get the code generation right.

Actually, this is a good idea, it would have a lesser chance of causing any perf regression than the limiting of trip count. The reason I went with trip count limitation was I was thinking of this bug as extracting from the "correct index" i.e. the last or the second last, instead of generating 2 extracts from the phi and it's last update. And currently, we dont have a way of specifying a runtime index for the extract operation. 
With the code gen change there would also be a change to the cost model to consider this added shuffle instruction.

I'll work on the disabling this part as the first step (and work on the robust code gen as a follow up change). Thanks for the input!

https://reviews.llvm.org/D31910