[PATCH] D77635: [LV] Vectorize with FoldTail when Primary Induction is absent

Tue Apr 14 02:05:07 PDT 2020

Ayal added a comment.

In D77635#1979648 <https://reviews.llvm.org/D77635#1979648>, @skatkov wrote:

> Hello @Ayal, unfortunately this patch causes a functional regression.
>  For the test below, the vectorizer decided to vectorize the inner loop by 32 although it has only a few iterations, and this causes a miscompile.
>  Please fix it quickly or revert the patch.
>
> The reproducer:
>
>   ; ModuleID = './repro.ll'
>   source_filename = "./repro.ll"
>   target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128-ni:1-p2:32:8:8:32-ni:2"
>   target triple = "x86_64-unknown-linux-gnu"
>  
>   @global = external global i8*
>  
>   define void @hoge(i8* nonnull align 8 dereferenceable_or_null(8) %arg, i8* align 8 dereferenceable_or_null(16) %arg1) {
>   bb:
>     %tmp = load atomic i8*, i8** @global unordered, align 8
>     %tmp2 = getelementptr inbounds i8, i8* %tmp, i64 852
>     br label %bb3
>  
>   bb3:                                              ; preds = %bb12, %bb
>     %tmp4 = phi i32 [ 1, %bb ], [ %tmp15, %bb12 ]
>     %tmp5 = phi i32 [ 0, %bb ], [ %tmp8, %bb12 ]
>     br label %bb7
>  
>   bb6:                                              ; preds = %bb12
>     ret void
>  
>   bb7:                                              ; preds = %bb7, %bb3
>     %tmp8 = phi i32 [ %tmp5, %bb3 ], [ %tmp10, %bb7 ]
>     %tmp9 = phi i32 [ 1, %bb3 ], [ %tmp10, %bb7 ]
>     %tmp10 = add nuw nsw i32 %tmp9, 1
>     %tmp11 = icmp ugt i32 %tmp9, 5
>     br i1 %tmp11, label %bb12, label %bb7
>  
>   bb12:                                             ; preds = %bb7
>     %tmp13 = mul i32 %tmp8, %tmp4
>     %tmp14 = trunc i32 %tmp13 to i8
>     fence release
>     store atomic i8 %tmp14, i8* %tmp2 unordered, align 1
>     fence seq_cst
>     %tmp15 = add nuw nsw i32 %tmp4, 1
>     %tmp16 = icmp ult i32 %tmp4, 240
>     br i1 %tmp16, label %bb3, label %bb6
>   }
>
>
> ran as
>
>   > opt -passes=loop-vectorize -S -o res.ll ./repro.ll
>   

Thanks @skatkov. The test compiles for me, and the part that this patch introduces looks correct. The problem seems to be in how %tmp8 is handled: it is a live-out first-order recurrence, which fold-tail does not handle (the compare it introduces has no users). To reproduce the bug without this patch, transform the loop IV %tmp9 to start at 0 and exit the loop when equal to 4 (instead of starting at 1 and exiting at 5), and add 1 to %tmp8. It would be good to open a PR for this.
Continuing to investigate.
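
For reference, here is a minimal scalar model of the reproducer as posted, written in Python only to make the expected values easy to check. It is a sketch of the IR's scalar semantics, not of anything the vectorizer does; the function names inner_loop and last_stored_byte are made up for illustration. It shows that %tmp8 carries the previous iteration's %tmp10 and is live-out of %bb7, and that the inner loop runs 6 iterations.

```python
def inner_loop(tmp5):
    """Scalar model of %bb7. Returns (live-out %tmp8, trip count)."""
    tmp8, tmp9 = tmp5, 1               # phi values on entry from %bb3
    trips = 0
    while True:
        trips += 1
        tmp10 = tmp9 + 1               # %tmp10 = add nuw nsw i32 %tmp9, 1
        if tmp9 > 5:                   # %tmp11 = icmp ugt i32 %tmp9, 5
            return tmp8, trips         # %tmp8 is used in %bb12: live-out
        tmp8, tmp9 = tmp10, tmp10      # both phis take %tmp10 on the back edge


def last_stored_byte():
    """Scalar model of the outer loop. Returns the last i8 stored in %bb12."""
    tmp4, tmp5 = 1, 0                  # phi values on entry from %bb
    while True:
        tmp8, _ = inner_loop(tmp5)
        stored = (tmp8 * tmp4) & 0xFF  # mul i32, then trunc to i8
        if not tmp4 < 240:             # %tmp16 = icmp ult i32 %tmp4, 240
            return stored
        tmp4, tmp5 = tmp4 + 1, tmp8    # %tmp15 and %tmp8 feed the next phis


if __name__ == "__main__":
    print(inner_loop(0))
    print(last_stored_byte())
```

In this model the live-out value does not depend on the incoming %tmp5, because the recurrence is overwritten after the first iteration. A correctly vectorized loop should therefore produce the same value for %tmp8 on exit.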

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D77635