[llvm-dev] LoopVectorize fails to vectorize loops with induction variables with PtrToInt/IntToPtr conversions

Sat Jun 17 16:07:44 PDT 2017

Sorry, hit reply instead of forward :)

On Sat, Jun 17, 2017 at 4:07 PM, Davide Italiano <davide at freebsd.org> wrote:
> FYI.
>
> On Sat, Jun 17, 2017 at 3:41 PM, Adrien Guinet via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> Hello all,
>>
>> There is a missed vectorization opportunity with clang 4.0 on the
>> attached file.
>>
>> When compiled with -O2, the "op_distance" function gets vectorized,
>> but the "op" function does not.
>>
>> For information, this test case has been reduced from a file generated
>> by the Pythran compiler (https://github.com/serge-sans-paille/pythran).
>>
>> If we take a look at the generated IR without vectorization (using the
>> -fno-vectorize clang flag), we get:
>>
>>> $ clang -O2 -S -emit-llvm op_zip_iterator.cpp -std=c++11 -o - -fno-vectorize
>>
>>> ; Function Attrs: norecurse uwtable
>>> define void @_Z11op_distancePi16add_zip_iteratorS0_(i32* nocapture, i32*, i32* nocapture readonly, i32*, i32* nocapture readnone) local_unnamed_addr #0 {
>>> ; This one is vectorized!
>>>   %6 = ptrtoint i32* %1 to i64
>>>   %7 = ptrtoint i32* %3 to i64
>>>   %8 = sub i64 %7, %6
>>>   %9 = icmp sgt i64 %8, 0
>>>   br i1 %9, label %10, label %26
>>>
>>> ; <label>:10:                                     ; preds = %5
>>>   %11 = lshr exact i64 %8, 2
>>>   br label %12
>>>
>>> ; <label>:12:                                     ; preds = %12, %10
>>>   %13 = phi i64 [ %23, %12 ], [ %11, %10 ]
>>>   %14 = phi i32* [ %22, %12 ], [ %0, %10 ]
>>>   %15 = phi i32* [ %21, %12 ], [ %2, %10 ]
>>>   %16 = phi i32* [ %20, %12 ], [ %1, %10 ]
>>>   %17 = load i32, i32* %16, align 4, !tbaa !1
>>>   %18 = load i32, i32* %15, align 4, !tbaa !1
>>>   %19 = add nsw i32 %18, %17
>>>   store i32 %19, i32* %14, align 4, !tbaa !1
>>>   %20 = getelementptr inbounds i32, i32* %16, i64 1
>>>   %21 = getelementptr inbounds i32, i32* %15, i64 1
>>>   %22 = getelementptr inbounds i32, i32* %14, i64 1
>>>   %23 = add nsw i64 %13, -1
>>>   %24 = icmp sgt i64 %13, 1
>>>   br i1 %24, label %12, label %25
>>>
>>> ; <label>:25:                                     ; preds = %12
>>>   br label %26
>>>
>>> ; <label>:26:                                     ; preds = %25, %5
>>>   ret void
>>> }
>>>
>>> ; Function Attrs: norecurse uwtable
>>> define void @_Z2opPi16add_zip_iteratorS0_(i32* nocapture, i32*, i32* nocapture readonly, i32*, i32* nocapture readnone) local_unnamed_addr #0 {
>>> ; This one isn't!
>>>   %6 = ptrtoint i32* %1 to i64
>>>   %7 = ptrtoint i32* %3 to i64
>>>   %8 = sub i64 %6, %7
>>>   %9 = icmp sgt i64 %8, 0
>>>   br i1 %9, label %10, label %28
>>>
>>> ; <label>:10:                                     ; preds = %5
>>>   %11 = lshr exact i64 %8, 2
>>>   br label %12
>>>
>>> ; <label>:12:                                     ; preds = %12, %10
>>>   %13 = phi i64 [ %25, %12 ], [ %11, %10 ]
>>>   %14 = phi i32* [ %24, %12 ], [ %0, %10 ]
>>>   %15 = phi i32* [ %23, %12 ], [ %2, %10 ]
>>>   %16 = phi i64 [ %22, %12 ], [ %6, %10 ]
>>>   %17 = inttoptr i64 %16 to i32*
>>>   %18 = load i32, i32* %17, align 4, !tbaa !1
>>>   %19 = load i32, i32* %15, align 4, !tbaa !1
>>>   %20 = add nsw i32 %19, %18
>>>   store i32 %20, i32* %14, align 4, !tbaa !1
>>>   %21 = getelementptr inbounds i32, i32* %17, i64 1
>>>   %22 = ptrtoint i32* %21 to i64
>>>   %23 = getelementptr inbounds i32, i32* %15, i64 1
>>>   %24 = getelementptr inbounds i32, i32* %14, i64 1
>>>   %25 = add nsw i64 %13, -1
>>>   %26 = icmp sgt i64 %13, 1
>>>   br i1 %26, label %12, label %27
>>>
>>> ; <label>:27:                                     ; preds = %12
>>>   br label %28
>>>
>>> ; <label>:28:                                     ; preds = %27, %5
>>>   ret void
>>> }
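
A side note for readers matching the two IR functions back to the C++ source: the mangled names can be decoded with the Itanium ABI demangler. The helper below is only a sketch. The name `demangle` is made up for illustration, and it assumes a toolchain that ships <cxxabi.h> (GCC or Clang targeting the Itanium C++ ABI).

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Hypothetical helper: decodes an Itanium C++ ABI symbol name.
// Falls back to returning the input unchanged if demangling fails.
std::string demangle(const char* mangled) {
    assert(mangled != nullptr);
    int status = 0;
    // With a null output buffer, __cxa_demangle allocates the result with malloc.
    char* out = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    std::string result = (status == 0 && out != nullptr) ? out : mangled;
    std::free(out);  // free(nullptr) is a no-op
    return result;
}
```

Both symbols decode to a function taking (int*, add_zip_iterator, add_zip_iterator). The five i32* parameters in the IR are presumably the output pointer followed by each two-pointer iterator split into its two fields.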
>>
>> If we compile only the "op" function with debug output enabled, here
>> is the output:
>>
>>> $ clang -O2 -S -emit-llvm op_zip_iterator.cpp -std=c++11 -o - -fno-vectorize |~/dev/epona-llvm/build_debug_shared/bin/opt -debug -debug-only loop-vectorize -O2 -S
>>>
>>> LV: Checking a loop in "_Z2opPi16add_zip_iteratorS0_" from <stdin>
>>> LV: Loop hints: force=? width=0 unroll=0
>>> LV: Found a loop:
>>> LV: Found an induction variable.
>>> LV: Found an induction variable.
>>> LV: Found an induction variable.
>>> LV: Found an unidentified PHI.  %16 = phi i64 [ %22, %12 ], [ %6, %10 ]
>>> LV: Can't vectorize the instructions or CFG
>>> LV: Not vectorizing: Cannot prove legality.
>>> [...]
>>
>> The issue seems to be that the phi node "%16" cannot be identified as
>> an induction variable. Looking closer, the cause seems to be in
>> ScalarEvolution, in the createSCEV function
>> (http://llvm.org/docs/doxygen/html/ScalarEvolution_8cpp_source.html#l04770):
>>
>>>  // It's tempting to handle inttoptr and ptrtoint as no-ops, however this can
>>>  // lead to pointer expressions which cannot safely be expanded to GEPs,
>>>  // because ScalarEvolution doesn't respect the GEP aliasing rules when
>>>  // simplifying integer expressions.
>>
>> Indeed, SCEV (legitimately) does not treat inttoptr/ptrtoint as no-ops
>> and does not look through them. In our case, this means that the GEP
>> in %21, which is fed by the inttoptr in %17, is not analyzed by SCEV,
>> and the PHI %16 is therefore not recognized as an induction variable.
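
To make the difference between the two loops concrete, here is a small C++ model of the two cursor shapes seen in the IR. It is only a sketch: the function names are made up for illustration, it is not the attached source, and a compiler is free to emit different IR for it. It assumes a flat address space in which converting a pointer to std::uintptr_t and back round-trips, and in which advancing an int pointer by one element adds sizeof(int) to the integer value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Shape of the "op_distance" loop: the input cursor stays a pointer
// (like the i32* phi %16 in the vectorized function).
void add_with_pointer_cursor(int* out, const int* a, const int* b, std::ptrdiff_t n) {
    for (std::ptrdiff_t k = 0; k < n; ++k) {
        out[k] = b[k] + *a;
        ++a;  // like the GEP that feeds the pointer phi
    }
}

// Shape of the "op" loop: the same cursor is carried across iterations
// as an integer and converted back to a pointer before each load.
void add_with_integer_cursor(int* out, const int* a, const int* b, std::ptrdiff_t n) {
    const std::uintptr_t start = reinterpret_cast<std::uintptr_t>(a);  // like %6 = ptrtoint
    std::uintptr_t cursor = start;                                      // like the i64 phi %16
    for (std::ptrdiff_t k = 0; k < n; ++k) {
        // Assumed flat addressing: the cursor is affine in the iteration count.
        assert(cursor == start + static_cast<std::uintptr_t>(k) * sizeof(int));
        const int* p = reinterpret_cast<const int*>(cursor);            // like %17 = inttoptr
        out[k] = b[k] + *p;
        cursor = reinterpret_cast<std::uintptr_t>(p + 1);               // like %21 (GEP) then %22 (ptrtoint)
    }
}

// Illustrative check that both shapes compute the same element-wise sums.
// The caller is expected to pass vectors of equal length.
bool cursors_agree(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> r1(a.size()), r2(a.size());
    const auto n = static_cast<std::ptrdiff_t>(a.size());
    add_with_pointer_cursor(r1.data(), a.data(), b.data(), n);
    add_with_integer_cursor(r2.data(), a.data(), b.data(), n);
    return r1 == r2;
}
```

The assert spells out the "start plus k times a constant step" relationship that an induction-variable analysis needs to prove. In the integer-cursor form that relationship only becomes visible by looking through both casts, which is what SCEV declines to do.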
>>
>> To confirm this hypothesis, I created a small out-of-tree pass
>> (https://github.com/aguinet/llvm-intptrcleanup) which registers before
>> loop vectorization and does the following:
>>
>> * first, it searches for PHI nodes that have these properties:
>>   - every incoming value of the PHI node is a ptrtoint instruction, and
>> the source pointer type of every such ptrtoint is the same type T.
>>   - every user of the PHI node is an inttoptr instruction that converts
>> back to that type T
>> * for each of these PHI nodes, it creates a new PHI node that takes the
>> original pointers as incoming values, and replaces the uses of the
>> inttoptr instructions that use the original PHI node with the new one
>> * it then removes those inttoptr instructions and the original PHI node
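
The steps listed above can be sketched on a toy data structure. To be clear, this is NOT the LLVM API and not the code of the linked pass: `Value`, `Kind`, `commonPointerType` and `rewriteToPointerPhi` are hypothetical names, types are plain strings, and use lists are maintained by hand. It only illustrates the matching conditions and the rewiring.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for an IR value (hypothetical, not llvm::Value).
enum class Kind { Pointer, PtrToInt, IntToPtr, Phi, Other };

struct Value {
    Kind kind;
    std::string type;              // result type, e.g. "i32*" or "i64"
    std::vector<Value*> operands;  // incoming values of a phi, or the source of a cast
    std::vector<Value*> users;     // values that use this one
};

// Returns the common pointer type T if `phi` matches the pattern, "" otherwise.
std::string commonPointerType(const Value& phi) {
    if (phi.kind != Kind::Phi || phi.operands.empty() || phi.users.empty()) return "";
    std::string t;
    for (const Value* in : phi.operands) {
        // Every incoming value must be a ptrtoint from the same pointer type.
        if (in->kind != Kind::PtrToInt || in->operands.size() != 1) return "";
        const std::string& src = in->operands[0]->type;
        if (t.empty()) t = src;
        else if (t != src) return "";
    }
    for (const Value* u : phi.users) {
        // Every user must be an inttoptr back to that same type.
        if (u->kind != Kind::IntToPtr || u->type != t) return "";
    }
    return t;
}

// Fills `newPhi` with a pointer-typed phi and redirects the users of the
// inttoptr casts to it. Returns false, leaving everything untouched, if the
// pattern does not match. The caller owns the storage of `newPhi`.
bool rewriteToPointerPhi(Value& phi, Value& newPhi) {
    const std::string t = commonPointerType(phi);
    if (t.empty()) return false;
    newPhi = Value{Kind::Phi, t, {}, {}};
    for (Value* in : phi.operands) {
        newPhi.operands.push_back(in->operands[0]);  // the original pointers
    }
    for (Value* cast : phi.users) {
        for (Value* user : cast->users) {
            std::replace(user->operands.begin(), user->operands.end(), cast, &newPhi);
            newPhi.users.push_back(user);
        }
        cast->users.clear();  // the inttoptr is now dead
    }
    phi.users.clear();  // the integer phi is now dead
    return true;
}
```

In this toy model the dead casts and the old phi are only disconnected, not erased, and the use lists of the ptrtoint sources are not updated. A real pass would also have to do that bookkeeping.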
>>
>> The way I understand inttoptr and ptrtoint, this transformation should
>> be valid (but I might have missed something!). Please note that this is
>> a quick'n'dirty pass, which hasn't been heavily tested. Using this pass,
>> the previous example is now vectorized correctly by the loop vectorizer.
>> This can be seen by looking at the output of:
>>
>>> $ clang -Xclang -load -Xclang IntToPtrCleanup.so -O2 ./example/op_zip_operator.cpp -S -emit-llvm -o - -std=c++11
>>
>> The remaining question is how this should be fixed properly:
>>
>> 1) make SCEV support these inttoptr/ptrtoint conversions, which are
>> no-ops in this case
>> 2) insert the above transformation at some point in the optimization
>> pipeline
>> 3) fix the pass (or passes) that generated this pattern in the first
>> place.
>>
>> I have to admit I'm not really sure which option is best. 3) seems to
>> be the way to go, but might require some tedious work and does not
>> prevent the issue from coming back in the future. 2) seems to be a
>> quick patch that could be added to some "canonicalization" pass,
>> provided it is a valid transformation in the first place. I don't know
>> SCEV well enough to judge the difficulty/feasibility of 1).
>>
>> This mail is thus to discuss this issue and how to fix this properly :)
>>
>> Thanks everyone :)
>>
>> --
>> Adrien Guinet.
>>
>>
>
>
>
> --
> Davide
>
> "There are no solved problems; there are only problems that are more
> or less solved" -- Henri Poincare

