[LLVMdev] SLP vectorizer on AVX feature

Wed Jul 1 13:04:53 PDT 2015

Sanjay,

You're right! I used the loop vectorizer in the earlier version.

Increasing the magic number in the SLP vectorizer (the 128-bit vector
width limit you mentioned) solved the issue. Now the code is vectorized
with AVX instructions :-)

Thanks,
Frank
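
For readers following the thread, the difference between the observed
<4 x float> and the hoped-for <8 x float> is just register width divided
by element width. The following is a minimal sketch of that arithmetic,
not code from LLVM; the names lanes, XMM_BITS and YMM_BITS are
illustrative assumptions.

```python
# Lanes per vector register = register width in bits / element width in bits.
# All names here are hypothetical helpers for illustration, not LLVM APIs.

FLOAT_BITS = 32   # an LLVM 'float' is a 32-bit IEEE-754 single
XMM_BITS = 128    # SSE/AVX XMM register width
YMM_BITS = 256    # AVX YMM register width


def lanes(register_bits: int, element_bits: int = FLOAT_BITS) -> int:
    """Return how many elements of element_bits fit in one register."""
    if register_bits % element_bits != 0:
        raise ValueError("register width must be a multiple of element width")
    return register_bits // element_bits


if __name__ == "__main__":
    print("floats per XMM register:", lanes(XMM_BITS))
    print("floats per YMM register:", lanes(YMM_BITS))
```

With a 128-bit cap on the vector width, the vectorizer can form at most
4 float lanes per operation, regardless of what the hardware offers.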

On 07/01/2015 03:53 PM, Sanjay Patel wrote:
> 128-bit wide vectorization is the limit for the SLP vectorizer:
> https://llvm.org/bugs/show_bug.cgi?id=17170#c8
>
> Is it possible that the cases where you saw 256-bit ops were 
> transformed by the loop vectorizer rather than the SLP vectorizer?
>
> On Wed, Jul 1, 2015 at 1:18 PM, Frank Winter <fwinter at jlab.org> wrote:
>
>     Nadav,
>
>     I can check whether we have a Haswell CPU running somewhere.
>
>     In the meantime, here is a link to the debug output of the SLP
>     vectorizer. I don't understand all of it yet, but it doesn't seem
>     to mention the 8-fold vectorization opportunity. (It is posted
>     externally because, at 150KB, it is over the list attachment
>     limit of 100KB:
>     https://www.dropbox.com/s/aarivrzees30zrj/SLP.txt?dl=0)
>
>     Also, in an earlier version of my application I saw the SLP
>     vectorizer use 8 x float for similar functions on the same
>     hardware (Sandy Bridge). Those versions used LLVM 3.4 or 3.5
>     (trunk).
>
>     Thanks,
>     Frank
>
>
>
>     On 07/01/2015 03:02 PM, Nadav Rotem wrote:
>
>         Frank,
>
>         It sounds like the SLP vectorizer thinks that it is more
>         profitable to use 128-bit wide operations (because 256-bit
>         operations are double pumped on Sandy Bridge). Did you see a
>         different result on Haswell?
>
>         Thanks,
>         Nadav
>
>
>             On Jul 1, 2015, at 11:06 AM, Frank Winter
>             <fwinter at jlab.org> wrote:
>
>             I realized that the function parameters had no alignment
>             attributes on them. However, even adding an alignment
>             suitable for aligned loads on YMM, i.e. 32 bytes, didn't
>             convince the vectorizer to use [8 x float].
>
>             define void @main(i64 %lo, i64 %hi, float* noalias align
>             32 %arg0, float* noalias align 32 %arg1, float* noalias
>             align 32 %arg2) {
>             ...
>
>             still results in code using only [4 x float].
>
>             Thanks,
>             Frank
>
>
>             On 07/01/2015 10:51 AM, Frank Winter wrote:
>
>                 I am having trouble getting the SLP vectorizer to
>                 make use of the full 8 floats available in a SIMD
>                 vector on a Sandy Bridge CPU with AVX. The function is
>                 attached; the CPU flags are:
>
>                 flags        : fpu vme de pse tsc msr pae mce cx8 apic
>                 mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr
>                 sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm
>                 constant_tsc arch_perfmon pebs bts rep_good xtopology
>                 nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor
>                 ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1
>                 sse4_2 x2apic popcnt aes xsave avx lahf_lm ida arat
>                 epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority
>                 ept vpid
>
>                 I use LLVM 3.6, checked out yesterday:
>
>                 ~/toolchain/install/llvm-3.6/bin/opt -datalayout
>                 -basicaa -slp-vectorizer -instcombine <
>                 func_4x4x4_scalar_p_scalar.ll -S
>
>                 The output looks like this:
>
>                 ; ModuleID = '<stdin>'
>
>                 define void @main(i64 %lo, i64 %hi, float* noalias
>                 %arg0, float* noalias %arg1, float* noalias %arg2) {
>                 entrypoint:
>                   %0 = bitcast float* %arg1 to <4 x float>*
>                   %1 = load <4 x float>* %0, align 4
>                   %2 = bitcast float* %arg2 to <4 x float>*
>                   %3 = load <4 x float>* %2, align 4
>                   %4 = fadd <4 x float> %3, %1
>                   %5 = bitcast float* %arg0 to <4 x float>*
>                   store <4 x float> %4, <4 x float>* %5, align 4
>                 ....
>
>                 So it could make use of the <8 x float> available on
>                 that machine, but it doesn't. Then I thought that maybe
>                 the YMM registers get used when lowering the IR to
>                 machine code. However, the generated assembly doesn't
>                 seem to support this assumption :-(
>
>
>                 main:
>                     .cfi_startproc
>                     xorl    %eax, %eax
>                     xorl    %esi, %esi
>                     .align    16, 0x90
>                 .LBB0_1:
>                     vmovups    (%r8,%rax), %xmm0
>                     vaddps    (%rcx,%rax), %xmm0, %xmm0
>                     vmovups    %xmm0, (%rdx,%rax)
>                     addq    $4, %rsi
>                     addq    $16, %rax
>                     cmpq    $61, %rsi
>                     jb    .LBB0_1
>                     retq
>
>                 I played with -mcpu and -march switches without
>                 success. In any case, the target architecture should
>                 be detected with the -datalayout pass, right?
>
>                 Any idea what I am missing?
>
>                 Frank
>
>
>
>                 _______________________________________________
>                 LLVM Developers mailing list
>                 LLVMdev at cs.uiuc.edu
>                 http://llvm.cs.uiuc.edu
>                 http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
>
>
>     -- 
>     ----------------------------------------------
>     Dr Frank Winter,               Staff Scientist
>     Thomas Jefferson National Accelerator Facility
>     12000 Jefferson Ave, Newport News, 23606, USA
>     +1-757-269-6448, fwinter at jlab.org
>     ----------------------------------------------
>
>
>
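
The assembly loop quoted in the original message can be read as a
bottom-tested loop: the counter in %rsi advances by 4 floats per trip
(16 bytes in %rax) and the loop repeats while the counter is below 61.
The sketch below models only that counter logic to work out how much
data each pass covers. It is an interpretation of the listing, not
output from any tool; the helper name trips and the 8-lane figure are
hypothetical.

```python
# Model of the loop control in the quoted assembly:
#   addq $4, %rsi ; cmpq $61, %rsi ; jb .LBB0_1
# The function and constant names are illustrative assumptions.

BYTES_PER_FLOAT = 4


def trips(step: int, bound: int) -> int:
    """Count iterations of: do { counter += step } while (counter < bound)."""
    counter = 0
    count = 0
    while True:
        count += 1            # one execution of the loop body
        counter += step       # addq $4, %rsi
        if counter >= bound:  # cmpq $61, %rsi ; jb falls through
            return count


XMM_TRIPS = trips(step=4, bound=61)
TOTAL_FLOATS = XMM_TRIPS * 4
TOTAL_BYTES = TOTAL_FLOATS * BYTES_PER_FLOAT

# Hypothetical: the same data processed 8 floats at a time in YMM registers.
YMM_TRIPS_HYPOTHETICAL = TOTAL_FLOATS // 8

if __name__ == "__main__":
    print("trips with 4 lanes:", XMM_TRIPS)
    print("floats processed:", TOTAL_FLOATS)
    print("bytes processed:", TOTAL_BYTES)
    print("trips with 8 lanes (hypothetical):", YMM_TRIPS_HYPOTHETICAL)
```

Under this reading the loop covers 64 floats, which is consistent with
the 4x4x4 in the input file name, and an 8-wide version would need half
as many trips.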