[LLVMdev] SLP vectorizer on AVX feature
Frank Winter
fwinter at jlab.org
Wed Jul 1 13:04:53 PDT 2015
Sanjay,
You're right! I used the loop vectorizer in the earlier version.
Increasing the magic number in the SLP vectorizer (the 128-bit width
limit you pointed to) solved the issue. Now the code is vectorized with
AVX instructions :-)
Thanks,
Frank
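
A minimal sketch (added as an illustration, not code taken from LLVM) of the lane arithmetic behind the 4-wide versus 8-wide results in this thread: a 128-bit register holds four 32-bit floats, while a 256-bit YMM register holds eight. The names `lanes` and `vector_ops` are hypothetical, and the 64-element problem size is an assumption read off the `4x4x4` file name and the loop bound in the assembly quoted below.

```python
FLOAT_BITS = 32  # width of an LLVM `float` in bits


def lanes(register_bits, element_bits=FLOAT_BITS):
    """Return how many elements fit in one vector register."""
    if register_bits % element_bits != 0:
        raise ValueError("register width must be a multiple of element width")
    return register_bits // element_bits


def vector_ops(num_elements, register_bits, element_bits=FLOAT_BITS):
    """Return how many vector operations cover num_elements (rounded up)."""
    per_register = lanes(register_bits, element_bits)
    return -(-num_elements // per_register)  # ceiling division


if __name__ == "__main__":
    # Assumed problem size: 4 * 4 * 4 = 64 floats per array.
    for bits in (128, 256):
        print(bits, "bit:", lanes(bits), "floats per register,",
              vector_ops(64, bits), "vector adds for 64 floats")
```

Under these assumptions, a 128-bit limit gives the 16 loop iterations of `<4 x float>` seen in the assembly, and lifting it to 256 bits would halve that count.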
On 07/01/2015 03:53 PM, Sanjay Patel wrote:
> 128-bit wide vectorization is the limit for the SLP vectorizer:
> https://llvm.org/bugs/show_bug.cgi?id=17170#c8
>
> Is it possible that the cases where you saw 256-bit ops were
> transformed by the loop vectorizer rather than the SLP vectorizer?
>
> On Wed, Jul 1, 2015 at 1:18 PM, Frank Winter <fwinter at jlab.org> wrote:
>
> Nadav,
>
> I can check whether we have a Haswell CPU running somewhere.
>
> In the meantime, here is a link to the debug output of the SLP
> vectorizer. I don't understand all of it yet, but it doesn't seem
> to mention the 8-fold vectorization opportunity. (Please find it
> here, as it's 150KB and slightly over the list attachment limit of
> 100KB:
> https://www.dropbox.com/s/aarivrzees30zrj/SLP.txt?dl=0)
>
> Also, in an earlier version of my application I saw that, on similar
> functions, the SLP vectorizer used 8 x float on the same
> hardware (Sandy Bridge). In those versions I used LLVM 3.4 or 3.5
> (trunk).
>
> Thanks,
> Frank
>
>
>
> On 07/01/2015 03:02 PM, Nadav Rotem wrote:
>
> Frank,
>
> It sounds like the SLP vectorizer thinks that it is more
> profitable to use 128-bit wide operations (because 256-bit
> operations are double-pumped on Sandy Bridge). Did you see a
> different result on Haswell?
>
> Thanks,
> Nadav
>
>
> On Jul 1, 2015, at 11:06 AM, Frank Winter
> <fwinter at jlab.org> wrote:
>
> I realized that the function parameters had no alignment
> attributes on them. However, even adding an alignment
> suitable for aligned loads on YMM, i.e. 32 bytes, didn't
> convince the vectorizer to use [8 x float].
>
> define void @main(i64 %lo, i64 %hi, float* noalias align
> 32 %arg0, float* noalias align 32 %arg1, float* noalias
> align 32 %arg2) {
> ...
>
> still results in code using only [4 x float].
>
> Thanks,
> Frank
>
>
> On 07/01/2015 10:51 AM, Frank Winter wrote:
>
> I am having trouble getting the SLP vectorizer to
> make use of the full 8 floats available in a SIMD
> vector on a Sandy Bridge CPU with AVX. The function is
> attached; the CPU flags are:
>
> flags : fpu vme de pse tsc msr pae mce cx8 apic
> mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr
> sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm
> constant_tsc arch_perfmon pebs bts rep_good xtopology
> nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor
> ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1
> sse4_2 x2apic popcnt aes xsave avx lahf_lm ida arat
> epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority
> ept vpid
>
> I am using LLVM 3.6, checked out yesterday:
>
> ~/toolchain/install/llvm-3.6/bin/opt -datalayout
> -basicaa -slp-vectorizer -instcombine <
> func_4x4x4_scalar_p_scalar.ll -S
>
> The output looks like this:
>
> ; ModuleID = '<stdin>'
>
> define void @main(i64 %lo, i64 %hi, float* noalias
> %arg0, float* noalias %arg1, float* noalias %arg2) {
> entrypoint:
> %0 = bitcast float* %arg1 to <4 x float>*
> %1 = load <4 x float>* %0, align 4
> %2 = bitcast float* %arg2 to <4 x float>*
> %3 = load <4 x float>* %2, align 4
> %4 = fadd <4 x float> %3, %1
> %5 = bitcast float* %arg0 to <4 x float>*
> store <4 x float> %4, <4 x float>* %5, align 4
> ....
>
> So it could make use of the <8 x float> available on that
> machine, but it doesn't. Then I thought that maybe
> the YMM registers get used when lowering the IR to
> machine code. However, the generated assembly doesn't
> seem to support this assumption :-(
>
>
> main:
> .cfi_startproc
> xorl %eax, %eax
> xorl %esi, %esi
> .align 16, 0x90
> .LBB0_1:
> vmovups (%r8,%rax), %xmm0
> vaddps (%rcx,%rax), %xmm0, %xmm0
> vmovups %xmm0, (%rdx,%rax)
> addq $4, %rsi
> addq $16, %rax
> cmpq $61, %rsi
> jb .LBB0_1
> retq
>
> I played with -mcpu and -march switches without
> success. In any case, the target architecture should
> be detected with the -datalayout pass, right?
>
> Any idea what I am missing?
>
> Frank
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu
> http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
>
>
> --
> ----------------------------------------------
> Dr Frank Winter, Staff Scientist
> Thomas Jefferson National Accelerator Facility
> 12000 Jefferson Ave, Newport News, 23606, USA
> +1-757-269-6448, fwinter at jlab.org
> ----------------------------------------------
>