[LLVMdev] SLP vectorizer on AVX feature

Sanjay Patel spatel at rotateright.com
Wed Jul 1 12:53:33 PDT 2015


128-bit wide vectorization is the limit for the SLP vectorizer:
https://llvm.org/bugs/show_bug.cgi?id=17170#c8

Is it possible that the cases where you saw 256-bit ops were transformed by
the loop vectorizer rather than the SLP vectorizer?
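
To make the distinction concrete, here is a minimal C sketch of the two code shapes involved. It is an illustration only, not Frank's attached function: the names add8_straight, add_loop and versions_agree are hypothetical. The straight-line form is the kind of code the SLP vectorizer inspects, and the loop form is the kind the loop vectorizer widens. With 32-bit floats, a 128-bit register holds 4 lanes and a 256-bit register holds 8. Whether a given LLVM build actually emits <8 x float> for either form depends on the version and target flags, so treat this as a sketch under those assumptions.

```c
#include <stddef.h>

/* Lane counts for 32-bit floats: 128/32 = 4 (XMM), 256/32 = 8 (YMM). */
#define FLOATS_PER_128 (128 / 32)
#define FLOATS_PER_256 (256 / 32)

/* Straight-line form (hypothetical example): eight adjacent, independent
   scalar adds with no loop. This is the shape the SLP vectorizer looks at. */
void add8_straight(float *restrict out, const float *restrict a,
                   const float *restrict b) {
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
    out[2] = a[2] + b[2];
    out[3] = a[3] + b[3];
    out[4] = a[4] + b[4];
    out[5] = a[5] + b[5];
    out[6] = a[6] + b[6];
    out[7] = a[7] + b[7];
}

/* Loop form (hypothetical example): the same arithmetic written as a loop,
   which is a candidate for the loop vectorizer instead. */
void add_loop(float *restrict out, const float *restrict a,
              const float *restrict b, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

/* Runs both forms on the same input; returns 1 if every element matches. */
int versions_agree(void) {
    float a[FLOATS_PER_256], b[FLOATS_PER_256];
    float s[FLOATS_PER_256], l[FLOATS_PER_256];
    for (size_t i = 0; i < FLOATS_PER_256; ++i) {
        a[i] = (float)i;
        b[i] = 10.0f * (float)i;
    }
    add8_straight(s, a, b);
    add_loop(l, a, b, FLOATS_PER_256);
    for (size_t i = 0; i < FLOATS_PER_256; ++i)
        if (s[i] != l[i])
            return 0;
    return 1;
}
```

Compiling each function separately and comparing the emitted vector width (xmm versus ymm registers) is one way to tell which vectorizer produced a given 256-bit op.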

On Wed, Jul 1, 2015 at 1:18 PM, Frank Winter <fwinter at jlab.org> wrote:

> Nadav,
>
> I can check whether we have a Haswell CPU running somewhere.
>
> In the meantime, here is a link to the debug output of the SLP vectorizer.
> I don't understand all of it yet, but it doesn't seem to mention
> the 8-fold vectorization opportunity. (The file is 150KB, slightly over
> the list attachment limit of 100KB, so please find it here:
> https://www.dropbox.com/s/aarivrzees30zrj/SLP.txt?dl=0)
>
> Also, in an earlier version of my application I saw on similar functions
> that the SLP vectorizer used 8xfloat on the same hardware (Sandy Bridge).
> In those versions I used LLVM 3.4 or 3.5 (trunk).
>
> Thanks,
> Frank
>
>
>
> On 07/01/2015 03:02 PM, Nadav Rotem wrote:
>
>> Frank,
>>
>> It sounds like the SLP vectorizer thinks that it is more profitable to
>> use 128-bit wide operations (because 256-bit operations are double-pumped
>> on Sandy Bridge). Did you see a different result on Haswell?
>>
>> Thanks,
>> Nadav
>>
>>
>>  On Jul 1, 2015, at 11:06 AM, Frank Winter <fwinter at jlab.org> wrote:
>>>
>>> I realized that the function parameters had no alignment attributes on
>>> them. However, even adding an alignment suitable for aligned loads on YMM,
>>> i.e. 32 bytes, didn't convince the vectorizer to use [8 x float].
>>>
>>> define void @main(i64 %lo, i64 %hi, float* noalias align 32 %arg0,
>>> float* noalias align 32 %arg1, float* noalias align 32 %arg2) {
>>> ...
>>>
>>> still results in code using only [4 x float].
>>>
>>> Thanks,
>>> Frank
>>>
>>>
>>> On 07/01/2015 10:51 AM, Frank Winter wrote:
>>>
>>>> I seem to have a problem getting the SLP vectorizer to make use of the
>>>> full 8 floats available in a SIMD vector on a Sandy Bridge CPU with AVX.
>>>> The function is attached, the CPU flags are:
>>>>
>>>> flags        : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca
>>>> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
>>>> pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology
>>>> nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2
>>>> ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes xsave avx lahf_lm
>>>> ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
>>>>
>>>> I use LLVM 3.6, checked out yesterday:
>>>>
>>>> ~/toolchain/install/llvm-3.6/bin/opt -datalayout -basicaa
>>>> -slp-vectorizer -instcombine < func_4x4x4_scalar_p_scalar.ll -S
>>>>
>>>> the output goes like:
>>>>
>>>> ; ModuleID = '<stdin>'
>>>>
>>>> define void @main(i64 %lo, i64 %hi, float* noalias %arg0, float*
>>>> noalias %arg1, float* noalias %arg2) {
>>>> entrypoint:
>>>>   %0 = bitcast float* %arg1 to <4 x float>*
>>>>   %1 = load <4 x float>* %0, align 4
>>>>   %2 = bitcast float* %arg2 to <4 x float>*
>>>>   %3 = load <4 x float>* %2, align 4
>>>>   %4 = fadd <4 x float> %3, %1
>>>>   %5 = bitcast float* %arg0 to <4 x float>*
>>>>   store <4 x float> %4, <4 x float>* %5, align 4
>>>> ....
>>>>
>>>> So it could make use of the <8 x float> available on that machine, but it
>>>> doesn't. Then I thought that maybe the YMM registers get used when
>>>> lowering the IR to machine code. However, the generated assembly doesn't
>>>> seem to support this assumption :-(
>>>>
>>>>
>>>> main:
>>>>     .cfi_startproc
>>>>     xorl    %eax, %eax
>>>>     xorl    %esi, %esi
>>>>     .align    16, 0x90
>>>> .LBB0_1:
>>>>     vmovups    (%r8,%rax), %xmm0
>>>>     vaddps    (%rcx,%rax), %xmm0, %xmm0
>>>>     vmovups    %xmm0, (%rdx,%rax)
>>>>     addq    $4, %rsi
>>>>     addq    $16, %rax
>>>>     cmpq    $61, %rsi
>>>>     jb    .LBB0_1
>>>>     retq
>>>>
>>>> I played with -mcpu and -march switches without success. In any case,
>>>> the target architecture should be detected with the -datalayout pass, right?
>>>>
>>>> Any idea what I am missing?
>>>>
>>>> Frank
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>>
>>>
>>
>
> --
> ----------------------------------------------
> Dr Frank Winter,               Staff Scientist
> Thomas Jefferson National Accelerator Facility
> 12000 Jefferson Ave, Newport News, 23606, USA
> +1-757-269-6448, fwinter at jlab.org
> ----------------------------------------------
>
>