[LLVMdev] SLP vectorizer on AVX feature

Wed Jul 1 12:18:42 PDT 2015

Nadav,

I can check whether we have a Haswell CPU running somewhere.

In the meantime, here is a link to the debug output of the SLP 
vectorizer. I don't understand all of it yet, but it doesn't seem to 
mention the 8-fold vectorization opportunity. (The file is 150KB, 
slightly over the list attachment limit of 100KB, so please find it here: 
https://www.dropbox.com/s/aarivrzees30zrj/SLP.txt?dl=0)

Also, in an earlier version of my application I saw on similar functions 
that the SLP vectorizer used 8xfloat on the same hardware (Sandy 
Bridge). In those versions I used LLVM 3.4 or 3.5 (trunk).
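
To make the 4-wide versus 8-wide distinction concrete, here is a minimal C sketch. It assumes the kernel is an element-wise add over 64 floats, which is inferred from the loop bound in the assembly quoted below and is not taken from the attached .ll file. The names lanes, add_scalar and add_bundled are hypothetical. The bundling only mimics how a vectorizer groups adjacent elements; it does not emit SIMD instructions by itself.

```c
enum { N = 64 };  /* assumed element count: 16 iterations x 4 floats in the quoted loop */

/* Number of 32-bit float lanes that fit in a register of the given width. */
static int lanes(int register_bits) { return register_bits / 32; }

/* Scalar reference: out[i] = a[i] + b[i]. */
static void add_scalar(float *out, const float *a, const float *b) {
    for (int i = 0; i < N; ++i)
        out[i] = a[i] + b[i];
}

/* Same computation, processed in bundles of `width` adjacent elements.
   Assumes width divides N. Returns the number of bundles processed. */
static int add_bundled(float *out, const float *a, const float *b, int width) {
    int bundles = 0;
    for (int i = 0; i < N; i += width, ++bundles)
        for (int j = 0; j < width; ++j)
            out[i + j] = a[i + j] + b[i + j];
    return bundles;
}
```

With 128-bit XMM registers, lanes(128) is 4 and the 64 elements need 16 bundles, matching the 16 loop iterations in the quoted assembly. With 256-bit YMM registers, lanes(256) is 8 and the same work fits in 8 bundles, which is the opportunity in question here.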

Thanks,
Frank

On 07/01/2015 03:02 PM, Nadav Rotem wrote:
> Frank,
>
> It sounds like the SLP vectorizer thinks that it is more profitable to use 128-bit wide operations (because 256-bit operations are double pumped on Sandy Bridge). Did you see a different result on Haswell?
>
> Thanks,
> Nadav
>
>
>> On Jul 1, 2015, at 11:06 AM, Frank Winter <fwinter at jlab.org> wrote:
>>
>> I realized that the function parameters had no alignment attributes on them. However, even adding an alignment suitable for aligned loads on YMM, i.e. 32 bytes, didn't convince the vectorizer to use [8 x float].
>>
>> define void @main(i64 %lo, i64 %hi, float* noalias align 32 %arg0, float* noalias align 32 %arg1, float* noalias align 32 %arg2) {
>> ...
>>
>> This still results in code using only [4 x float].
>>
>> Thanks,
>> Frank
>>
>>
>> On 07/01/2015 10:51 AM, Frank Winter wrote:
>>> I seem to have a problem getting the SLP vectorizer to make use of the full 8 floats available in a SIMD vector on a Sandy Bridge CPU with AVX. The function is attached; the CPU flags are:
>>>
>>> flags        : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
>>>
>>> I use LLVM 3.6, checked out yesterday:
>>>
>>> ~/toolchain/install/llvm-3.6/bin/opt -datalayout -basicaa -slp-vectorizer -instcombine < func_4x4x4_scalar_p_scalar.ll -S
>>>
>>> the output goes like:
>>>
>>> ; ModuleID = '<stdin>'
>>>
>>> define void @main(i64 %lo, i64 %hi, float* noalias %arg0, float* noalias %arg1, float* noalias %arg2) {
>>> entrypoint:
>>>   %0 = bitcast float* %arg1 to <4 x float>*
>>>   %1 = load <4 x float>* %0, align 4
>>>   %2 = bitcast float* %arg2 to <4 x float>*
>>>   %3 = load <4 x float>* %2, align 4
>>>   %4 = fadd <4 x float> %3, %1
>>>   %5 = bitcast float* %arg0 to <4 x float>*
>>>   store <4 x float> %4, <4 x float>* %5, align 4
>>> ....
>>>
>>> So it could make use of the <8 x float> available on that machine, but it doesn't. Then I thought that maybe the YMM registers get used when lowering the IR to machine code. However, the generated assembly doesn't support this assumption :-(
>>>
>>>
>>> main:
>>>     .cfi_startproc
>>>     xorl    %eax, %eax
>>>     xorl    %esi, %esi
>>>     .align    16, 0x90
>>> .LBB0_1:
>>>     vmovups    (%r8,%rax), %xmm0
>>>     vaddps    (%rcx,%rax), %xmm0, %xmm0
>>>     vmovups    %xmm0, (%rdx,%rax)
>>>     addq    $4, %rsi
>>>     addq    $16, %rax
>>>     cmpq    $61, %rsi
>>>     jb    .LBB0_1
>>>     retq
>>>
>>> I played with -mcpu and -march switches without success. In any case, the target architecture should be detected with the -datalayout pass, right?
>>>
>>> Any idea what I am missing?
>>>
>>> Frank
>>>
>>>
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

-- 
----------------------------------------------
Dr Frank Winter,               Staff Scientist
Thomas Jefferson National Accelerator Facility
12000 Jefferson Ave, Newport News, 23606, USA
+1-757-269-6448, fwinter at jlab.org
----------------------------------------------