[LLVMdev] SLP vectorizer on AVX feature

Wed Jul 1 13:18:25 PDT 2015

Hi Frank,

What does --debug-only=vectorize say?

You could also try putting the datalayout and the triple in the IR
header, just to make sure everything is set up right. LLVM will honour
those, and front-ends should create them correctly.
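
For reference, a header of that kind might look roughly like the sketch
below. The two strings are an assumption on my part (a 64-bit x86 Linux
target); they are not taken from your file, so check them against what
your front-end or clang emits for your platform.

```llvm
; Hypothetical module header, assuming a 64-bit x86 Linux target.
; Both strings are illustrative and must match the real target.
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

; The existing function definitions (e.g. @main) follow unchanged.
```

With these two lines at the top of the .ll file, opt no longer has to
guess the target from the command line alone.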

--renato

On 1 July 2015 at 19:06, Frank Winter <fwinter at jlab.org> wrote:
> I realized that the function parameters had no alignment attributes on them.
> However, even adding an alignment suitable for aligned loads on YMM, i.e. 32
> bytes, didn't convince the vectorizer to use <8 x float>.
>
> define void @main(i64 %lo, i64 %hi, float* noalias align 32 %arg0, float*
> noalias align 32 %arg1, float* noalias align 32 %arg2) {
> ...
>
> still results in code using only <4 x float>.
>
> Thanks,
> Frank
>
>
>
> On 07/01/2015 10:51 AM, Frank Winter wrote:
>>
>> I seem to have a problem getting the SLP vectorizer to make use of the full
>> 8 floats available in a SIMD vector on a Sandy Bridge CPU with AVX. The
>> function is attached, and the CPU flags are:
>>
>> flags        : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov
>> pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
>> rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc
>> aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16
>> xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes xsave avx lahf_lm ida arat epb
>> xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
>>
>> I use LLVM 3.6, checked out yesterday:
>>
>> ~/toolchain/install/llvm-3.6/bin/opt -datalayout -basicaa -slp-vectorizer
>> -instcombine < func_4x4x4_scalar_p_scalar.ll -S
>>
>> the output goes like:
>>
>> ; ModuleID = '<stdin>'
>>
>> define void @main(i64 %lo, i64 %hi, float* noalias %arg0, float* noalias
>> %arg1, float* noalias %arg2) {
>> entrypoint:
>>   %0 = bitcast float* %arg1 to <4 x float>*
>>   %1 = load <4 x float>* %0, align 4
>>   %2 = bitcast float* %arg2 to <4 x float>*
>>   %3 = load <4 x float>* %2, align 4
>>   %4 = fadd <4 x float> %3, %1
>>   %5 = bitcast float* %arg0 to <4 x float>*
>>   store <4 x float> %4, <4 x float>* %5, align 4
>> ....
>>
>> So, it could make use of the <8 x float> available on that machine, but it
>> doesn't. Then I thought that maybe the YMM registers get used when lowering
>> the IR to machine code. However, the generated assembly doesn't seem to
>> support this assumption :-(
>>
>>
>> main:
>>     .cfi_startproc
>>     xorl    %eax, %eax
>>     xorl    %esi, %esi
>>     .align    16, 0x90
>> .LBB0_1:
>>     vmovups    (%r8,%rax), %xmm0
>>     vaddps    (%rcx,%rax), %xmm0, %xmm0
>>     vmovups    %xmm0, (%rdx,%rax)
>>     addq    $4, %rsi
>>     addq    $16, %rax
>>     cmpq    $61, %rsi
>>     jb    .LBB0_1
>>     retq
>>
>> I played with -mcpu and -march switches without success. In any case, the
>> target architecture should be detected with the -datalayout pass, right?
>>
>> Any idea what I am missing?
>>
>> Frank
>>
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev