[LLVMdev] Why is the loop vectorizer not working on my function?

Sat Oct 26 18:42:26 PDT 2013

Hi Arnold,

thanks for the detailed setup. Still, I haven't figured out the right 
thing to do.

I would need only the native target since all generated code will 
execute on the JIT execution machine (right now, the old JIT interface). 
There is no need for other targets.

Maybe it would be good to ask specific questions:

How do I get the triple for the native target?

How do I setup the target transform info pass?

I guess I need the target machine for all of this. 'opt' defines the 
GetTargetMachine() function which takes a Triple (my native triple). It 
calls GetTargetOptions(). This function, as implemented in 'opt', looks 
very static to me and surely doesn't fit all machines. Do I need this 
for my setup?

Frank

On 26/10/13 20:43, Arnold Schwaighofer wrote:
> Hi Frank,
>
>
> On Oct 26, 2013, at 6:29 PM, Frank Winter <fwinter at jlab.org> wrote:
>
>> I would need this to work when calling the vectorizer through
>> the function pass manager. Unfortunately I am having the same
>> problem there:
>
> I am not sure which function pass manager you are referring here. I assume you create your own (you are not using opt but configure your own pass manager)?
>
> Here is what opt does when it sets up its pass pipeline.
>
> You need to have your/the target add the target’s analysis passes:
>
> opt.cpp:
>
>    PassManager Passes;
>
>    … // Add target library info and data layout.
>
>    Triple ModuleTriple(M->getTargetTriple());
>    TargetMachine *Machine = 0;
>    if (ModuleTriple.getArch())
>      Machine = GetTargetMachine(Triple(ModuleTriple));
>    OwningPtr<TargetMachine> TM(Machine);
>
>    // Add internal analysis passes from the target machine.
>    if (TM.get())
>      TM->addAnalysisPasses(Passes); // <<<
>    …
>    TM->addAnalysisPasses(FPManager);
>
> Here is what the target does:
>
> void X86TargetMachine::addAnalysisPasses(PassManagerBase &PM) {
>    // Add first the target-independent BasicTTI pass, then our X86 pass. This
>    // allows the X86 pass to delegate to the target independent layer when
>    // appropriate.
>    PM.add(createBasicTargetTransformInfoPass(this));
>    PM.add(createX86TargetTransformInfoPass(this));
> }
>
>
>
>> LV: The Widest type: 32 bits.
>> LV: The Widest register is: 32 bits.
> This strongly looks like no target has added a target transform info pass so we default to NoTTI.
>
> But, it could also be that you don’t have the right sub target (in which case you need to set the right cpu, “-mcpu” in opt, when the target machine is created):
>
> unsigned X86TTI::getRegisterBitWidth(bool Vector) const {
>    if (Vector) {
>      if (ST->hasAVX()) return 256;
>      if (ST->hasSSE1()) return 128;
>      return 0;
>    }
>
>    if (ST->is64Bit())
>      return 64;
>    return 32;
>
> }
>
>
>
>> It's not picking the target information, although I tried with and
>> without the target triple in the module
>>
>> Any idea what could be wrong?
>>
>> Frank
>>
>>
>> On 26/10/13 15:54, Hal Finkel wrote:
>>> ----- Original Message -----
>>>>>>> LV: The Widest type: 32 bits.
>>>>>>> LV: The Widest register is: 32 bits.
>>>> Yep, we don’t pick up the right TTI.
>>>>
>>>> Try -march=x86-64 (or leave it out) you already have this info in the
>>>> triple.
>>>>
>>>> Then it should work (does for me with your example below).
>>> That may depend on what CPU is picks by default; Frank, if it does not work for you, try specifying a target CPU (-mcpu=whatever).
>>>
>>>   -Hal
>>>
>>>> On Oct 26, 2013, at 2:16 PM, Frank Winter <fwinter at jlab.org> wrote:
>>>>
>>>>> Hi Hal!
>>>>>
>>>>> I am using the 'x86_64' target. Below the complete module dump and
>>>>> here the command line:
>>>>>
>>>>> opt -march=x64-64 -loop-vectorize -debug-only=loop-vectorize -S
>>>>> test.ll
>>>>>
>>>>> Frank
>>>>>
>>>>>
>>>>> ; ModuleID = 'test.ll'
>>>>>
>>>>> target datalayout =
>>>>> "e-p:64:64:64-S128-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f16:16:16-f32:32:32-f64:64:64-f128:128:128-v64:64:64-v128:12
>>>>> 8:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
>>>>>
>>>>> target triple = "x86_64-unknown-linux-elf"
>>>>>
>>>>> define void @bar([8 x i8]* %arg_ptr) {
>>>>> entrypoint:
>>>>>   %0 = bitcast [8 x i8]* %arg_ptr to i32*
>>>>>   %1 = load i32* %0
>>>>>   %2 = getelementptr [8 x i8]* %arg_ptr, i32 1
>>>>>   %3 = bitcast [8 x i8]* %2 to i32*
>>>>>   %4 = load i32* %3
>>>>>   %5 = getelementptr [8 x i8]* %arg_ptr, i32 2
>>>>>   %6 = bitcast [8 x i8]* %5 to float**
>>>>>   %7 = load float** %6
>>>>>   %8 = getelementptr [8 x i8]* %arg_ptr, i32 3
>>>>>   %9 = bitcast [8 x i8]* %8 to float**
>>>>>   %10 = load float** %9
>>>>>   %11 = getelementptr [8 x i8]* %arg_ptr, i32 4
>>>>>   %12 = bitcast [8 x i8]* %11 to float**
>>>>>   %13 = load float** %12
>>>>>   br label %L0
>>>>>
>>>>> L0:                                               ; preds = %L0,
>>>>> %entrypoint
>>>>>   %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
>>>>>   %15 = getelementptr float* %10, i32 %14
>>>>>   %16 = load float* %15
>>>>>   %17 = getelementptr float* %13, i32 %14
>>>>>   %18 = load float* %17
>>>>>   %19 = fmul float %18, %16
>>>>>   %20 = getelementptr float* %7, i32 %14
>>>>>   store float %19, float* %20
>>>>>   %21 = add nsw i32 %14, 1
>>>>>   %22 = icmp sge i32 %21, %4
>>>>>   br i1 %22, label %L1, label %L0
>>>>>
>>>>> L1:                                               ; preds = %L0
>>>>>   ret void
>>>>> }
>>>>>
>>>>>
>>>>>
>>>>> On 26/10/13 15:08, Hal Finkel wrote:
>>>>>> ----- Original Message -----
>>>>>>> Hi Arnold,
>>>>>>>
>>>>>>> adding '-debug-only=loop-vectorize' to the command gives:
>>>>>>>
>>>>>>> LV: Checking a loop in "bar"
>>>>>>> LV: Found a loop: L0
>>>>>>> LV: Found an induction variable.
>>>>>>> LV: Found an unidentified write ptr:   %7 = load float** %6
>>>>>>> LV: Found an unidentified read ptr:   %10 = load float** %9
>>>>>>> LV: Found an unidentified read ptr:   %13 = load float** %12
>>>>>>> LV: We need to do 2 pointer comparisons.
>>>>>>> LV: We can't vectorize because we can't find the array bounds.
>>>>>>> LV: Can't vectorize due to memory conflicts
>>>>>>> LV: Not vectorizing.
>>>>>>>
>>>>>>> It can't find the loop bounds if we use the overflow version of
>>>>>>> add.
>>>>>>> That's a good point. I should mark this addition to not overflow.
>>>>>>>
>>>>>>> When using the non-overflow version I get:
>>>>>>>
>>>>>>> LV: Checking a loop in "bar"
>>>>>>> LV: Found a loop: L0
>>>>>>> LV: Found an induction variable.
>>>>>>> LV: Found an unidentified write ptr:   %7 = load float** %6
>>>>>>> LV: Found an unidentified read ptr:   %10 = load float** %9
>>>>>>> LV: Found an unidentified read ptr:   %13 = load float** %12
>>>>>>> LV: Found a runtime check ptr:  %20 = getelementptr float* %7,
>>>>>>> i32
>>>>>>> %14
>>>>>>> LV: Found a runtime check ptr:  %15 = getelementptr float* %10,
>>>>>>> i32
>>>>>>> %14
>>>>>>> LV: Found a runtime check ptr:  %17 = getelementptr float* %13,
>>>>>>> i32
>>>>>>> %14
>>>>>>> LV: We need to do 2 pointer comparisons.
>>>>>>> LV: We can perform a memory runtime check if needed.
>>>>>>> LV: We need a runtime memory check.
>>>>>>> LV: We can vectorize this loop (with a runtime bound check)!
>>>>>>> LV: Found trip count: 0
>>>>>>> LV: The Widest type: 32 bits.
>>>>>>> LV: The Widest register is: 32 bits.
>>>>>>> LV: Found an estimated cost of 0 for VF 1 For instruction:   %14
>>>>>>> =
>>>>>>> phi
>>>>>>> i32 [ %21, %L0 ], [ %1, %entrypoint ]
>>>>>>> LV: Found an estimated cost of 0 for VF 1 For instruction:   %15
>>>>>>> =
>>>>>>> getelementptr float* %10, i32 %14
>>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %16
>>>>>>> =
>>>>>>> load
>>>>>>> float* %15
>>>>>>> LV: Found an estimated cost of 0 for VF 1 For instruction:   %17
>>>>>>> =
>>>>>>> getelementptr float* %13, i32 %14
>>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %18
>>>>>>> =
>>>>>>> load
>>>>>>> float* %17
>>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %19
>>>>>>> =
>>>>>>> fmul
>>>>>>> float %18, %16
>>>>>>> LV: Found an estimated cost of 0 for VF 1 For instruction:   %20
>>>>>>> =
>>>>>>> getelementptr float* %7, i32 %14
>>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:
>>>>>>>    store
>>>>>>> float
>>>>>>> %19, float* %20
>>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %21
>>>>>>> =
>>>>>>> add
>>>>>>> nsw i32 %14, 1
>>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %22
>>>>>>> =
>>>>>>> icmp
>>>>>>> sge i32 %21, %4
>>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:   br
>>>>>>> i1
>>>>>>> %22,
>>>>>>> label %L1, label %L0
>>>>>>> LV: Scalar loop costs: 7.
>>>>>>> LV: Selecting VF = : 1.
>>>>>>> LV: The target has 8 vector registers
>>>>>>> LV(REG): Calculating max register usage:
>>>>>>> LV(REG): At #0 Interval # 0
>>>>>>> LV(REG): At #1 Interval # 1
>>>>>>> LV(REG): At #2 Interval # 2
>>>>>>> LV(REG): At #3 Interval # 2
>>>>>>> LV(REG): At #4 Interval # 3
>>>>>>> LV(REG): At #5 Interval # 3
>>>>>>> LV(REG): At #6 Interval # 2
>>>>>>> LV(REG): At #8 Interval # 1
>>>>>>> LV(REG): At #9 Interval # 1
>>>>>>> LV(REG): Found max usage: 3
>>>>>>> LV(REG): Found invariant usage: 5
>>>>>>> LV(REG): LoopSize: 11
>>>>>>> LV: Vectorization is possible but not beneficial.
>>>>>>> LV: Found a vectorizable loop (1) in saxpy_real.gvn.mod.ll
>>>>>>> LV: Unroll Factor is 1
>>>>>>>
>>>>>>> It's not beneficial? I didn't expect that. Do you have a
>>>>>>> descriptive
>>>>>>> explanation why it's not beneficial?
>>>>>> It looks like the vectorizer is not picking up a TTI
>>>>>> implementation from a target with vector registers (likely,
>>>>>> you're just seeing the basic cost model). For what target is
>>>>>> this?
>>>>>>
>>>>>>   -Hal
>>>>>>
>>>>>>> Frank
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 26/10/13 13:03, Arnold wrote:
>>>>>>>> Hi Frank,
>>>>>>>>
>>>>>>>> Sent from my iPhone
>>>>>>>>
>>>>>>>>> On Oct 26, 2013, at 10:03 AM, Frank Winter <fwinter at jlab.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> My function implements a simple loop:
>>>>>>>>>
>>>>>>>>> void bar( int start, int end, float* A, float* B, float* C)
>>>>>>>>> {
>>>>>>>>>      for (int i=start; i<end;++i)
>>>>>>>>>         A[i] = B[i] * C[i];
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> This looks pretty much like the standard example. However, I
>>>>>>>>> built
>>>>>>>>> the function
>>>>>>>>> with the IRBuilder, thus not coming from C and clang. Also I
>>>>>>>>> changed slightly
>>>>>>>>> the function's signature:
>>>>>>>>>
>>>>>>>>> define void @bar([8 x i8]* %arg_ptr) {
>>>>>>>>> entrypoint:
>>>>>>>>>    %0 = bitcast [8 x i8]* %arg_ptr to i32*
>>>>>>>>>    %1 = load i32* %0
>>>>>>>>>    %2 = getelementptr [8 x i8]* %arg_ptr, i32 1
>>>>>>>>>    %3 = bitcast [8 x i8]* %2 to i32*
>>>>>>>>>    %4 = load i32* %3
>>>>>>>>>    %5 = getelementptr [8 x i8]* %arg_ptr, i32 2
>>>>>>>>>    %6 = bitcast [8 x i8]* %5 to float**
>>>>>>>>>    %7 = load float** %6
>>>>>>>>>    %8 = getelementptr [8 x i8]* %arg_ptr, i32 3
>>>>>>>>>    %9 = bitcast [8 x i8]* %8 to float**
>>>>>>>>>    %10 = load float** %9
>>>>>>>>>    %11 = getelementptr [8 x i8]* %arg_ptr, i32 4
>>>>>>>>>    %12 = bitcast [8 x i8]* %11 to float**
>>>>>>>>>    %13 = load float** %12
>>>>>>>>>    br label %L0
>>>>>>>>>
>>>>>>>>> L0:                                               ; preds =
>>>>>>>>> %L0,
>>>>>>>>> %entrypoint
>>>>>>>>>    %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
>>>>>>>>>    %15 = getelementptr float* %10, i32 %14
>>>>>>>>>    %16 = load float* %15
>>>>>>>>>    %17 = getelementptr float* %13, i32 %14
>>>>>>>>>    %18 = load float* %17
>>>>>>>>>    %19 = fmul float %18, %16
>>>>>>>>>    %20 = getelementptr float* %7, i32 %14
>>>>>>>>>    store float %19, float* %20
>>>>>>>>>    %21 = add i32 %14, 1
>>>>>>>> Try
>>>>>>>> %21 = add nsw i32 %14, 1
>>>>>>>> instead for no-signed wrapping arithmetic.
>>>>>>>>
>>>>>>>> If that is not working please post the output of opt ...
>>>>>>>> -debug-only=loop-vectorize ...
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>    %22 = icmp sge i32 %21, %4
>>>>>>>>>    br i1 %22, label %L1, label %L0
>>>>>>>>>
>>>>>>>>> L1:                                               ; preds = %L0
>>>>>>>>>    ret void
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> As you can see, I use the phi instruction for the loop index. I
>>>>>>>>> notice
>>>>>>>>> that clang prefers stack allocation. So, I am not sure what's
>>>>>>>>> the
>>>>>>>>> problem that the loop vectorizer is not working here.
>>>>>>>>> I tried many things, like specifying an architecture with
>>>>>>>>> vector
>>>>>>>>> units, enforcing the vector width. No success.
>>>>>>>>>
>>>>>>>>> opt -march=x64-64 -loop-vectorize -force-vector-width=8 -S
>>>>>>>>> loop.ll
>>>>>>>>>
>>>>>>>>> The only explanation I have is the use of the phi instruction.
>>>>>>>>> Is
>>>>>>>>> this
>>>>>>>>> preventing to vectorize the loop?
>>>>>>>>>
>>>>>>>>> Frank
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> LLVM Developers mailing list
>>>>>>>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>>>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>>>>> _______________________________________________
>>>>>>> LLVM Developers mailing list
>>>>>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>>>>>

-- 
-----------------------------------------------------------
Dr Frank Winter
Scientific Computing Group
Jefferson Lab, 12000 Jefferson Ave, CEBAF Centre, Room F216
Newport News, VA 23606, USA
Tel: +1-757-269-6448
EMail: fwinter at jlab.org
-----------------------------------------------------------