[LLVMdev] Why is the loop vectorizer not working on my function?

Arnold Schwaighofer aschwaighofer at apple.com
Sat Oct 26 20:26:14 PDT 2013


Frank,

I am not familiar with the JITs.

Basically, you need to create a TargetMachine somehow. The TargetRegistry can be queried for the Target (for example, using TargetRegistry::lookupTarget) with the triple/march; the Target can then create a target machine for you.
The triple is typically picked up from the IR file (Module->getTargetTriple()) and can be overridden with -march/-mtriple.

I think the (old) JIT engine just uses the triple of the host LLVM was compiled on (sys::getProcessTriple(); see lib/ExecutionEngine/TargetSelect.cpp).

After you have the target you use it to create the target machine:

  const Target *TheTarget = … // look up the target
  TargetMachine *TM = TheTarget->createTargetMachine( …);

Using this target machine you can then add the target transform info pass to a pass manager by calling:

  TM->addAnalysisPasses(FunPassManager); // This will add the target’s “TargetTransformInfo” implementation if it has one.
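
Putting those pieces together, a minimal sketch might look like the following (this is only a sketch against the LLVM 3.3-era C++ API; error handling is omitted, and the names M and FPM are assumptions for your module pointer and pass manager):

```cpp
// Sketch only -- M is your llvm::Module*.
InitializeNativeTarget();

Triple TheTriple(M->getTargetTriple());
if (TheTriple.getTriple().empty())
  TheTriple.setTriple(sys::getProcessTriple()); // fall back to the host

std::string Err;
const Target *TheTarget =
    TargetRegistry::lookupTarget(TheTriple.getTriple(), Err);

// Pass the host CPU name so the right subtarget (SSE/AVX) is selected.
TargetMachine *TM = TheTarget->createTargetMachine(
    TheTriple.getTriple(), sys::getHostCPUName(), /*Features=*/"",
    TargetOptions());

FunctionPassManager FPM(M);
TM->addAnalysisPasses(FPM); // registers the target's TargetTransformInfo
```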



On Oct 26, 2013, at 8:42 PM, Frank Winter <fwinter at jlab.org> wrote:

> Hi Arnold,
> 
> thanks for the detailed setup. Still, I haven't figured out the right thing to do.
> 
> I would need only the native target since all generated code will execute on the JIT execution machine (right now, the old JIT interface). There is no need for other targets.
> 

I am afraid I am not familiar with the JIT infrastructure. I think the old JIT's native target is just the host that LLVM was compiled on.
 

> Maybe it would be good to ask specific questions:
> 
> How do I get the triple for the native target?

> How do I setup the target transform info pass?

> 
> I guess I need the target machine for all of this. 'opt' defines the GetTargetMachine() function which takes a Triple (my native triple). It calls GetTargetOptions(). This function, as implemented in 'opt', looks very static to me and surely doesn't fit all machines. Do I need this for my setup?

Yes, you need the target machine. There ought to be a JIT API that returns the target machine (since the JIT needs to create one when it emits machine code). But since I am not familiar with the JITs, I am afraid I can't help you further than this.
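
If it helps, I believe the old JIT's EngineBuilder can select the native target machine for you, so you could reuse that instead of constructing one by hand. A hedged sketch (the setMCPU call is my assumption for getting the host subtarget picked up):

```cpp
// Sketch only: reuse the target machine the JIT would build itself.
EngineBuilder EB(M);                    // M is your llvm::Module*
EB.setMCPU(sys::getHostCPUName());      // pick up SSE/AVX on the host
TargetMachine *TM = EB.selectTarget();  // native triple by default

FunctionPassManager FPM(M);
TM->addAnalysisPasses(FPM);             // adds the target's TTI passes
```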

> 
> Frank
> 
> 
> On 26/10/13 20:43, Arnold Schwaighofer wrote:
>> Hi Frank,
>> 
>> 
>> On Oct 26, 2013, at 6:29 PM, Frank Winter <fwinter at jlab.org> wrote:
>> 
>>> I would need this to work when calling the vectorizer through
>>> the function pass manager. Unfortunately I am having the same
>>> problem there:
>> 
>> I am not sure which function pass manager you are referring to here. I assume you create your own (you are not using opt but configuring your own pass manager)?
>> 
>> Here is what opt does when it sets up its pass pipeline.
>> 
>> You need to have your/the target add the target’s analysis passes:
>> 
>> opt.cpp:
>> 
>>   PassManager Passes;
>> 
>>   … // Add target library info and data layout.
>> 
>>   Triple ModuleTriple(M->getTargetTriple());
>>   TargetMachine *Machine = 0;
>>   if (ModuleTriple.getArch())
>>     Machine = GetTargetMachine(Triple(ModuleTriple));
>>   OwningPtr<TargetMachine> TM(Machine);
>> 
>>   // Add internal analysis passes from the target machine.
>>   if (TM.get())
>>     TM->addAnalysisPasses(Passes); // <<<
>>   TM->addAnalysisPasses(FPManager); // likewise for a function pass manager
>> 
>> Here is what the target does:
>> 
>> void X86TargetMachine::addAnalysisPasses(PassManagerBase &PM) {
>>   // Add first the target-independent BasicTTI pass, then our X86 pass. This
>>   // allows the X86 pass to delegate to the target independent layer when
>>   // appropriate.
>>   PM.add(createBasicTargetTransformInfoPass(this));
>>   PM.add(createX86TargetTransformInfoPass(this));
>> }
>> 
>> 
>> 
>>> LV: The Widest type: 32 bits.
>>> LV: The Widest register is: 32 bits.
>> This strongly suggests that no target has added a target transform info pass, so we default to NoTTI.
>> 
>> But it could also be that you don’t have the right subtarget (in which case you need to set the right CPU, “-mcpu” in opt, when the target machine is created):
>> 
>> unsigned X86TTI::getRegisterBitWidth(bool Vector) const {
>>   if (Vector) {
>>     if (ST->hasAVX()) return 256;
>>     if (ST->hasSSE1()) return 128;
>>     return 0;
>>   }
>> 
>>   if (ST->is64Bit())
>>     return 64;
>>   return 32;
>> 
>> }
>> 
>> 
>> 
>>> It's not picking the target information, although I tried with and
>>> without the target triple in the module
>>> 
>>> Any idea what could be wrong?
>>> 
>>> Frank
>>> 
>>> 
>>> On 26/10/13 15:54, Hal Finkel wrote:
>>>> ----- Original Message -----
>>>>>>>> LV: The Widest type: 32 bits.
>>>>>>>> LV: The Widest register is: 32 bits.
>>>>> Yep, we don’t pick up the right TTI.
>>>>> 
>>>>> Try -march=x86-64 (or leave it out; you already have this info in the
>>>>> triple).
>>>>> 
>>>>> Then it should work (does for me with your example below).
>>>> That may depend on what CPU it picks by default; Frank, if it does not work for you, try specifying a target CPU (-mcpu=whatever).
>>>> 
>>>>  -Hal
>>>> 
>>>>> On Oct 26, 2013, at 2:16 PM, Frank Winter <fwinter at jlab.org> wrote:
>>>>> 
>>>>>> Hi Hal!
>>>>>> 
>>>>>> I am using the 'x86_64' target. Below the complete module dump and
>>>>>> here the command line:
>>>>>> 
>>>>>> opt -march=x64-64 -loop-vectorize -debug-only=loop-vectorize -S
>>>>>> test.ll
>>>>>> 
>>>>>> Frank
>>>>>> 
>>>>>> 
>>>>>> ; ModuleID = 'test.ll'
>>>>>> 
>>>>>> target datalayout =
>>>>>> "e-p:64:64:64-S128-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f16:16:16-f32:32:32-f64:64:64-f128:128:128-v64:64:64-v128:12
>>>>>> 8:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
>>>>>> 
>>>>>> target triple = "x86_64-unknown-linux-elf"
>>>>>> 
>>>>>> define void @bar([8 x i8]* %arg_ptr) {
>>>>>> entrypoint:
>>>>>>  %0 = bitcast [8 x i8]* %arg_ptr to i32*
>>>>>>  %1 = load i32* %0
>>>>>>  %2 = getelementptr [8 x i8]* %arg_ptr, i32 1
>>>>>>  %3 = bitcast [8 x i8]* %2 to i32*
>>>>>>  %4 = load i32* %3
>>>>>>  %5 = getelementptr [8 x i8]* %arg_ptr, i32 2
>>>>>>  %6 = bitcast [8 x i8]* %5 to float**
>>>>>>  %7 = load float** %6
>>>>>>  %8 = getelementptr [8 x i8]* %arg_ptr, i32 3
>>>>>>  %9 = bitcast [8 x i8]* %8 to float**
>>>>>>  %10 = load float** %9
>>>>>>  %11 = getelementptr [8 x i8]* %arg_ptr, i32 4
>>>>>>  %12 = bitcast [8 x i8]* %11 to float**
>>>>>>  %13 = load float** %12
>>>>>>  br label %L0
>>>>>> 
>>>>>> L0:                                               ; preds = %L0,
>>>>>> %entrypoint
>>>>>>  %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
>>>>>>  %15 = getelementptr float* %10, i32 %14
>>>>>>  %16 = load float* %15
>>>>>>  %17 = getelementptr float* %13, i32 %14
>>>>>>  %18 = load float* %17
>>>>>>  %19 = fmul float %18, %16
>>>>>>  %20 = getelementptr float* %7, i32 %14
>>>>>>  store float %19, float* %20
>>>>>>  %21 = add nsw i32 %14, 1
>>>>>>  %22 = icmp sge i32 %21, %4
>>>>>>  br i1 %22, label %L1, label %L0
>>>>>> 
>>>>>> L1:                                               ; preds = %L0
>>>>>>  ret void
>>>>>> }
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 26/10/13 15:08, Hal Finkel wrote:
>>>>>>> ----- Original Message -----
>>>>>>>> Hi Arnold,
>>>>>>>> 
>>>>>>>> adding '-debug-only=loop-vectorize' to the command gives:
>>>>>>>> 
>>>>>>>> LV: Checking a loop in "bar"
>>>>>>>> LV: Found a loop: L0
>>>>>>>> LV: Found an induction variable.
>>>>>>>> LV: Found an unidentified write ptr:   %7 = load float** %6
>>>>>>>> LV: Found an unidentified read ptr:   %10 = load float** %9
>>>>>>>> LV: Found an unidentified read ptr:   %13 = load float** %12
>>>>>>>> LV: We need to do 2 pointer comparisons.
>>>>>>>> LV: We can't vectorize because we can't find the array bounds.
>>>>>>>> LV: Can't vectorize due to memory conflicts
>>>>>>>> LV: Not vectorizing.
>>>>>>>> 
>>>>>>>> It can't find the loop bounds if we use the overflowing version of
>>>>>>>> add. That's a good point. I should mark this addition as
>>>>>>>> non-overflowing.
>>>>>>>> 
>>>>>>>> When using the non-overflow version I get:
>>>>>>>> 
>>>>>>>> LV: Checking a loop in "bar"
>>>>>>>> LV: Found a loop: L0
>>>>>>>> LV: Found an induction variable.
>>>>>>>> LV: Found an unidentified write ptr:   %7 = load float** %6
>>>>>>>> LV: Found an unidentified read ptr:   %10 = load float** %9
>>>>>>>> LV: Found an unidentified read ptr:   %13 = load float** %12
>>>>>>>> LV: Found a runtime check ptr:  %20 = getelementptr float* %7,
>>>>>>>> i32
>>>>>>>> %14
>>>>>>>> LV: Found a runtime check ptr:  %15 = getelementptr float* %10,
>>>>>>>> i32
>>>>>>>> %14
>>>>>>>> LV: Found a runtime check ptr:  %17 = getelementptr float* %13,
>>>>>>>> i32
>>>>>>>> %14
>>>>>>>> LV: We need to do 2 pointer comparisons.
>>>>>>>> LV: We can perform a memory runtime check if needed.
>>>>>>>> LV: We need a runtime memory check.
>>>>>>>> LV: We can vectorize this loop (with a runtime bound check)!
>>>>>>>> LV: Found trip count: 0
>>>>>>>> LV: The Widest type: 32 bits.
>>>>>>>> LV: The Widest register is: 32 bits.
>>>>>>>> LV: Found an estimated cost of 0 for VF 1 For instruction:   %14
>>>>>>>> =
>>>>>>>> phi
>>>>>>>> i32 [ %21, %L0 ], [ %1, %entrypoint ]
>>>>>>>> LV: Found an estimated cost of 0 for VF 1 For instruction:   %15
>>>>>>>> =
>>>>>>>> getelementptr float* %10, i32 %14
>>>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %16
>>>>>>>> =
>>>>>>>> load
>>>>>>>> float* %15
>>>>>>>> LV: Found an estimated cost of 0 for VF 1 For instruction:   %17
>>>>>>>> =
>>>>>>>> getelementptr float* %13, i32 %14
>>>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %18
>>>>>>>> =
>>>>>>>> load
>>>>>>>> float* %17
>>>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %19
>>>>>>>> =
>>>>>>>> fmul
>>>>>>>> float %18, %16
>>>>>>>> LV: Found an estimated cost of 0 for VF 1 For instruction:   %20
>>>>>>>> =
>>>>>>>> getelementptr float* %7, i32 %14
>>>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:
>>>>>>>>   store
>>>>>>>> float
>>>>>>>> %19, float* %20
>>>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %21
>>>>>>>> =
>>>>>>>> add
>>>>>>>> nsw i32 %14, 1
>>>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %22
>>>>>>>> =
>>>>>>>> icmp
>>>>>>>> sge i32 %21, %4
>>>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:   br
>>>>>>>> i1
>>>>>>>> %22,
>>>>>>>> label %L1, label %L0
>>>>>>>> LV: Scalar loop costs: 7.
>>>>>>>> LV: Selecting VF = : 1.
>>>>>>>> LV: The target has 8 vector registers
>>>>>>>> LV(REG): Calculating max register usage:
>>>>>>>> LV(REG): At #0 Interval # 0
>>>>>>>> LV(REG): At #1 Interval # 1
>>>>>>>> LV(REG): At #2 Interval # 2
>>>>>>>> LV(REG): At #3 Interval # 2
>>>>>>>> LV(REG): At #4 Interval # 3
>>>>>>>> LV(REG): At #5 Interval # 3
>>>>>>>> LV(REG): At #6 Interval # 2
>>>>>>>> LV(REG): At #8 Interval # 1
>>>>>>>> LV(REG): At #9 Interval # 1
>>>>>>>> LV(REG): Found max usage: 3
>>>>>>>> LV(REG): Found invariant usage: 5
>>>>>>>> LV(REG): LoopSize: 11
>>>>>>>> LV: Vectorization is possible but not beneficial.
>>>>>>>> LV: Found a vectorizable loop (1) in saxpy_real.gvn.mod.ll
>>>>>>>> LV: Unroll Factor is 1
>>>>>>>> 
>>>>>>>> It's not beneficial? I didn't expect that. Do you have an
>>>>>>>> explanation of why it's not beneficial?
>>>>>>> It looks like the vectorizer is not picking up a TTI
>>>>>>> implementation from a target with vector registers (likely,
>>>>>>> you're just seeing the basic cost model). For what target is
>>>>>>> this?
>>>>>>> 
>>>>>>>  -Hal
>>>>>>> 
>>>>>>>> Frank
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 26/10/13 13:03, Arnold wrote:
>>>>>>>>> Hi Frank,
>>>>>>>>> 
>>>>>>>>> Sent from my iPhone
>>>>>>>>> 
>>>>>>>>>> On Oct 26, 2013, at 10:03 AM, Frank Winter <fwinter at jlab.org>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> My function implements a simple loop:
>>>>>>>>>> 
>>>>>>>>>> void bar( int start, int end, float* A, float* B, float* C)
>>>>>>>>>> {
>>>>>>>>>>     for (int i=start; i<end;++i)
>>>>>>>>>>        A[i] = B[i] * C[i];
>>>>>>>>>> }
>>>>>>>>>> 
>>>>>>>>>> This looks pretty much like the standard example. However, I
>>>>>>>>>> built
>>>>>>>>>> the function
>>>>>>>>>> with the IRBuilder, thus not coming from C and clang. Also I
>>>>>>>>>> changed slightly
>>>>>>>>>> the function's signature:
>>>>>>>>>> 
>>>>>>>>>> define void @bar([8 x i8]* %arg_ptr) {
>>>>>>>>>> entrypoint:
>>>>>>>>>>   %0 = bitcast [8 x i8]* %arg_ptr to i32*
>>>>>>>>>>   %1 = load i32* %0
>>>>>>>>>>   %2 = getelementptr [8 x i8]* %arg_ptr, i32 1
>>>>>>>>>>   %3 = bitcast [8 x i8]* %2 to i32*
>>>>>>>>>>   %4 = load i32* %3
>>>>>>>>>>   %5 = getelementptr [8 x i8]* %arg_ptr, i32 2
>>>>>>>>>>   %6 = bitcast [8 x i8]* %5 to float**
>>>>>>>>>>   %7 = load float** %6
>>>>>>>>>>   %8 = getelementptr [8 x i8]* %arg_ptr, i32 3
>>>>>>>>>>   %9 = bitcast [8 x i8]* %8 to float**
>>>>>>>>>>   %10 = load float** %9
>>>>>>>>>>   %11 = getelementptr [8 x i8]* %arg_ptr, i32 4
>>>>>>>>>>   %12 = bitcast [8 x i8]* %11 to float**
>>>>>>>>>>   %13 = load float** %12
>>>>>>>>>>   br label %L0
>>>>>>>>>> 
>>>>>>>>>> L0:                                               ; preds =
>>>>>>>>>> %L0,
>>>>>>>>>> %entrypoint
>>>>>>>>>>   %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
>>>>>>>>>>   %15 = getelementptr float* %10, i32 %14
>>>>>>>>>>   %16 = load float* %15
>>>>>>>>>>   %17 = getelementptr float* %13, i32 %14
>>>>>>>>>>   %18 = load float* %17
>>>>>>>>>>   %19 = fmul float %18, %16
>>>>>>>>>>   %20 = getelementptr float* %7, i32 %14
>>>>>>>>>>   store float %19, float* %20
>>>>>>>>>>   %21 = add i32 %14, 1
>>>>>>>>> Try
>>>>>>>>> %21 = add nsw i32 %14, 1
>>>>>>>>> instead, for no-signed-wrap arithmetic.
>>>>>>>>> 
>>>>>>>>> If that is not working please post the output of opt ...
>>>>>>>>> -debug-only=loop-vectorize ...
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>>   %22 = icmp sge i32 %21, %4
>>>>>>>>>>   br i1 %22, label %L1, label %L0
>>>>>>>>>> 
>>>>>>>>>> L1:                                               ; preds = %L0
>>>>>>>>>>   ret void
>>>>>>>>>> }
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> As you can see, I use the phi instruction for the loop index. I
>>>>>>>>>> notice that clang prefers stack allocation, so I am not sure why
>>>>>>>>>> the loop vectorizer is not working here.
>>>>>>>>>> I tried many things, like specifying an architecture with
>>>>>>>>>> vector
>>>>>>>>>> units, enforcing the vector width. No success.
>>>>>>>>>> 
>>>>>>>>>> opt -march=x64-64 -loop-vectorize -force-vector-width=8 -S
>>>>>>>>>> loop.ll
>>>>>>>>>> 
>>>>>>>>>> The only explanation I have is the use of the phi instruction.
>>>>>>>>>> Is this preventing the loop from being vectorized?
>>>>>>>>>> 
>>>>>>>>>> Frank
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> _______________________________________________
>>>>>>>>>> LLVM Developers mailing list
>>>>>>>>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>>>>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>>>>>> 
> 
> 
> -- 
> -----------------------------------------------------------
> Dr Frank Winter
> Scientific Computing Group
> Jefferson Lab, 12000 Jefferson Ave, CEBAF Centre, Room F216
> Newport News, VA 23606, USA
> Tel: +1-757-269-6448
> EMail: fwinter at jlab.org
> -----------------------------------------------------------
> 




