[LLVMdev] Vectorizer using Instruction, not opcodes
Arnold Schwaighofer
aschwaighofer at apple.com
Mon Feb 4 12:21:04 PST 2013
The loop vectorized does not estimate the cost of vectorization by looking at the IR you list below. It does not vectorize and then run the CostAnalysis pass. It estimates the cost itself before it even performs the vectorization.
The way it works is that it looks at all the scalar instructions and asks: What is the cost if I execute the scalar instruction as a vector instruction. Therefore, it will not consider any of the insert/extractelement instructions you see below (note, that they will all go away any way).
If you run clang++ -O3 -mllvm -debug-only=loop-vectorize you will see which instructions it considers.
E.g:
LV: Found an estimated cost of 0 for VF 1 For instruction: %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
LV: Found an estimated cost of 0 for VF 1 For instruction: %arrayidx = getelementptr inbounds [256 x i32]* @b, i64 0, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 1 For instruction: %0 = load i32* %arrayidx, align 4, !tbaa !0
LV: Found an estimated cost of 0 for VF 1 For instruction: %arrayidx2 = getelementptr inbounds [256 x i32]* @c, i64 0, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 1 For instruction: %1 = load i32* %arrayidx2, align 4, !tbaa !0
LV: Found an estimated cost of 1 for VF 1 For instruction: %mul = mul nsw i32 %1, %0
LV: Found an estimated cost of 0 for VF 1 For instruction: %arrayidx4 = getelementptr inbounds [256 x i32]* @a, i64 0, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 1 For instruction: store i32 %mul, i32* %arrayidx4, align 4, !tbaa !0
LV: Found an estimated cost of 1 for VF 1 For instruction: %indvars.iv.next = add i64 %indvars.iv, 1
LV: Found an estimated cost of 0 for VF 1 For instruction: %lftr.wideiv = trunc i64 %indvars.iv.next to i32
LV: Found an estimated cost of 1 for VF 1 For instruction: %exitcond = icmp ne i32 %lftr.wideiv, 256
LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %exitcond, label %for.body, label %for.end
LV: Scalar loop costs: 6.
LV: Found an estimated cost of 0 for VF 2 For instruction: %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
LV: Found an estimated cost of 0 for VF 2 For instruction: %arrayidx = getelementptr inbounds [256 x i32]* @b, i64 0, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 2 For instruction: %0 = load i32* %arrayidx, align 4, !tbaa !0
LV: Found an estimated cost of 0 for VF 2 For instruction: %arrayidx2 = getelementptr inbounds [256 x i32]* @c, i64 0, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 2 For instruction: %1 = load i32* %arrayidx2, align 4, !tbaa !0
LV: Found an estimated cost of 2 for VF 2 For instruction: %mul = mul nsw i32 %1, %0
LV: Found an estimated cost of 0 for VF 2 For instruction: %arrayidx4 = getelementptr inbounds [256 x i32]* @a, i64 0, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 2 For instruction: store i32 %mul, i32* %arrayidx4, align 4, !tbaa !0
LV: Found an estimated cost of 1 for VF 2 For instruction: %indvars.iv.next = add i64 %indvars.iv, 1
LV: Found an estimated cost of 0 for VF 2 For instruction: %lftr.wideiv = trunc i64 %indvars.iv.next to i32
LV: Found an estimated cost of 1 for VF 2 For instruction: %exitcond = icmp ne i32 %lftr.wideiv, 256
LV: Found an estimated cost of 0 for VF 2 For instruction: br i1 %exitcond, label %for.body, label %for.end
The corresponding code is in LoopVectorizationCostModel::getInstructionCost(). Basically, it just takes the scalar instruction, its type, creates a vector type and asks: How much would an instruction of that vector type cost (Slight simplification).
On Feb 4, 2013, at 1:38 PM, Renato Golin <renato.golin at linaro.org> wrote:
> The example below is not stopping the vectorizer, but it does add a lot of costs where there are none...
>
> ** C code:
>
> int direct (int k) {
> int i;
> int a[256], b[256], c[256];
>
> for (i=0; i<256; i++){
> a[i] = b[i] * c[i];
> }
> return a[k];
> }
>
> ** ASM vectorized result:
>
> adr r5, .LCPI0_0
> vdup.32 q9, r1
> vld1.64 {d16, d17}, [r5, :128]
> add r1, r1, #4
> vadd.i32 q8, q9, q8
> cmp r3, r1
> vmov.32 r5, d16[0]
> add r6, lr, r5, lsl #2
> add r7, r2, r5, lsl #2
> vld1.32 {d16, d17}, [r6]
> add r5, r4, r5, lsl #2
> vld1.32 {d18, d19}, [r7]
> vmul.i32 q8, q9, q8
> vst1.32 {d16, d17}, [r5]
> bne .LBB0_2
>
> ** Vectorized IR (just the loop):
>
> vector.body: ; preds = %vector.body, %vector.ph
> %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
> %broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
> %broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
> %induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
> %0 = extractelement <4 x i32> %induction, i32 0
> %1 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %0
> %2 = insertelement <4 x i32*> undef, i32* %1, i32 0
> %3 = extractelement <4 x i32> %induction, i32 1
> %4 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %3
> %5 = insertelement <4 x i32*> %2, i32* %4, i32 1
> %6 = extractelement <4 x i32> %induction, i32 2
> %7 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %6
> %8 = insertelement <4 x i32*> %5, i32* %7, i32 2
> %9 = extractelement <4 x i32> %induction, i32 3
> %10 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %9
> %11 = insertelement <4 x i32*> %8, i32* %10, i32 3
> %12 = extractelement <4 x i32> %induction, i32 0
> %13 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %12
> %14 = getelementptr i32* %13, i32 0
> %15 = bitcast i32* %14 to <4 x i32>*
> %wide.load = load <4 x i32>* %15, align 4
> %16 = extractelement <4 x i32> %induction, i32 0
> %17 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %16
> %18 = insertelement <4 x i32*> undef, i32* %17, i32 0
> %19 = extractelement <4 x i32> %induction, i32 1
> %20 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %19
> %21 = insertelement <4 x i32*> %18, i32* %20, i32 1
> %22 = extractelement <4 x i32> %induction, i32 2
> %23 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %22
> %24 = insertelement <4 x i32*> %21, i32* %23, i32 2
> %25 = extractelement <4 x i32> %induction, i32 3
> %26 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %25
> %27 = insertelement <4 x i32*> %24, i32* %26, i32 3
> %28 = extractelement <4 x i32> %induction, i32 0
> %29 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %28
> %30 = getelementptr i32* %29, i32 0
> %31 = bitcast i32* %30 to <4 x i32>*
> %wide.load3 = load <4 x i32>* %31, align 4
> %32 = mul nsw <4 x i32> %wide.load3, %wide.load
> %33 = extractelement <4 x i32> %induction, i32 0
> %34 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %33
> %35 = insertelement <4 x i32*> undef, i32* %34, i32 0
> %36 = extractelement <4 x i32> %induction, i32 1
> %37 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %36
> %38 = insertelement <4 x i32*> %35, i32* %37, i32 1
> %39 = extractelement <4 x i32> %induction, i32 2
> %40 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %39
> %41 = insertelement <4 x i32*> %38, i32* %40, i32 2
> %42 = extractelement <4 x i32> %induction, i32 3
> %43 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %42
> %44 = insertelement <4 x i32*> %41, i32* %43, i32 3
> %45 = extractelement <4 x i32> %induction, i32 0
> %46 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %45
> %47 = getelementptr i32* %46, i32 0
> %48 = bitcast i32* %47 to <4 x i32>*
> store <4 x i32> %32, <4 x i32>* %48, align 4
> %49 = add nsw <4 x i32> %induction, <i32 1, i32 1, i32 1, i32 1>
> %50 = icmp eq <4 x i32> %49, <i32 256, i32 256, i32 256, i32 256>
> %index.next = add i32 %index, 4
> %51 = icmp eq i32 %index.next, %end.idx.rnd.down
> br i1 %51, label %middle.block, label %vector.body
>
> ** Cost analysis:
>
> Cost Model: Found an estimated cost of 1 for instruction: %induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
> Cost Model: Found an estimated cost of 1 for instruction: %0 = extractelement <4 x i32> %induction, i32 0
> Cost Model: Unknown cost for instruction: %1 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %0
> Cost Model: Found an estimated cost of 1 for instruction: %2 = insertelement <4 x i32*> undef, i32* %1, i32 0
> Cost Model: Found an estimated cost of 1 for instruction: %3 = extractelement <4 x i32> %induction, i32 1
> Cost Model: Unknown cost for instruction: %4 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %3
> Cost Model: Found an estimated cost of 1 for instruction: %5 = insertelement <4 x i32*> %2, i32* %4, i32 1
> Cost Model: Found an estimated cost of 1 for instruction: %6 = extractelement <4 x i32> %induction, i32 2
> Cost Model: Unknown cost for instruction: %7 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %6
> Cost Model: Found an estimated cost of 1 for instruction: %8 = insertelement <4 x i32*> %5, i32* %7, i32 2
> Cost Model: Found an estimated cost of 1 for instruction: %9 = extractelement <4 x i32> %induction, i32 3
> Cost Model: Unknown cost for instruction: %10 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %9
> Cost Model: Found an estimated cost of 1 for instruction: %11 = insertelement <4 x i32*> %8, i32* %10, i32 3
> Cost Model: Found an estimated cost of 1 for instruction: %12 = extractelement <4 x i32> %induction, i32 0
> Cost Model: Unknown cost for instruction: %13 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %12
> Cost Model: Unknown cost for instruction: %14 = getelementptr i32* %13, i32 0
>
> and so on...
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130204/6ce62297/attachment.html>
More information about the llvm-dev
mailing list