[LLVMdev] Vectorizer using Instruction, not opcodes

Mon Feb 4 11:38:03 PST 2013

On 4 February 2013 18:25, Arnold Schwaighofer <aschwaighofer at apple.com>wrote:

> For cases where this approach breaks really badly we could consider adding
> a specialized api or parameters (like the type of a user/use). But we
> should do so only as a last resort and backed by actual code that would
> benefit from doing so.
>

Very sensible, more or less what I had in mind. I think we could go one
step further and just get some high-level decisions like: is this cast
associated with an (any) arithmetic operation? It won't be perfect, but it
adds a bit more information at very reduced price. Though, this would
require us to pass the Instruction, not the Opcode.

Do you have an example where this happening?

The example below is not stopping the vectorizer, but it does add a lot of
costs where there are none...

** C code:

 int direct (int k) {
  int i;
  int a[256], b[256], c[256];

  for (i=0; i<256; i++){
    a[i] = b[i] * c[i];
  }
  return a[k];
}

** ASM vectorized result:

        adr     r5, .LCPI0_0
        vdup.32 q9, r1
        vld1.64 {d16, d17}, [r5, :128]
        add     r1, r1, #4
        vadd.i32        q8, q9, q8
        cmp     r3, r1
        vmov.32 r5, d16[0]
        add     r6, lr, r5, lsl #2
        add     r7, r2, r5, lsl #2
        vld1.32 {d16, d17}, [r6]
        add     r5, r4, r5, lsl #2
        vld1.32 {d18, d19}, [r7]
        vmul.i32        q8, q9, q8
        vst1.32 {d16, d17}, [r5]
        bne     .LBB0_2

** Vectorized IR (just the loop):

vector.body:                                      ; preds = %vector.body, %
vector.ph
  %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
  %broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x
i32> undef, <4 x i32> zeroinitializer
  %induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
  %0 = extractelement <4 x i32> %induction, i32 0
  %1 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %0
  %2 = insertelement <4 x i32*> undef, i32* %1, i32 0
  %3 = extractelement <4 x i32> %induction, i32 1
  %4 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %3
  %5 = insertelement <4 x i32*> %2, i32* %4, i32 1
  %6 = extractelement <4 x i32> %induction, i32 2
  %7 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %6
  %8 = insertelement <4 x i32*> %5, i32* %7, i32 2
  %9 = extractelement <4 x i32> %induction, i32 3
  %10 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %9
  %11 = insertelement <4 x i32*> %8, i32* %10, i32 3
  %12 = extractelement <4 x i32> %induction, i32 0
  %13 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %12
  %14 = getelementptr i32* %13, i32 0
  %15 = bitcast i32* %14 to <4 x i32>*
  %wide.load = load <4 x i32>* %15, align 4
  %16 = extractelement <4 x i32> %induction, i32 0
  %17 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %16
  %18 = insertelement <4 x i32*> undef, i32* %17, i32 0
  %19 = extractelement <4 x i32> %induction, i32 1
  %20 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %19
  %21 = insertelement <4 x i32*> %18, i32* %20, i32 1
  %22 = extractelement <4 x i32> %induction, i32 2
  %23 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %22
  %24 = insertelement <4 x i32*> %21, i32* %23, i32 2
  %25 = extractelement <4 x i32> %induction, i32 3
  %26 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %25
  %27 = insertelement <4 x i32*> %24, i32* %26, i32 3
  %28 = extractelement <4 x i32> %induction, i32 0
  %29 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %28
  %30 = getelementptr i32* %29, i32 0
  %31 = bitcast i32* %30 to <4 x i32>*
  %wide.load3 = load <4 x i32>* %31, align 4
  %32 = mul nsw <4 x i32> %wide.load3, %wide.load
  %33 = extractelement <4 x i32> %induction, i32 0
  %34 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %33
  %35 = insertelement <4 x i32*> undef, i32* %34, i32 0
  %36 = extractelement <4 x i32> %induction, i32 1
  %37 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %36
  %38 = insertelement <4 x i32*> %35, i32* %37, i32 1
  %39 = extractelement <4 x i32> %induction, i32 2
  %40 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %39
  %41 = insertelement <4 x i32*> %38, i32* %40, i32 2
  %42 = extractelement <4 x i32> %induction, i32 3
  %43 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %42
  %44 = insertelement <4 x i32*> %41, i32* %43, i32 3
  %45 = extractelement <4 x i32> %induction, i32 0
  %46 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %45
  %47 = getelementptr i32* %46, i32 0
  %48 = bitcast i32* %47 to <4 x i32>*
  store <4 x i32> %32, <4 x i32>* %48, align 4
  %49 = add nsw <4 x i32> %induction, <i32 1, i32 1, i32 1, i32 1>
  %50 = icmp eq <4 x i32> %49, <i32 256, i32 256, i32 256, i32 256>
  %index.next = add i32 %index, 4
  %51 = icmp eq i32 %index.next, %end.idx.rnd.down
  br i1 %51, label %middle.block, label %vector.body

** Cost analysis:

Cost Model: Found an estimated cost of 1 for instruction:   %induction =
add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
Cost Model: Found an estimated cost of 1 for instruction:   %0 =
extractelement <4 x i32> %induction, i32 0
Cost Model: Unknown cost for instruction:   %1 = getelementptr inbounds
[256 x i32]* %b, i32 0, i32 %0
Cost Model: Found an estimated cost of 1 for instruction:   %2 =
insertelement <4 x i32*> undef, i32* %1, i32 0
Cost Model: Found an estimated cost of 1 for instruction:   %3 =
extractelement <4 x i32> %induction, i32 1
Cost Model: Unknown cost for instruction:   %4 = getelementptr inbounds
[256 x i32]* %b, i32 0, i32 %3
Cost Model: Found an estimated cost of 1 for instruction:   %5 =
insertelement <4 x i32*> %2, i32* %4, i32 1
Cost Model: Found an estimated cost of 1 for instruction:   %6 =
extractelement <4 x i32> %induction, i32 2
Cost Model: Unknown cost for instruction:   %7 = getelementptr inbounds
[256 x i32]* %b, i32 0, i32 %6
Cost Model: Found an estimated cost of 1 for instruction:   %8 =
insertelement <4 x i32*> %5, i32* %7, i32 2
Cost Model: Found an estimated cost of 1 for instruction:   %9 =
extractelement <4 x i32> %induction, i32 3
Cost Model: Unknown cost for instruction:   %10 = getelementptr inbounds
[256 x i32]* %b, i32 0, i32 %9
Cost Model: Found an estimated cost of 1 for instruction:   %11 =
insertelement <4 x i32*> %8, i32* %10, i32 3
Cost Model: Found an estimated cost of 1 for instruction:   %12 =
extractelement <4 x i32> %induction, i32 0
Cost Model: Unknown cost for instruction:   %13 = getelementptr inbounds
[256 x i32]* %b, i32 0, i32 %12
Cost Model: Unknown cost for instruction:   %14 = getelementptr i32* %13,
i32 0

and so on...
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130204/ca2977c3/attachment.html>