[LLVMdev] Vectorizer using Instruction, not opcodes

Hal Finkel hfinkel at anl.gov
Mon Feb 4 12:09:53 PST 2013


----- Original Message -----
> From: "Renato Golin" <renato.golin at linaro.org>
> To: "Arnold Schwaighofer" <aschwaighofer at apple.com>
> Cc: "LLVM Dev" <llvmdev at cs.uiuc.edu>, "Nadav Rotem" <nrotem at apple.com>, "Hal Finkel" <hfinkel at anl.gov>
> Sent: Monday, February 4, 2013 1:38:03 PM
> Subject: Re: Vectorizer using Instruction, not opcodes
> 
> 
> On 4 February 2013 18:25, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:
> 
> For cases where this approach breaks really badly we could consider
> adding a specialized API or parameters (like the type of a
> user/use). But we should do so only as a last resort and backed by
> actual code that would benefit from doing so.
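
For illustration, the kind of extra parameter being suggested might look roughly like this; the function names, the UserTy argument, and castFoldsIntoUser() are all invented for this sketch and are not part of the existing cost-model interface:

#include "llvm/IR/Type.h"

// Hypothetical target predicate: does a SrcTy->DstTy cast fold into a user of
// type UserTy (e.g. a sext feeding a widening multiply)? A real target would
// pattern-match here; this stub is conservative.
static bool castFoldsIntoUser(unsigned CastOpcode, llvm::Type *DstTy,
                              llvm::Type *SrcTy, llvm::Type *UserTy) {
  return false;
}

// Sketch of a cost query that also sees the type of the cast's user.
// TypeBasedCost stands in for whatever the type-only query would return.
static unsigned getCastCostWithUser(unsigned CastOpcode, llvm::Type *DstTy,
                                    llvm::Type *SrcTy, llvm::Type *UserTy,
                                    unsigned TypeBasedCost) {
  if (UserTy && castFoldsIntoUser(CastOpcode, DstTy, SrcTy, UserTy))
    return 0;           // the cast is free when folded into its user (e.g. vmull on ARM)
  return TypeBasedCost; // otherwise, fall back to the type-only answer
}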
> 
> Very sensible, more or less what I had in mind. I think we could go
> one step further and just make some high-level decisions like: is
> this cast associated with any arithmetic operation? It won't be
> perfect, but it adds a bit more information at a very reduced price.
> Though, this would require us to pass the Instruction, not the
> Opcode.
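
A rough sketch of that high-level check, assuming the cost query is handed the Instruction; the helper name is invented, and the users() range is the current spelling (an early-2013 tree would iterate with use_begin()/use_end()):

#include "llvm/IR/InstrTypes.h"
using namespace llvm;

// Coarse question only: does this cast feed (any) arithmetic? If so, a target
// could treat it as folded into that operation rather than charging for it.
static bool castFeedsArithmetic(const CastInst *CI) {
  for (const User *U : CI->users())
    if (isa<BinaryOperator>(U)) // e.g. a sext feeding a mul or add
      return true;
  return false;
}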
> 
We probably don't want to end up in a situation where we're speculatively creating instructions just to test their cost with the cost model. For simple things, passing an existing instruction and its widening factor will work, but I worry that won't be sufficiently general. Would a tree of <opcode, return-type, input-type> tuples work? How deep would it need to be? We also need to decide on a self-consistent way of dealing with cost folding: if an instruction can be folded with its use, do we register the cost savings against the use or against the operand? We'd also need to pass hasOneUse() information.
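
For concreteness, one possible shape for such a node; every name below is invented for illustration:

#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Type.h"
using namespace llvm;

// A small tree of these would give the target enough context (e.g. a mul fed
// by two sexts) without the vectorizer speculatively creating instructions.
struct HypotheticalCostNode {
  unsigned Opcode;                                  // e.g. Instruction::Mul
  Type *RetTy;                                      // widened result type
  SmallVector<Type *, 2> ArgTys;                    // widened operand types
  SmallVector<HypotheticalCostNode *, 2> Operands;  // children, depth-limited
  bool HasOneUse;                                   // who gets the folding credit
};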

 -Hal

> 
> Do you have an example where this is happening?
> 
> 
> The example below is not stopping the vectorizer, but it does add a
> lot of costs where there are none...
> 
> 
> ** C code:
> 
> 
> int direct(int k) {
>   int i;
>   int a[256], b[256], c[256];
> 
>   for (i = 0; i < 256; i++) {
>     a[i] = b[i] * c[i];
>   }
>   return a[k];
> }
> 
> 
> ** ASM vectorized result:
> 
> adr r5, .LCPI0_0
> vdup.32 q9, r1
> vld1.64 {d16, d17}, [r5, :128]
> add r1, r1, #4
> vadd.i32 q8, q9, q8
> cmp r3, r1
> vmov.32 r5, d16[0]
> add r6, lr, r5, lsl #2
> add r7, r2, r5, lsl #2
> vld1.32 {d16, d17}, [r6]
> add r5, r4, r5, lsl #2
> vld1.32 {d18, d19}, [r7]
> vmul.i32 q8, q9, q8
> vst1.32 {d16, d17}, [r5]
> bne .LBB0_2
> 
> 
> ** Vectorized IR (just the loop):
> 
> vector.body: ; preds = %vector.body, %vector.ph
> %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
> %broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
> %broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
> %induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
> %0 = extractelement <4 x i32> %induction, i32 0
> %1 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %0
> %2 = insertelement <4 x i32*> undef, i32* %1, i32 0
> %3 = extractelement <4 x i32> %induction, i32 1
> %4 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %3
> %5 = insertelement <4 x i32*> %2, i32* %4, i32 1
> %6 = extractelement <4 x i32> %induction, i32 2
> %7 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %6
> %8 = insertelement <4 x i32*> %5, i32* %7, i32 2
> %9 = extractelement <4 x i32> %induction, i32 3
> %10 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %9
> %11 = insertelement <4 x i32*> %8, i32* %10, i32 3
> %12 = extractelement <4 x i32> %induction, i32 0
> %13 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %12
> %14 = getelementptr i32* %13, i32 0
> %15 = bitcast i32* %14 to <4 x i32>*
> %wide.load = load <4 x i32>* %15, align 4
> 
> %16 = extractelement <4 x i32> %induction, i32 0
> %17 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %16
> %18 = insertelement <4 x i32*> undef, i32* %17, i32 0
> %19 = extractelement <4 x i32> %induction, i32 1
> %20 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %19
> %21 = insertelement <4 x i32*> %18, i32* %20, i32 1
> %22 = extractelement <4 x i32> %induction, i32 2
> %23 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %22
> %24 = insertelement <4 x i32*> %21, i32* %23, i32 2
> %25 = extractelement <4 x i32> %induction, i32 3
> %26 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %25
> %27 = insertelement <4 x i32*> %24, i32* %26, i32 3
> %28 = extractelement <4 x i32> %induction, i32 0
> %29 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %28
> %30 = getelementptr i32* %29, i32 0
> %31 = bitcast i32* %30 to <4 x i32>*
> %wide.load3 = load <4 x i32>* %31, align 4
> %32 = mul nsw <4 x i32> %wide.load3, %wide.load
> %33 = extractelement <4 x i32> %induction, i32 0
> %34 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %33
> %35 = insertelement <4 x i32*> undef, i32* %34, i32 0
> %36 = extractelement <4 x i32> %induction, i32 1
> %37 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %36
> %38 = insertelement <4 x i32*> %35, i32* %37, i32 1
> %39 = extractelement <4 x i32> %induction, i32 2
> %40 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %39
> %41 = insertelement <4 x i32*> %38, i32* %40, i32 2
> %42 = extractelement <4 x i32> %induction, i32 3
> %43 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %42
> %44 = insertelement <4 x i32*> %41, i32* %43, i32 3
> %45 = extractelement <4 x i32> %induction, i32 0
> %46 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %45
> %47 = getelementptr i32* %46, i32 0
> %48 = bitcast i32* %47 to <4 x i32>*
> store <4 x i32> %32, <4 x i32>* %48, align 4
> %49 = add nsw <4 x i32> %induction, <i32 1, i32 1, i32 1, i32 1>
> %50 = icmp eq <4 x i32> %49, <i32 256, i32 256, i32 256, i32 256>
> %index.next = add i32 %index, 4
> %51 = icmp eq i32 %index.next, %end.idx.rnd.down
> br i1 %51, label %middle.block, label %vector.body
> 
> 
> ** Cost analysis:
> 
> Cost Model: Found an estimated cost of 1 for instruction: %induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
> Cost Model: Found an estimated cost of 1 for instruction: %0 = extractelement <4 x i32> %induction, i32 0
> Cost Model: Unknown cost for instruction: %1 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %0
> Cost Model: Found an estimated cost of 1 for instruction: %2 = insertelement <4 x i32*> undef, i32* %1, i32 0
> Cost Model: Found an estimated cost of 1 for instruction: %3 = extractelement <4 x i32> %induction, i32 1
> Cost Model: Unknown cost for instruction: %4 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %3
> Cost Model: Found an estimated cost of 1 for instruction: %5 = insertelement <4 x i32*> %2, i32* %4, i32 1
> Cost Model: Found an estimated cost of 1 for instruction: %6 = extractelement <4 x i32> %induction, i32 2
> Cost Model: Unknown cost for instruction: %7 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %6
> Cost Model: Found an estimated cost of 1 for instruction: %8 = insertelement <4 x i32*> %5, i32* %7, i32 2
> Cost Model: Found an estimated cost of 1 for instruction: %9 = extractelement <4 x i32> %induction, i32 3
> Cost Model: Unknown cost for instruction: %10 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %9
> Cost Model: Found an estimated cost of 1 for instruction: %11 = insertelement <4 x i32*> %8, i32* %10, i32 3
> Cost Model: Found an estimated cost of 1 for instruction: %12 = extractelement <4 x i32> %induction, i32 0
> Cost Model: Unknown cost for instruction: %13 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %12
> Cost Model: Unknown cost for instruction: %14 = getelementptr i32* %13, i32 0
> 
> 
> and so on...


