<div dir="ltr">On 4 February 2013 18:25, Arnold Schwaighofer <span dir="ltr"><<a href="mailto:aschwaighofer@apple.com" target="_blank">aschwaighofer@apple.com</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote">


<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">For cases where this approach breaks really badly we could consider adding a specialized api or parameters (like the type of a user/use). But we should do so only as a last resort and backed by actual code that would benefit from doing so.<br>


</blockquote><div><br></div><div>Very sensible, and more or less what I had in mind. I think we could go one step further and answer some high-level questions, such as: is this cast associated with any arithmetic operation? It won't be perfect, but it adds a bit more information at very low cost. It would, however, require us to pass the Instruction rather than the Opcode.</div>
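<div><br></div><div>To make the Instruction-versus-Opcode point concrete, here is a minimal sketch in plain C of what such a query could look like. Everything in it is hypothetical: the Opcode enum, the Instruction struct and both cost functions are toy stand-ins rather than LLVM's API, and the rule that a cast with an arithmetic user costs nothing is only an assumption chosen for illustration.</div>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy opcodes for illustration only; this is not LLVM's opcode enum. */
typedef enum { OP_ADD, OP_MUL, OP_SEXT, OP_ZEXT, OP_LOAD, OP_STORE } Opcode;

#define MAX_USERS 4

/* Toy instruction: an opcode plus the instructions that consume its result. */
typedef struct Instruction {
    Opcode opcode;
    const struct Instruction *users[MAX_USERS];
    size_t num_users;
} Instruction;

static bool is_cast(Opcode op) { return op == OP_SEXT || op == OP_ZEXT; }
static bool is_arithmetic(Opcode op) { return op == OP_ADD || op == OP_MUL; }

/* Opcode-only query: every cast gets the same (made-up) cost of 1,
 * because the opcode alone says nothing about how the result is used. */
int cast_cost_by_opcode(Opcode op) {
    return is_cast(op) ? 1 : 0;
}

/* Instruction-based query. ASSUMPTION for this sketch: a cast with at
 * least one arithmetic user is folded into that user and costs 0;
 * any other cast keeps the made-up cost of 1. */
int cast_cost_by_instruction(const Instruction *inst) {
    assert(inst != NULL && inst->num_users <= MAX_USERS);
    if (!is_cast(inst->opcode))
        return 0;
    for (size_t i = 0; i < inst->num_users; i++) {
        if (is_arithmetic(inst->users[i]->opcode))
            return 0;
    }
    return 1;
}
```

<div>For a sext whose only user is a mul, cast_cost_by_opcode returns 1 while cast_cost_by_instruction returns 0; for a cast that only feeds a store, both return 1. The extra information comes purely from being able to walk the users, which is why the Opcode alone is not enough.</div>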


<div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Do you have an example where this is happening?</blockquote>


<div><br></div><div style>The example below does not stop the vectorizer, but it does add a lot of cost where there is none...</div><div style><br></div><div>** C code:</div><div><br></div><div> int direct (int k) {</div>

<div>  int i;</div><div>  int a[256], b[256], c[256];</div><div><br></div><div>  for (i=0; i<256; i++){</div><div>    a[i] = b[i] * c[i];</div>

<div>  }</div><div>  return a[k];</div><div>}</div><div><br></div><div>** ASM vectorized result:</div><div><br></div><div><div>        adr     r5, .LCPI0_0</div><div>        vdup.32 q9, r1</div><div>        vld1.64 {d16, d17}, [r5, :128]</div>


<div>        add     r1, r1, #4</div><div>        vadd.i32        q8, q9, q8</div><div>        cmp     r3, r1</div><div>        vmov.32 r5, d16[0]</div><div>        add     r6, lr, r5, lsl #2</div><div>        add     r7, r2, r5, lsl #2</div>


<div>        vld1.32 {d16, d17}, [r6]</div><div>        add     r5, r4, r5, lsl #2</div><div>        vld1.32 {d18, d19}, [r7]</div><div>        vmul.i32        q8, q9, q8</div><div>        vst1.32 {d16, d17}, [r5]</div><div>


        bne     .LBB0_2</div><div><br></div></div><div>** Vectorized IR (just the loop):<br></div><div><br></div><div><div>vector.body:                                      ; preds = %vector.body, %vector.ph</div>


<div>  %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]</div><div>  %broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0</div><div>

  %broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer</div>

<div>  %induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3></div><div>  %0 = extractelement <4 x i32> %induction, i32 0</div><div>  %1 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %0</div>


<div>  %2 = insertelement <4 x i32*> undef, i32* %1, i32 0</div><div>  %3 = extractelement <4 x i32> %induction, i32 1</div><div>  %4 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %3</div><div>  %5 = insertelement <4 x i32*> %2, i32* %4, i32 1</div>


<div>  %6 = extractelement <4 x i32> %induction, i32 2</div><div>  %7 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %6</div><div>  %8 = insertelement <4 x i32*> %5, i32* %7, i32 2</div><div>  %9 = extractelement <4 x i32> %induction, i32 3</div>


<div>  %10 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %9</div><div>  %11 = insertelement <4 x i32*> %8, i32* %10, i32 3</div><div>  %12 = extractelement <4 x i32> %induction, i32 0</div><div>  %13 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %12</div>


<div>  %14 = getelementptr i32* %13, i32 0</div><div>  %15 = bitcast i32* %14 to <4 x i32>*</div><div>  %wide.load = load <4 x i32>* %15, align 4</div><div><div>  %16 = extractelement <4 x i32> %induction, i32 0</div>


<div>  %17 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %16</div><div>  %18 = insertelement <4 x i32*> undef, i32* %17, i32 0</div><div>  %19 = extractelement <4 x i32> %induction, i32 1</div><div>  %20 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %19</div>


<div>  %21 = insertelement <4 x i32*> %18, i32* %20, i32 1</div><div>  %22 = extractelement <4 x i32> %induction, i32 2</div><div>  %23 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %22</div><div>  %24 = insertelement <4 x i32*> %21, i32* %23, i32 2</div>


<div>  %25 = extractelement <4 x i32> %induction, i32 3</div><div>  %26 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %25</div><div>  %27 = insertelement <4 x i32*> %24, i32* %26, i32 3</div><div>  %28 = extractelement <4 x i32> %induction, i32 0</div>


<div>  %29 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %28</div><div>  %30 = getelementptr i32* %29, i32 0</div><div>  %31 = bitcast i32* %30 to <4 x i32>*</div><div>  %wide.load3 = load <4 x i32>* %31, align 4</div>


<div>  %32 = mul nsw <4 x i32> %wide.load3, %wide.load</div><div>  %33 = extractelement <4 x i32> %induction, i32 0</div><div>  %34 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %33</div><div>  %35 = insertelement <4 x i32*> undef, i32* %34, i32 0</div>


<div>  %36 = extractelement <4 x i32> %induction, i32 1</div><div>  %37 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %36</div><div>  %38 = insertelement <4 x i32*> %35, i32* %37, i32 1</div><div>  %39 = extractelement <4 x i32> %induction, i32 2</div>


<div>  %40 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %39</div><div>  %41 = insertelement <4 x i32*> %38, i32* %40, i32 2</div><div>  %42 = extractelement <4 x i32> %induction, i32 3</div><div>  %43 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %42</div>


<div>  %44 = insertelement <4 x i32*> %41, i32* %43, i32 3</div><div>  %45 = extractelement <4 x i32> %induction, i32 0</div><div>  %46 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %45</div><div>  %47 = getelementptr i32* %46, i32 0</div>


<div>  %48 = bitcast i32* %47 to <4 x i32>*</div><div>  store <4 x i32> %32, <4 x i32>* %48, align 4</div><div>  %49 = add nsw <4 x i32> %induction, <i32 1, i32 1, i32 1, i32 1></div><div>  %50 = icmp eq <4 x i32> %49, <i32 256, i32 256, i32 256, i32 256></div>


<div>  %index.next = add i32 %index, 4</div><div>  %51 = icmp eq i32 %index.next, %end.idx.rnd.down</div><div>  br i1 %51, label %middle.block, label %vector.body</div></div><div><br></div><div>** Cost analysis:</div>

<div><br></div><div><div>Cost Model: Found an estimated cost of 1 for instruction:   %induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3></div><div>Cost Model: Found an estimated cost of 1 for instruction:   %0 = extractelement <4 x i32> %induction, i32 0</div>


<div>Cost Model: Unknown cost for instruction:   %1 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %0</div><div>Cost Model: Found an estimated cost of 1 for instruction:   %2 = insertelement <4 x i32*> undef, i32* %1, i32 0</div>


<div>Cost Model: Found an estimated cost of 1 for instruction:   %3 = extractelement <4 x i32> %induction, i32 1</div><div>Cost Model: Unknown cost for instruction:   %4 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %3</div>


<div>Cost Model: Found an estimated cost of 1 for instruction:   %5 = insertelement <4 x i32*> %2, i32* %4, i32 1</div><div>Cost Model: Found an estimated cost of 1 for instruction:   %6 = extractelement <4 x i32> %induction, i32 2</div>


<div>Cost Model: Unknown cost for instruction:   %7 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %6</div><div>Cost Model: Found an estimated cost of 1 for instruction:   %8 = insertelement <4 x i32*> %5, i32* %7, i32 2</div>


<div>Cost Model: Found an estimated cost of 1 for instruction:   %9 = extractelement <4 x i32> %induction, i32 3</div><div>Cost Model: Unknown cost for instruction:   %10 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %9</div>


<div>Cost Model: Found an estimated cost of 1 for instruction:   %11 = insertelement <4 x i32*> %8, i32* %10, i32 3</div><div>Cost Model: Found an estimated cost of 1 for instruction:   %12 = extractelement <4 x i32> %induction, i32 0</div>


<div>Cost Model: Unknown cost for instruction:   %13 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %12</div><div>Cost Model: Unknown cost for instruction:   %14 = getelementptr i32* %13, i32 0</div><div><br></div>

<div>

and so on...</div></div></div></div></div></div>