<div dir="ltr">On 4 February 2013 18:25, Arnold Schwaighofer <span dir="ltr"><<a href="mailto:aschwaighofer@apple.com" target="_blank">aschwaighofer@apple.com</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">For cases where this approach breaks really badly we could consider adding a specialized api or parameters (like the type of a user/use). But we should do so only as a last resort and backed by actual code that would benefit from doing so.<br>
</blockquote><div><br></div><div>Very sensible, and more or less what I had in mind. I think we could go one step further and answer some high-level questions, such as: is this cast associated with any arithmetic operation? It won't be perfect, but it adds a bit more information at very low cost. It would, however, require us to pass the Instruction rather than the Opcode.</div>
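<div><br></div><div>As a rough sketch of the difference between the two interfaces, assuming a toy IR: the Instr struct, the Opcode enum and both cost functions below are made up for illustration and are not LLVM's API, and the 0/1 costs (including the guess that a cast feeding arithmetic is the cheaper case) are placeholders.</div>

```cpp
#include <cassert>
#include <vector>

// Toy IR, invented for illustration only; none of these names are LLVM API.
enum class Opcode { Add, Mul, SExt, ZExt, Trunc, Load, Store };

struct Instr {
  Opcode Op;
  std::vector<const Instr *> Users; // instructions that consume this value
};

bool isArithmetic(Opcode Op) { return Op == Opcode::Add || Op == Opcode::Mul; }

// High-level question from the text: is this cast associated with any
// arithmetic operation? Answering it needs the users, hence the Instr.
bool feedsArithmetic(const Instr &I) {
  for (const Instr *U : I.Users)
    if (isArithmetic(U->Op))
      return true;
  return false;
}

// Opcode-only query: every cast gets the same flat (placeholder) cost.
unsigned castCostByOpcode(Opcode) { return 1; }

// Instruction-based query: ASSUMES a cast feeding arithmetic can be folded
// into its user and is therefore free; otherwise falls back to the flat cost.
unsigned castCostByInstr(const Instr &Cast) {
  return feedsArithmetic(Cast) ? 0 : castCostByOpcode(Cast.Op);
}
```

<div>With this shape, a sign extension whose only user is a multiply would be priced differently from one whose only user is a store, which the opcode-only query cannot express.</div>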
<div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Do you have an example where this is happening?</blockquote>
<div><br></div><div style>The example below does not stop the vectorizer, but the cost model charges for a lot of instructions that cost nothing in the generated code...</div><div style><br></div><div>** C code:</div><div><br></div><div> int direct (int k) {</div>
<div> int i;</div><div> int a[256], b[256], c[256];</div><div><br></div><div> for (i=0; i<256; i++){</div><div> a[i] = b[i] * c[i];</div>
<div> }</div><div> return a[k];</div><div>}</div><div><br></div><div>** ASM vectorized result:</div><div><br></div><div><div> adr r5, .LCPI0_0</div><div> vdup.32 q9, r1</div><div> vld1.64 {d16, d17}, [r5, :128]</div>
<div> add r1, r1, #4</div><div> vadd.i32 q8, q9, q8</div><div> cmp r3, r1</div><div> vmov.32 r5, d16[0]</div><div> add r6, lr, r5, lsl #2</div><div> add r7, r2, r5, lsl #2</div>
<div> vld1.32 {d16, d17}, [r6]</div><div> add r5, r4, r5, lsl #2</div><div> vld1.32 {d18, d19}, [r7]</div><div> vmul.i32 q8, q9, q8</div><div> vst1.32 {d16, d17}, [r5]</div><div>
bne .LBB0_2</div><div><br></div></div><div>** Vectorized IR (just the loop):<br></div><div><br></div><div><div>vector.body: ; preds = %vector.body, %vector.ph</div>
<div> %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]</div><div> %broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0</div><div>
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer</div>
<div> %induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3></div><div> %0 = extractelement <4 x i32> %induction, i32 0</div><div> %1 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %0</div>
<div> %2 = insertelement <4 x i32*> undef, i32* %1, i32 0</div><div> %3 = extractelement <4 x i32> %induction, i32 1</div><div> %4 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %3</div><div> %5 = insertelement <4 x i32*> %2, i32* %4, i32 1</div>
<div> %6 = extractelement <4 x i32> %induction, i32 2</div><div> %7 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %6</div><div> %8 = insertelement <4 x i32*> %5, i32* %7, i32 2</div><div> %9 = extractelement <4 x i32> %induction, i32 3</div>
<div> %10 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %9</div><div> %11 = insertelement <4 x i32*> %8, i32* %10, i32 3</div><div> %12 = extractelement <4 x i32> %induction, i32 0</div><div> %13 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %12</div>
<div> %14 = getelementptr i32* %13, i32 0</div><div> %15 = bitcast i32* %14 to <4 x i32>*</div><div> %wide.load = load <4 x i32>* %15, align 4</div><div><div> %16 = extractelement <4 x i32> %induction, i32 0</div>
<div> %17 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %16</div><div> %18 = insertelement <4 x i32*> undef, i32* %17, i32 0</div><div> %19 = extractelement <4 x i32> %induction, i32 1</div><div> %20 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %19</div>
<div> %21 = insertelement <4 x i32*> %18, i32* %20, i32 1</div><div> %22 = extractelement <4 x i32> %induction, i32 2</div><div> %23 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %22</div><div> %24 = insertelement <4 x i32*> %21, i32* %23, i32 2</div>
<div> %25 = extractelement <4 x i32> %induction, i32 3</div><div> %26 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %25</div><div> %27 = insertelement <4 x i32*> %24, i32* %26, i32 3</div><div> %28 = extractelement <4 x i32> %induction, i32 0</div>
<div> %29 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %28</div><div> %30 = getelementptr i32* %29, i32 0</div><div> %31 = bitcast i32* %30 to <4 x i32>*</div><div> %wide.load3 = load <4 x i32>* %31, align 4</div>
<div> %32 = mul nsw <4 x i32> %wide.load3, %wide.load</div><div> %33 = extractelement <4 x i32> %induction, i32 0</div><div> %34 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %33</div><div> %35 = insertelement <4 x i32*> undef, i32* %34, i32 0</div>
<div> %36 = extractelement <4 x i32> %induction, i32 1</div><div> %37 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %36</div><div> %38 = insertelement <4 x i32*> %35, i32* %37, i32 1</div><div> %39 = extractelement <4 x i32> %induction, i32 2</div>
<div> %40 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %39</div><div> %41 = insertelement <4 x i32*> %38, i32* %40, i32 2</div><div> %42 = extractelement <4 x i32> %induction, i32 3</div><div> %43 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %42</div>
<div> %44 = insertelement <4 x i32*> %41, i32* %43, i32 3</div><div> %45 = extractelement <4 x i32> %induction, i32 0</div><div> %46 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %45</div><div> %47 = getelementptr i32* %46, i32 0</div>
<div> %48 = bitcast i32* %47 to <4 x i32>*</div><div> store <4 x i32> %32, <4 x i32>* %48, align 4</div><div> %49 = add nsw <4 x i32> %induction, <i32 1, i32 1, i32 1, i32 1></div><div> %50 = icmp eq <4 x i32> %49, <i32 256, i32 256, i32 256, i32 256></div>
<div> %index.next = add i32 %index, 4</div><div> %51 = icmp eq i32 %index.next, %end.idx.rnd.down</div><div> br i1 %51, label %middle.block, label %vector.body</div></div><div><br></div><div>** Cost analysis:</div>
<div><br></div><div><div>Cost Model: Found an estimated cost of 1 for instruction: %induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3></div><div>Cost Model: Found an estimated cost of 1 for instruction: %0 = extractelement <4 x i32> %induction, i32 0</div>
<div>Cost Model: Unknown cost for instruction: %1 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %0</div><div>Cost Model: Found an estimated cost of 1 for instruction: %2 = insertelement <4 x i32*> undef, i32* %1, i32 0</div>
<div>Cost Model: Found an estimated cost of 1 for instruction: %3 = extractelement <4 x i32> %induction, i32 1</div><div>Cost Model: Unknown cost for instruction: %4 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %3</div>
<div>Cost Model: Found an estimated cost of 1 for instruction: %5 = insertelement <4 x i32*> %2, i32* %4, i32 1</div><div>Cost Model: Found an estimated cost of 1 for instruction: %6 = extractelement <4 x i32> %induction, i32 2</div>
<div>Cost Model: Unknown cost for instruction: %7 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %6</div><div>Cost Model: Found an estimated cost of 1 for instruction: %8 = insertelement <4 x i32*> %5, i32* %7, i32 2</div>
<div>Cost Model: Found an estimated cost of 1 for instruction: %9 = extractelement <4 x i32> %induction, i32 3</div><div>Cost Model: Unknown cost for instruction: %10 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %9</div>
<div>Cost Model: Found an estimated cost of 1 for instruction: %11 = insertelement <4 x i32*> %8, i32* %10, i32 3</div><div>Cost Model: Found an estimated cost of 1 for instruction: %12 = extractelement <4 x i32> %induction, i32 0</div>
<div>Cost Model: Unknown cost for instruction: %13 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %12</div><div>Cost Model: Unknown cost for instruction: %14 = getelementptr i32* %13, i32 0</div><div><br></div>
<div>
and so on...</div></div></div></div></div></div>
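<div><br></div><div>To put a number on a dump like the one above, the lines can be tallied mechanically. This is a minimal sketch that assumes the output has exactly the two line shapes quoted above ("Found an estimated cost of N" and "Unknown cost"); the CostSummary struct and the summarize function are hypothetical helpers, not part of LLVM.</div>

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper type: totals gathered from a cost-model dump.
struct CostSummary {
  unsigned Known = 0;   // lines that report an estimated cost
  unsigned Unknown = 0; // lines that report "Unknown cost"
  unsigned Total = 0;   // sum of all estimated costs
};

// Scans the dump line by line, assuming each line starts with one of the
// two prefixes seen in the output above; any other line is ignored.
CostSummary summarize(const std::string &Log) {
  static const std::string Found = "Cost Model: Found an estimated cost of ";
  static const std::string Unk = "Cost Model: Unknown cost for instruction:";
  CostSummary S;
  std::istringstream In(Log);
  std::string Line;
  while (std::getline(In, Line)) {
    if (Line.compare(0, Found.size(), Found) == 0) {
      // std::stoul stops at the first non-digit, so "1 for ..." parses as 1.
      S.Total += static_cast<unsigned>(std::stoul(Line.substr(Found.size())));
      ++S.Known;
    } else if (Line.compare(0, Unk.size(), Unk) == 0) {
      ++S.Unknown;
    }
  }
  return S;
}
```

<div>On the sixteen lines quoted above this would count ten instructions with a cost of 1 each (a total of 10) and six getelementptr instructions with unknown cost; nine of the ten costed instructions are extractelement or insertelement.</div>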