<div dir="ltr"><span style="font-family:arial,sans-serif;font-size:13px">If you'd like to contribute support for this, look at isHorizontalBinOp and go from there. Feel free to ask questions if you get stuck.</span><div class="" style="font-family:arial,sans-serif;font-size:13px"></div><div class="" style="font-family:arial,sans-serif;font-size:13px"><br></div><div class="" style="font-family:arial,sans-serif;font-size:13px">FWIW, I've looked at isHorizontalBinOp for inspiration for matching AArch64 ADDV-and-friends (horizontal reduction operations), and thought it was rather temperamental and noticed it being prone to breaking depending on the exact format of the IR. Given that we don't have a canonical form for reductions, I think it wrong that we expect targets to undo quite complex patterns.</div><div class="" style="font-family:arial,sans-serif;font-size:13px"><br></div><div class="" style="font-family:arial,sans-serif;font-size:13px">The reduction pattern is a log2(n) sequence of shuffles and binops, that are really rather complex. These sort of things should, IMHO, be intrinsics. I chatted with Arnold about this at the devmtg and was going to send a patch to do exactly that in a week or so.</div><div class="" style="font-family:arial,sans-serif;font-size:13px"><br></div><div class="" style="font-family:arial,sans-serif;font-size:13px">Cheers,</div><div class="" style="font-family:arial,sans-serif;font-size:13px"><br></div><div class="" style="font-family:arial,sans-serif;font-size:13px">James</div></div><div class="gmail_extra"><br><div class="gmail_quote">On 11 November 2014 13:35, Hal Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">----- Original Message -----<br>

> From: "Dibyendu Das" <<a href="mailto:Dibyendu.Das@amd.com">Dibyendu.Das@amd.com</a>><br>

> To: "Hal Finkel" <<a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>>, "Renato Golin" <<a href="mailto:renato.golin@linaro.org">renato.golin@linaro.org</a>><br>

> Cc: <a href="mailto:llvmdev@cs.uiuc.edu">llvmdev@cs.uiuc.edu</a><br>

</span><span class="">> Sent: Tuesday, November 4, 2014 12:15:12 PM<br>

> Subject: RE: [LLVMdev] supporting SAD in loop vectorizer<br>

><br>

> Here's the simple SAD code:<br>

> ---------------------------------------------------<br>

>   1 #include <stdlib.h><br>

>   2<br>

>   3 extern int ly,lx;<br>

>   4 int sad_c( unsigned char *pix1, unsigned char *pix2)<br>

>   5 {<br>

>   6             int i_sum = 0;<br>

>   7             for( int x = 0; x < lx; x++ )<br>

>   8                 i_sum += abs( pix1[x] - pix2[x] );<br>

>   9             return i_sum;<br>

>  10 }<br>

>  11<br>

> -----------------------------------------------------<br>

><br>

> The loop vectorizer does vectorize the loop and then unrolls it<br>

> twice. The main body of the loop at the end looks like below where<br>

> we see the icmp, neg select pattern appearing twice.<br>

> Are we saying we pattern match this to PSADBW in target ?<br>

<br>

</span>Yes.<br>

<span class=""><br>

> That seems<br>

> to have some challenges<br>

<br>

</span>It does, but we already have code in the backend that matches other horizontal operations (see isHorizontalBinOp and its callers in lib/Target/X86/X86ISelLowering.cpp), and I suspect this won't be significantly more complicated.<br>

<span class=""><br>

> including the fact that we would need a<br>

> 4-way unroll to use all of 128b PSADBWs. Or am I<br>

> missing something ?<br>

<br>

</span>No, each unrolling will get its own, so you'll get a PSADBW from each time the loop is unrolled. Each unrolling is vectorized in terms of <4 x i32>, and that is the 128 bits you need.<br>

<br>

If you'd like to contribute support for this, look at isHorizontalBinOp and go from there. Feel free to ask questions if you get stuck.<br>

<span class="HOEnZb"><font color="#888888"><br>

 -Hal<br>

</font></span><div class="HOEnZb"><div class="h5"><br>

><br>

> 2783 vector.body:                                      ; preds =<br>

> %vector.body.preheader, %vector.body<br>

> 2784   %index = phi i64 [ %index.next, %vector.body ], [ 0,<br>

> %vector.body.preheader ]<br>

> 2785   %vec.phi = phi <4 x i32> [ %24, %vector.body ], [<br>

> zeroinitializer, %vector.body.preheader ]<br>

> 2786   %vec.phi9 = phi <4 x i32> [ %25, %vector.body ], [<br>

> zeroinitializer, %vector.body.preheader ]<br>

> 2787   %4 = getelementptr inbounds i8* %pix1, i64 %index<br>

> 2788   %5 = bitcast i8* %4 to <4 x i8>*<br>

> 2789   %wide.load = load <4 x i8>* %5, align 1<br>

> 2790   %.sum19 = or i64 %index, 4<br>

> 2791   %6 = getelementptr i8* %pix1, i64 %.sum19<br>

> 2792   %7 = bitcast i8* %6 to <4 x i8>*<br>

> 2793   %wide.load10 = load <4 x i8>* %7, align 1<br>

> 2794   %8 = zext <4 x i8> %wide.load to <4 x i32><br>

> 2795   %9 = zext <4 x i8> %wide.load10 to <4 x i32><br>

> 2796   %10 = getelementptr inbounds i8* %pix2, i64 %index<br>

> 2797   %11 = bitcast i8* %10 to <4 x i8>*<br>

> 2798   %wide.load11 = load <4 x i8>* %11, align 1<br>

> 2799   %.sum20 = or i64 %index, 4<br>

> 2800   %12 = getelementptr i8* %pix2, i64 %.sum20<br>

> 2801   %13 = bitcast i8* %12 to <4 x i8>*<br>

> 2802   %wide.load12 = load <4 x i8>* %13, align 1<br>

> 2803   %14 = zext <4 x i8> %wide.load11 to <4 x i32><br>

> 2804   %15 = zext <4 x i8> %wide.load12 to <4 x i32><br>

> 2805   %16 = sub nsw <4 x i32> %8, %14<br>

> 2806   %17 = sub nsw <4 x i32> %9, %15<br>

> 2807   %18 = icmp sgt <4 x i32> %16, <i32 -1, i32 -1, i32 -1, i32 -1><br>

> 2808   %19 = icmp sgt <4 x i32> %17, <i32 -1, i32 -1, i32 -1, i32 -1><br>

> 2809   %20 = sub <4 x i32> zeroinitializer, %16<br>

> 2810   %21 = sub <4 x i32> zeroinitializer, %17<br>

> 2811   %22 = select <4 x i1> %18, <4 x i32> %16, <4 x i32> %20<br>

> 2812   %23 = select <4 x i1> %19, <4 x i32> %17, <4 x i32> %21<br>

> 2813   %24 = add nsw <4 x i32> %22, %vec.phi<br>

> 2814   %25 = add nsw <4 x i32> %23, %vec.phi9<br>

> 2815   %index.next = add i64 %index, 8<br>

> 2816   %26 = icmp eq i64 %index.next, %n.vec<br>

> 2817   br i1 %26, label %middle.block.loopexit, label %vector.body,<br>

> !llvm.loop !1<br>

> -----------------------------------------------------<br>

><br>

> -----Original Message-----<br>

> From: Hal Finkel [mailto:<a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>]<br>

> Sent: Tuesday, November 04, 2014 9:54 PM<br>

> To: Renato Golin<br>

> Cc: <a href="mailto:llvmdev@cs.uiuc.edu">llvmdev@cs.uiuc.edu</a>; Das, Dibyendu<br>

> Subject: Re: [LLVMdev] supporting SAD in loop vectorizer<br>

><br>

> ----- Original Message -----<br>

> > From: "Renato Golin" <<a href="mailto:renato.golin@linaro.org">renato.golin@linaro.org</a>><br>

> > To: "Dibyendu Das" <<a href="mailto:Dibyendu.Das@amd.com">Dibyendu.Das@amd.com</a>><br>

> > Cc: <a href="mailto:llvmdev@cs.uiuc.edu">llvmdev@cs.uiuc.edu</a><br>

> > Sent: Tuesday, November 4, 2014 5:23:30 AM<br>

> > Subject: Re: [LLVMdev] supporting SAD in loop vectorizer<br>

> ><br>

> > On 4 November 2014 11:06, Das, Dibyendu <<a href="mailto:Dibyendu.Das@amd.com">Dibyendu.Das@amd.com</a>><br>

> > wrote:<br>

> > > Is there any plan to support special idioms in the loop<br>

> > > vectorizer<br>

> > > like sum of absolute difference (SAD) ? We see some useful cases<br>

> > > where llvm is losing performance at -O3 due to SADs not being<br>

> > > vectorized (hence PSADBWs not being generated).<br>

> ><br>

> > It's been a while, but this could either be that the legalisation<br>

> > phase is not recognising the reduction or that the cost is not<br>

> > taking<br>

> > into account the lowered abs().<br>

> ><br>

> > What does -debug-only=loop-vectorize say about it?<br>

><br>

> FWIW, I agree, this sounds like a cost-model problem. The<br>

> loop-vectorizer should be able to vectorize the 'icmp; neg; select'<br>

> pattern, and then the backend can pattern-patch that with the<br>

> reduction (which is a series of shuffles and extract_element) into<br>

> the single instruction PSADBW -- we're quite likely missing the<br>

> target code to do that.<br>

><br>

>  -Hal<br>

><br>

> ><br>

> > cheers,<br>

> > --renato<br>

> > _______________________________________________<br>

> > LLVM Developers mailing list<br>

> > <a href="mailto:LLVMdev@cs.uiuc.edu">LLVMdev@cs.uiuc.edu</a>         <a href="http://llvm.cs.uiuc.edu" target="_blank">http://llvm.cs.uiuc.edu</a><br>

> > <a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a><br>

> ><br>

><br>

> --<br>

> Hal Finkel<br>

> Assistant Computational Scientist<br>

> Leadership Computing Facility<br>

> Argonne National Laboratory<br>

><br>

<br>

--<br>

Hal Finkel<br>

Assistant Computational Scientist<br>

Leadership Computing Facility<br>

Argonne National Laboratory<br>

_______________________________________________<br>

LLVM Developers mailing list<br>

<a href="mailto:LLVMdev@cs.uiuc.edu">LLVMdev@cs.uiuc.edu</a>         <a href="http://llvm.cs.uiuc.edu" target="_blank">http://llvm.cs.uiuc.edu</a><br>

<a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a><br>

</div></div></blockquote></div><br></div>