<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<font style="font-size:11.0pt;font-family:'Calibri','sans-serif';color:#1F497D">Thanks, James and Hal. I will take a look at isHorizontalBinOp and get back to you.
</font><br>
<br>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">
<font style="font-size:10.0pt;font-family:'Tahoma','sans-serif'"><b>From</b>: James Molloy [mailto:james@jamesmolloy.co.uk]
<br>
<b>Sent</b>: Tuesday, November 11, 2014 08:21 AM Central Standard Time<br>
<b>To</b>: Hal Finkel &lt;hfinkel@anl.gov&gt; <br>
<b>Cc</b>: Das, Dibyendu; llvmdev@cs.uiuc.edu &lt;llvmdev@cs.uiuc.edu&gt; <br>
<b>Subject</b>: Re: [LLVMdev] supporting SAD in loop vectorizer <br>
</font> <br>
</div>
<div dir="ltr"><span style="font-family:arial,sans-serif;font-size:13px">If you'd like to contribute support for this, look at isHorizontalBinOp and go from there. Feel free to ask questions if you get stuck.</span>
<div class="" style="font-family:arial,sans-serif;font-size:13px"><br>
</div>
<div class="" style="font-family:arial,sans-serif;font-size:13px">FWIW, I've looked at isHorizontalBinOp for inspiration when matching AArch64 ADDV and friends (horizontal reduction operations). I found it rather temperamental: it is prone
to breaking depending on the exact form of the IR. Given that we don't have a canonical form for reductions, I think it is wrong to expect targets to undo quite complex patterns.</div>
<div class="" style="font-family:arial,sans-serif;font-size:13px"><br>
</div>
<div class="" style="font-family:arial,sans-serif;font-size:13px">The reduction pattern is a log2(n) sequence of shuffles and binops, which is really rather complex. These sorts of things should, IMHO, be intrinsics. I chatted with Arnold about this at the devmtg
and was going to send a patch to do exactly that in a week or so.</div>
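<div>To make the "log2(n) sequence of shuffles and binops" concrete, here is a minimal scalar sketch in C. The helper name reduce_add_log2 is hypothetical and this is not the code the vectorizer emits; each outer iteration simply stands in for one shuffle followed by one lane-wise add.</div>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum the n lanes of v in log2(n) steps.
   Each outer iteration models one shuffle (bring the upper half of the
   live lanes down) followed by one lane-wise add.
   n must be a power of two. v is overwritten; the sum ends up in lane 0. */
static int32_t reduce_add_log2(int32_t *v, size_t n)
{
    assert(n != 0 && (n & (n - 1)) == 0);
    for (size_t half = n / 2; half >= 1; half /= 2) {
        for (size_t i = 0; i < half; i++)
            v[i] += v[i + half];   /* shuffle + add */
    }
    return v[0];
}
```

<div>For a 4-lane vector this takes two steps (half = 2, then half = 1), which is the shape a target currently has to recognise in the IR.</div>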
<div class="" style="font-family:arial,sans-serif;font-size:13px"><br>
</div>
<div class="" style="font-family:arial,sans-serif;font-size:13px">Cheers,</div>
<div class="" style="font-family:arial,sans-serif;font-size:13px"><br>
</div>
<div class="" style="font-family:arial,sans-serif;font-size:13px">James</div>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On 11 November 2014 13:35, Hal Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<span class="">----- Original Message -----<br>
> From: "Dibyendu Das" <<a href="mailto:Dibyendu.Das@amd.com">Dibyendu.Das@amd.com</a>><br>
> To: "Hal Finkel" <<a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>>, "Renato Golin" <<a href="mailto:renato.golin@linaro.org">renato.golin@linaro.org</a>><br>
> Cc: <a href="mailto:llvmdev@cs.uiuc.edu">llvmdev@cs.uiuc.edu</a><br>
</span><span class="">> Sent: Tuesday, November 4, 2014 12:15:12 PM<br>
> Subject: RE: [LLVMdev] supporting SAD in loop vectorizer<br>
><br>
> Here's the simple SAD code:<br>
> ---------------------------------------------------<br>
> #include &lt;stdlib.h&gt;<br>
><br>
> extern int ly, lx;<br>
> int sad_c( unsigned char *pix1, unsigned char *pix2)<br>
> {<br>
> int i_sum = 0;<br>
> for( int x = 0; x &lt; lx; x++ )<br>
> i_sum += abs( pix1[x] - pix2[x] );<br>
> return i_sum;<br>
> }<br>
><br>
> -----------------------------------------------------<br>
><br>
> The loop vectorizer does vectorize the loop and then unrolls it<br>
> twice. The main body of the loop ends up as shown below, where<br>
> the icmp, neg, select pattern appears twice.<br>
> Are we saying we should pattern-match this to PSADBW in the target?<br>
<br>
</span>Yes.<br>
<span class=""><br>
> That seems<br>
> to have some challenges<br>
<br>
</span>It does, but we already have code in the backend that matches other horizontal operations (see isHorizontalBinOp and its callers in lib/Target/X86/X86ISelLowering.cpp), and I suspect this won't be significantly more complicated.<br>
<span class=""><br>
> including the fact that we would need a<br>
> 4-way unroll to use the full width of a 128-bit PSADBW. Or am I<br>
> missing something?<br>
<br>
</span>No, each unrolling will get its own, so you'll get a PSADBW from each time the loop is unrolled. Each unrolling is vectorized in terms of <4 x i32>, and that is the 128 bits you need.<br>
<br>
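For reference, a scalar model of what the 128-bit form of PSADBW computes may help when deciding how to match it. This is only a sketch: the function name psadbw_model is hypothetical, and how the loop's vector values map onto the instruction's operands is not addressed here.<br>

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of 128-bit PSADBW: for each 8-byte half, sum the absolute
   differences of the unsigned bytes. The sum is at most 8 * 255, so it
   fits in 16 bits; it is stored zero-extended in a 64-bit lane. */
static void psadbw_model(const uint8_t a[16], const uint8_t b[16],
                         uint64_t out[2])
{
    for (int half = 0; half < 2; half++) {
        uint64_t sum = 0;
        for (int i = 0; i < 8; i++)
            sum += (uint64_t)abs((int)a[8 * half + i] - (int)b[8 * half + i]);
        out[half] = sum;
    }
}
```

The instruction therefore already performs part of the horizontal reduction; the two partial sums still have to be added to get one scalar.<br>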
If you'd like to contribute support for this, look at isHorizontalBinOp and go from there. Feel free to ask questions if you get stuck.<br>
<span class="HOEnZb"><font color="#888888"><br>
-Hal<br>
</font></span>
<div class="HOEnZb">
<div class="h5"><br>
><br>
> vector.body: ; preds = %vector.body.preheader, %vector.body<br>
> %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]<br>
> %vec.phi = phi &lt;4 x i32&gt; [ %24, %vector.body ], [ zeroinitializer, %vector.body.preheader ]<br>
> %vec.phi9 = phi &lt;4 x i32&gt; [ %25, %vector.body ], [ zeroinitializer, %vector.body.preheader ]<br>
> %4 = getelementptr inbounds i8* %pix1, i64 %index<br>
> %5 = bitcast i8* %4 to &lt;4 x i8&gt;*<br>
> %wide.load = load &lt;4 x i8&gt;* %5, align 1<br>
> %.sum19 = or i64 %index, 4<br>
> %6 = getelementptr i8* %pix1, i64 %.sum19<br>
> %7 = bitcast i8* %6 to &lt;4 x i8&gt;*<br>
> %wide.load10 = load &lt;4 x i8&gt;* %7, align 1<br>
> %8 = zext &lt;4 x i8&gt; %wide.load to &lt;4 x i32&gt;<br>
> %9 = zext &lt;4 x i8&gt; %wide.load10 to &lt;4 x i32&gt;<br>
> %10 = getelementptr inbounds i8* %pix2, i64 %index<br>
> %11 = bitcast i8* %10 to &lt;4 x i8&gt;*<br>
> %wide.load11 = load &lt;4 x i8&gt;* %11, align 1<br>
> %.sum20 = or i64 %index, 4<br>
> %12 = getelementptr i8* %pix2, i64 %.sum20<br>
> %13 = bitcast i8* %12 to &lt;4 x i8&gt;*<br>
> %wide.load12 = load &lt;4 x i8&gt;* %13, align 1<br>
> %14 = zext &lt;4 x i8&gt; %wide.load11 to &lt;4 x i32&gt;<br>
> %15 = zext &lt;4 x i8&gt; %wide.load12 to &lt;4 x i32&gt;<br>
> %16 = sub nsw &lt;4 x i32&gt; %8, %14<br>
> %17 = sub nsw &lt;4 x i32&gt; %9, %15<br>
> %18 = icmp sgt &lt;4 x i32&gt; %16, &lt;i32 -1, i32 -1, i32 -1, i32 -1&gt;<br>
> %19 = icmp sgt &lt;4 x i32&gt; %17, &lt;i32 -1, i32 -1, i32 -1, i32 -1&gt;<br>
> %20 = sub &lt;4 x i32&gt; zeroinitializer, %16<br>
> %21 = sub &lt;4 x i32&gt; zeroinitializer, %17<br>
> %22 = select &lt;4 x i1&gt; %18, &lt;4 x i32&gt; %16, &lt;4 x i32&gt; %20<br>
> %23 = select &lt;4 x i1&gt; %19, &lt;4 x i32&gt; %17, &lt;4 x i32&gt; %21<br>
> %24 = add nsw &lt;4 x i32&gt; %22, %vec.phi<br>
> %25 = add nsw &lt;4 x i32&gt; %23, %vec.phi9<br>
> %index.next = add i64 %index, 8<br>
> %26 = icmp eq i64 %index.next, %n.vec<br>
> br i1 %26, label %middle.block.loopexit, label %vector.body, !llvm.loop !1<br>
> -----------------------------------------------------<br>
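The vector body above can be read as the following scalar C model. This is a sketch under stated assumptions: the names abs_select and sad_unrolled_model are hypothetical, n is assumed to be a multiple of 8, and the scalar remainder loop and the final horizontal reduction of the real code are replaced by a plain loop.<br>

```c
#include <stdint.h>

/* abs() as it appears in the IR: icmp sgt x, -1; sub 0, x; select.
   Inputs here are differences of zero-extended bytes, so they lie in
   [-255, 255] and the negation cannot overflow. */
static int32_t abs_select(int32_t x)
{
    int32_t neg = 0 - x;           /* sub zeroinitializer, x */
    return (x > -1) ? x : neg;     /* icmp sgt + select */
}

/* Hypothetical scalar model of the vector body: two 4-lane accumulators
   (vec.phi and vec.phi9), 8 bytes consumed per iteration.
   Assumes n is a multiple of 8; the remainder loop is omitted. */
static int32_t sad_unrolled_model(const uint8_t *pix1, const uint8_t *pix2,
                                  int64_t n)
{
    int32_t acc0[4] = {0}, acc1[4] = {0};
    for (int64_t index = 0; index < n; index += 8) {
        for (int lane = 0; lane < 4; lane++) {
            acc0[lane] += abs_select((int32_t)pix1[index + lane] -
                                     (int32_t)pix2[index + lane]);
            acc1[lane] += abs_select((int32_t)pix1[index + 4 + lane] -
                                     (int32_t)pix2[index + 4 + lane]);
        }
    }
    /* Stand-in for the reduction done after the loop. */
    int32_t sum = 0;
    for (int lane = 0; lane < 4; lane++)
        sum += acc0[lane] + acc1[lane];
    return sum;
}
```

Each accumulator corresponds to one copy of the icmp, neg, select sequence, which is why the pattern shows up twice in the listing.<br>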
><br>
> -----Original Message-----<br>
> From: Hal Finkel [mailto:<a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>]<br>
> Sent: Tuesday, November 04, 2014 9:54 PM<br>
> To: Renato Golin<br>
> Cc: <a href="mailto:llvmdev@cs.uiuc.edu">llvmdev@cs.uiuc.edu</a>; Das, Dibyendu<br>
> Subject: Re: [LLVMdev] supporting SAD in loop vectorizer<br>
><br>
> ----- Original Message -----<br>
> > From: "Renato Golin" <<a href="mailto:renato.golin@linaro.org">renato.golin@linaro.org</a>><br>
> > To: "Dibyendu Das" <<a href="mailto:Dibyendu.Das@amd.com">Dibyendu.Das@amd.com</a>><br>
> > Cc: <a href="mailto:llvmdev@cs.uiuc.edu">llvmdev@cs.uiuc.edu</a><br>
> > Sent: Tuesday, November 4, 2014 5:23:30 AM<br>
> > Subject: Re: [LLVMdev] supporting SAD in loop vectorizer<br>
> ><br>
> > On 4 November 2014 11:06, Das, Dibyendu <<a href="mailto:Dibyendu.Das@amd.com">Dibyendu.Das@amd.com</a>><br>
> > wrote:<br>
> > > Is there any plan to support special idioms in the loop<br>
> > > vectorizer<br>
> > > like sum of absolute difference (SAD) ? We see some useful cases<br>
> > > where llvm is losing performance at -O3 due to SADs not being<br>
> > > vectorized (hence PSADBWs not being generated).<br>
> ><br>
> > It's been a while, but this could either be that the legalisation<br>
> > phase is not recognising the reduction or that the cost is not<br>
> > taking<br>
> > into account the lowered abs().<br>
> ><br>
> > What does -debug-only=loop-vectorize say about it?<br>
><br>
> FWIW, I agree, this sounds like a cost-model problem. The<br>
> loop-vectorizer should be able to vectorize the 'icmp; neg; select'<br>
> pattern, and then the backend can pattern-match that with the<br>
> reduction (which is a series of shuffles and extract_element) into<br>
> the single instruction PSADBW -- we're quite likely missing the<br>
> target code to do that.<br>
><br>
> -Hal<br>
><br>
> ><br>
> > cheers,<br>
> > --renato<br>
> > _______________________________________________<br>
> > LLVM Developers mailing list<br>
> > <a href="mailto:LLVMdev@cs.uiuc.edu">LLVMdev@cs.uiuc.edu</a> <a href="http://llvm.cs.uiuc.edu" target="_blank">http://llvm.cs.uiuc.edu</a><br>
> > <a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a><br>
> ><br>
><br>
> --<br>
> Hal Finkel<br>
> Assistant Computational Scientist<br>
> Leadership Computing Facility<br>
> Argonne National Laboratory<br>
><br>
<br>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</body>
</html>