<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

<meta name="Generator" content="Microsoft Word 14 (filtered medium)">

<style><!--

/* Font Definitions */

@font-face

        {font-family:Calibri;

        panose-1:2 15 5 2 2 2 4 3 2 4;}

@font-face

        {font-family:Tahoma;

        panose-1:2 11 6 4 3 5 4 4 2 4;}

/* Style Definitions */

p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0in;

        margin-bottom:.0001pt;

        font-size:12.0pt;

        font-family:"Times New Roman","serif";}

a:link, span.MsoHyperlink

        {mso-style-priority:99;

        color:blue;

        text-decoration:underline;}

a:visited, span.MsoHyperlinkFollowed

        {mso-style-priority:99;

        color:purple;

        text-decoration:underline;}

p.MsoAcetate, li.MsoAcetate, div.MsoAcetate

        {mso-style-priority:99;

        mso-style-link:"Balloon Text Char";

        margin:0in;

        margin-bottom:.0001pt;

        font-size:8.0pt;

        font-family:"Tahoma","sans-serif";}

p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph

        {mso-style-priority:34;

        margin-top:0in;

        margin-right:0in;

        margin-bottom:0in;

        margin-left:.5in;

        margin-bottom:.0001pt;

        font-size:12.0pt;

        font-family:"Times New Roman","serif";}

span.BalloonTextChar

        {mso-style-name:"Balloon Text Char";

        mso-style-priority:99;

        mso-style-link:"Balloon Text";

        font-family:"Tahoma","sans-serif";}

span.EmailStyle20

        {mso-style-type:personal-reply;

        font-family:"Calibri","sans-serif";

        color:#1F497D;}

.MsoChpDefault

        {mso-style-type:export-only;

        font-size:10.0pt;}

@page WordSection1

        {size:8.5in 11.0in;

        margin:1.0in 1.0in 1.0in 1.0in;}

div.WordSection1

        {page:WordSection1;}

--></style><!--[if gte mso 9]><xml>

<o:shapedefaults v:ext="edit" spidmax="1026" />

</xml><![endif]--><!--[if gte mso 9]><xml>

<o:shapelayout v:ext="edit">

<o:idmap v:ext="edit" data="1" />

</o:shapelayout></xml><![endif]-->

</head>

<body lang="EN-US" link="blue" vlink="purple">

<div class="WordSection1">

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">Hi James,<o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"><o:p> </o:p></span></p>

<div style="border:none;border-left:solid blue 1.5pt;padding:0in 0in 0in 4.0pt">

<div>

<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">

<p class="MsoNormal"><b><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">From:</span></b><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif""> James Molloy [mailto:james@jamesmolloy.co.uk]

<br>

<b>Sent:</b> Monday, April 20, 2015 2:03 PM<br>

<b>To:</b> Shahid, Asghar-ahmad; reviews+D9029+public+eee34c83a2c6f996@reviews.llvm.org; hfinkel@anl.gov; spatel@rotateright.com; aschwaighofer@apple.com; ahmed.bougacha@gmail.com<br>

<b>Cc:</b> llvm-commits@cs.uiuc.edu<br>

<b>Subject:</b> Re: [PATCH] [PATCH][CodeGen] Adding "llvm.sad" intrinsic and corresponding ISD::SAD node for "Sum Of Absolute Differences" operation<o:p></o:p></span></p>

</div>

</div>

<p class="MsoNormal"><o:p> </o:p></p>

<div>

<p class="MsoNormal">Hi Shahid,<br>

<br>

No matter how horizontal sums are modelled, LICM will always have great difficulty recognizing that they can be hoisted. I think this is beyond LICM's abilities, which is why I suggested we allow the Loop Vectorizer to create several different idioms, one of which

 has the sum already hoisted. The Loop Vectorizer has all the knowledge needed to determine that this is legal and profitable.<o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">So you mean creating different idioms suited to different targets, such as (1) for X86 and (2) for ARM from your example below?<o:p></o:p></span></p>

<div>

<p class="MsoNormal"><o:p> </o:p></p>

</div>

<div>

<p class="MsoNormal">I don't think you're right there with regard to signedness. The conceptual operation of SAD is this:<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal"><o:p> </o:p></p>

</div>

<div>

<p class="MsoNormal">  1. Extend inputs to the output size (i8 -> i32 in PSAD's case)<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal">  2. Subtract the inputs<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal">  3. abs() the result<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal">  4. (optionally) sum the abs()s.<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal"><o:p> </o:p></p>

</div>

<div>

<p class="MsoNormal">I think steps (2) and (4) are signedness-independent. I think steps (1) and (3) are not, and step (1) is where ARM's SABD differs from UABD.

<span style="color:#1F497D"><o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">I referred to the ARMv8 architecture manual for SABD &amp; UABD, but I could not find any difference in their semantics;<o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">both use unsigned input data. Is that the right document to refer to?<o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"><o:p> </o:p></span></p>

<p class="MsoNormal">X86's PSAD also extends the inputs from i8 to i32, so I don't think PSAD is signedness independent either.<o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">I don’t think this is correct; PSAD operates on 8 unsigned byte operands of the source and destination.

<o:p></o:p></span></p>

</div>
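<div>

<p class="MsoNormal">As a minimal sketch of why the choice made in step (1) matters, the two helpers below compute steps (1) to (3) for a single pair of bytes, once with zero-extension and once with sign-extension. The function names are hypothetical and are not part of LLVM, the patch, or any target ISA; they only illustrate the arithmetic under discussion.</p>

<pre>
```c
/* Hypothetical helpers for illustration only; not LLVM or target APIs. */

/* Step (1) as zero-extension: each byte is a value in 0..255. */
static int abs_diff_zext(unsigned char a, unsigned char b) {
    int d = (int)a - (int)b;      /* step (2): subtract */
    return d >= 0 ? d : -d;       /* step (3): abs */
}

/* Step (1) as sign-extension: reinterpret each byte as -128..127. */
static int sign_extend_byte(unsigned char v) {
    return v >= 128 ? (int)v - 256 : (int)v;
}

static int abs_diff_sext(unsigned char a, unsigned char b) {
    int d = sign_extend_byte(a) - sign_extend_byte(b);  /* step (2) */
    return d >= 0 ? d : -d;                             /* step (3) */
}
```
</pre>

<p class="MsoNormal">For the byte patterns 0x80 and 0x7F the zero-extended form yields 1 and the sign-extended form yields 255, whereas for small values such as 3 and 5 both yield 2. The same bits can therefore give different results depending on how the inputs are extended.</p>

</div>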

<div>

<p class="MsoNormal"><o:p> </o:p></p>

</div>

<div>

<p class="MsoNormal">Cheers,<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal"><o:p> </o:p></p>

</div>

<div>

<p class="MsoNormal">James<o:p></o:p></p>

<div>

<p class="MsoNormal"><o:p> </o:p></p>

</div>

<div>

<p class="MsoNormal"><o:p> </o:p></p>

</div>

</div>

</div>

<p class="MsoNormal"><o:p> </o:p></p>

<div>

<p class="MsoNormal">On Mon, 20 Apr 2015 at 07:08 Shahid, Asghar-ahmad <<a href="mailto:Asghar-ahmad.Shahid@amd.com">Asghar-ahmad.Shahid@amd.com</a>> wrote:<o:p></o:p></p>

<div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;margin-left:5.25pt">

<span style="color:#4F81BD">Hi James,</span><o:p></o:p></p>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;margin-left:5.25pt">

<span style="color:#4F81BD"> </span><o:p></o:p></p>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;margin-left:5.25pt">

<span style="color:#4F81BD">Thanks for your explanation.</span><o:p></o:p></p>

</div>

</div>

<div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><span style="font-size:10.0pt"> </span><o:p></o:p></p>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><span style="font-size:10.0pt">>With regard to custom lowering, I think you have misinterpreted me. What I was saying was that if the intrinsic is defined as "sum( abs( *a++ - *b++

 ) )", non-X86 backends could custom lower it as something like "ABD + ADDV" (absolute difference, sum-of-all-lanes). However, you'd end up with the sum-of-all-lanes unnecessarily being inside the loop! By the time the intrinsic is expanded, it may be difficult

 to determine that the sum can be moved outside the loop.</span><o:p></o:p></p>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><span style="font-size:10.0pt"> </span><o:p></o:p></p>

</div>

</div>

<div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;margin-left:5.25pt">

<span style="color:#4F81BD">If the horizontal sum is an intrinsic, how will LICM be applied to it?</span><o:p></o:p></p>

</div>

</div>

<div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;margin-left:5.25pt">

<span style="color:#4F81BD"> </span><o:p></o:p></p>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><span style="font-size:10.0pt">>Also, you haven't answered my question about signedness that I've mentioned several times.</span><o:p></o:p></p>

</div>

</div>

<div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;margin-left:5.25pt">

<span style="color:#4F81BD">Currently the SAD intrinsic is modeled for unsigned data types. AFAIK, signedness matters</span><o:p></o:p></p>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;margin-left:5.25pt">

<span style="color:#4F81BD">only when an arithmetic operation results in a carry or overflow. We have not seen a use case</span><o:p></o:p></p>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;margin-left:5.25pt">

<span style="color:#4F81BD">for signed data types yet. Even in the case of ARM, “SABD” (if I am correct) does not</span><o:p></o:p></p>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;margin-left:5.25pt">

<span style="color:#4F81BD">set the overflow flag.</span><o:p></o:p></p>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;margin-left:5.25pt">

<span style="color:#4F81BD"> </span><o:p></o:p></p>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;margin-left:5.25pt">

<span style="color:#4F81BD">Regards,</span><o:p></o:p></p>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;margin-left:5.25pt">

<span style="color:#4F81BD">Shahid</span><o:p></o:p></p>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;margin-left:5.25pt">

<span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"> </span><o:p></o:p></p>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"> </span><o:p></o:p></p>

<div style="border:none;border-left:solid blue 1.5pt;padding:0in 0in 0in 4.0pt">

<div>

<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><b><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">From:</span></b><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif""> James Molloy [mailto:<a href="mailto:james@jamesmolloy.co.uk" target="_blank">james@jamesmolloy.co.uk</a>]

<br>

<b>Sent:</b> Thursday, April 16, 2015 9:42 PM<br>

<b>To:</b> Shahid, Asghar-ahmad; <a href="mailto:reviews%2BD9029%2Bpublic%2Beee34c83a2c6f996@reviews.llvm.org" target="_blank">

reviews+D9029+public+eee34c83a2c6f996@reviews.llvm.org</a>; <a href="mailto:hfinkel@anl.gov" target="_blank">

hfinkel@anl.gov</a>; <a href="mailto:spatel@rotateright.com" target="_blank">spatel@rotateright.com</a>;

<a href="mailto:aschwaighofer@apple.com" target="_blank">aschwaighofer@apple.com</a>;

<a href="mailto:ahmed.bougacha@gmail.com" target="_blank">ahmed.bougacha@gmail.com</a></span><o:p></o:p></p>

</div>

</div>

</div>

</div>

</div>

<div>

<div>

<div style="border:none;border-left:solid blue 1.5pt;padding:0in 0in 0in 4.0pt">

<div>

<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif""><br>

<b>Cc:</b> <a href="mailto:llvm-commits@cs.uiuc.edu" target="_blank">llvm-commits@cs.uiuc.edu</a><br>

<b>Subject:</b> Re: [PATCH] [PATCH][CodeGen] Adding "llvm.sad" intrinsic and corresponding ISD::SAD node for "Sum Of Absolute Differences" operation</span><o:p></o:p></p>

</div>

</div>

</div>

</div>

</div>

<div>

<div>

<div style="border:none;border-left:solid blue 1.5pt;padding:0in 0in 0in 4.0pt">

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Hi Asghar-ahmad,<br>

<br>

Thanks for responding. I'll try and explain in more detail what I mean. I agree that we can custom lower things and that we could implement your intrinsic on our architecture and others. That is not in question. What is in question is whether the definition

 of the intrinsic and behaviour as-is would allow *efficient* implementation on multiple target architectures.<o:p></o:p></p>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">To reiterate the examples from earlier, there are seemingly two different approaches for lowering a sum-of-absolute-differences loop.  Assume 'a' and 'b' are the two input arrays,

 as some pointer to vector type.<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">1:<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">int r = 0;<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">for (i = ...) {<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">  r += sum( abs( *a++ - *b++ ) );<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">}<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">// r holds the sum-of-absolute-differences<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">2:<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">vector int r = {0};<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">for (i = ...) {<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">  r += abs( *a++ - *b++ );<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">}<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">// r holds partial sums.<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">int sad = sum(r);<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">// sad holds the sum-of-absolute-differences<o:p></o:p></p>

</div>
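<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">The two loop forms above can be sketched in plain C by modelling a vector as LANES consecutive unsigned bytes. This is only an illustration under that assumption: LANES and the names abs_diff, sad_form1 and sad_form2 are hypothetical and do not appear in the patch.</p>

<pre>
```c
/* Hypothetical model: one "vector" is LANES consecutive unsigned bytes. */
enum { LANES = 4 };

static int abs_diff(unsigned char a, unsigned char b) {
    int d = (int)a - (int)b;
    return d >= 0 ? d : -d;
}

/* Form (1): the horizontal sum happens on every loop iteration. */
static int sad_form1(const unsigned char *a, const unsigned char *b, int nvec) {
    int r = 0;
    for (int i = 0; i != nvec; ++i) {
        int s = 0;                       /* sum( abs( *a - *b ) ) */
        for (int l = 0; l != LANES; ++l)
            s += abs_diff(a[i * LANES + l], b[i * LANES + l]);
        r += s;
    }
    return r;
}

/* Form (2): per-lane partial sums, one horizontal sum after the loop. */
static int sad_form2(const unsigned char *a, const unsigned char *b, int nvec) {
    int r[LANES] = {0};                  /* vector accumulator */
    for (int i = 0; i != nvec; ++i)
        for (int l = 0; l != LANES; ++l)
            r[l] += abs_diff(a[i * LANES + l], b[i * LANES + l]);
    int sad = 0;                         /* sum(r), done once */
    for (int l = 0; l != LANES; ++l)
        sad += r[l];
    return sad;
}
```
</pre>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Both functions return the same value for the same inputs; they differ only in where the cross-lane summation is performed, which is the point at issue for targets where that step is expensive.</p>

</div>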

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">The most efficient form of lowering for X86 may be (1), where a PSAD instruction can <span style="font-size:10.0pt">be used (although for non-i8 types perhaps not?). For

 ARM, AArch64 and, according to an appnote I found [0], Altivec (I couldn't find anything about MIPS), (2) is going to be better.</span><o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">So the goal as I see it is to define these intrinsics and IR idioms such that both forms can be <span style="font-size:10.0pt">generated depending on the target (and/or datatype

 - you don't have PSAD for floating point types, so if someone does a non-int SAD loop the most efficient form for you would be (2)).</span><o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>

</div>

<div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><span style="font-size:10.0pt">With regard to custom lowering, I think you have misinterpreted me. What I was saying was that if the intrinsic is defined as "sum( abs( *a++ - *b++

 ) )", non-X86 backends could custom lower it as something like "ABD + ADDV" (absolute difference, sum-of-all-lanes). However, you'd end up with the sum-of-all-lanes unnecessarily being inside the loop! By the time the intrinsic is expanded, it may be difficult

 to determine that the sum can be moved outside the loop.</span><o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><span style="font-size:10.0pt">Conversely, if we defined the intrinsic as "abs( *a++ - *b++ )", we could still easily generate loop type (1) by adding a sum() around it. As it is

 easier to match a pattern than split a pattern apart and move it around (ISel is made for pattern matching!) this is the implementation I am suggesting.</span><o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><span style="font-size:10.0pt">Yes, you're right, this means the name "SAD" for the node may be a misnomer. What I've asked for is the splitting apart of an opaque intrinsic into

 a smaller opaque intrinsic and generic support IR, which is something we do try and do where possible elsewhere in the compiler. I hope I've explained why the node as you've described it may not be useful for any non-X86 target.</span><o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><span style="font-size:10.0pt">Also, you haven't answered my question about signedness that I've mentioned several times.</span><o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><span style="font-size:10.0pt">Cheers,</span><o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><span style="font-size:10.0pt"> </span><o:p></o:p></p>

</div>

</div>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><span style="font-size:10.0pt">[0] <a href="http://www.freescale.com/webapp/sps/download/license.jsp?colCode=AVEC_SAD" target="_blank">http://www.freescale.com/webapp/sps/download/license.jsp?colCode=AVEC_SAD</a></span><o:p></o:p></p>

</div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>

<div>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">On Thu, 16 Apr 2015 at 12:12 Shahid, Asghar-ahmad <<a href="mailto:Asghar-ahmad.Shahid@amd.com" target="_blank">Asghar-ahmad.Shahid@amd.com</a>> wrote:<o:p></o:p></p>

<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><br>

<br>

> -----Original Message-----<br>

> From: James Molloy [mailto:<a href="mailto:james.molloy@arm.com" target="_blank">james.molloy@arm.com</a>]<br>

> Sent: Thursday, April 16, 2015 3:00 PM<br>

> To: Shahid, Asghar-ahmad; <a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>;

<a href="mailto:spatel@rotateright.com" target="_blank">spatel@rotateright.com</a>;<br>

> <a href="mailto:aschwaighofer@apple.com" target="_blank">aschwaighofer@apple.com</a>;

<a href="mailto:ahmed.bougacha@gmail.com" target="_blank">ahmed.bougacha@gmail.com</a>;<br>

> <a href="mailto:james.molloy@arm.com" target="_blank">james.molloy@arm.com</a><br>

> Cc: <a href="mailto:llvm-commits@cs.uiuc.edu" target="_blank">llvm-commits@cs.uiuc.edu</a><br>

> Subject: Re: [PATCH] [PATCH][CodeGen] Adding "llvm.sad" intrinsic and<br>

> corresponding ISD::SAD node for "Sum Of Absolute Differences" operation<br>

><br>

> Hi,<br>

><br>

> Thanks for continuing to work on this!<br>

><br>

> From a high level, I still stand by the comments I made on your<br>

> LoopVectorizer review that this is a bit too Intel-specific. ARM and AArch64's<br>

> SAD instructions do not behave as you model here. You model SAD as a per-<br>

> element subtract, abs, then a cross-lane (horizontal) summation. The VABD<br>

> instruction in ARM does not do the last step - that is assumed to be done in a<br>

> separate instruction.<br>

While modeling SAD we wanted to be more generic than being tilted towards any target.<br>

That is why we abstracted the whole semantic (sub, abs, sum) into this intrinsic. Even the<br>

result we have modeled as a scalar value.<br>

<br>

><br>

> Now, we can model your intrinsic semantics in ARM using:<br>

><br>

>   val2 = ABD val0, val1<br>

>   val3 = ADDV val2<br>

This can be custom lowered for ARM. At one point X86 also needs this kind of custom lowering for the v16i8 operand type,<br>

which I wrongly put into ExpandSAD() and which needs to be moved back into the x86 lowering.<br>

<br>

><br>

> However that isn't very efficient - the sum-across-lanes is expensive. Ideally,<br>

> in a loop, we'd do something like this:<br>

><br>

>   loop:<br>

>     reduction1 = ABD array0[i], array1[i]<br>

>     br loop<br>

>   val = ADDV reduction1<br>

><br>

> So we'd only do the horizontal step once, not on every iteration. That is why I<br>

> think having the intrinsic represent just the per-element operations, not the<br>

> horizontal part, would be the lowest common denominator between our<br>

> two targets. It is easier to pattern match two nodes than to split one node<br>

> apart and LICM part of it.<br>

><br>

> Similarly, the above construction, should the loop vectorizer emit it, is not very<br>

> good for you. You would never be able to match your PSAD instruction.<br>

We have already matched "psad" coming from both LV &amp; SLP. So I think this<br>

generalization is OK.<br>

<br>

> So my suggestion is this:<br>

><br>

> - The intrinsic only models the per-element operations, not the horizontal<br>

> part.<br>

This models "absolute difference" rather than "sum of absolute differences".<br>

<br>

> - The X86 backend pattern matches the sad intrinsic plus a horizontal<br>

> reduction to its PSAD node.<br>

> - The Loop Vectorizer has code to emit *either* the per-element-only or the<br>

> per-element-plus-horizontal SAD as part of its reduction logic.<br>

> - The TargetTransformInfo hook is extended to return an `enum SADType`, in<br>

> much the same way that `popcnt` type is done currently.<br>

><br>

> How does this sound?<br>

With the current modeling, I don't see why custom lowering would not be a good solution for any<br>

target-related issues.<br>

<br>

><br>

> Cheers,<br>

><br>

> James<br>

><br>

><br>

> REPOSITORY<br>

>   rL LLVM<br>

><br>

> <a href="http://reviews.llvm.org/D9029" target="_blank">http://reviews.llvm.org/D9029</a><br>

><br>


<br>

<br>

<o:p></o:p></p>

</div>

</div>

</div>

</div>

</div>

</div>

</div>

</div>

</body>

</html>