<div dir="ltr">Hi Asghar-ahmad,<br><br>Thanks for responding. I'll try and explain in more detail what I mean. I agree that we can custom lower things and that we could implement your intrinsic on our and other architecture. That is not in question. What is in question is whether the definition of the intrinsic and behaviour as-is would allow *efficient* implementation on multiple target architectures.<div><br></div><div>To reiterate the examples from earlier, there are seemingly two different approaches for lowering a sum-of-absolute-differences loop.  Assume 'a' and 'b' are the two input arrays, as some p\</div><div>ointer to vector type.</div><div><br></div><div>1:</div><div>int r = 0;</div><div>for (i = ...) {</div><div>  r += sum( abs( *a++ - *b++ ) );</div><div>}</div><div>// r holds the sum-of-absolute-differences</div><div><br></div><div>2:</div><div>vector int r = {0};</div><div>for (i = ...) {</div><div>  r += abs( *a++ - *b++ );</div><div>}</div><div>// r holds partial sums.</div><div>int sad = sum(r);</div><div>// sad holds the sum-of-absolute-differences</div><div><br></div><div>The most efficient form of lowering for X86 may possibly be (1), where a PSAD instruction can <span style="line-height:1.5;font-size:13.1999998092651px">be used (although for non-i8 types perhaps not?). For ARM, AArch64 and according to an appnote</span><span style="line-height:1.5;font-size:13.1999998092651px"> I found [0] Altivec (I couldn't find anything about MIPS), (2) is going to be better.</span></div><div><br></div><div>So the goal as I see it is to define these intrinsics and IR idioms such that both forms can b<span style="line-height:1.5;font-size:13.1999998092651px">e generated depending on the target (and/or datatype - you don't have PSAD for floating point </span><span style="line-height:1.5;font-size:13.1999998092651px">types, so if someone does a non-int SAD loop the most efficient form for you would be (2)).</span></div><div><span style="line-height:1.5;font-size:13.1999998092651px"><br></span></div><div><div><span style="font-size:13.1999998092651px;line-height:22px">With regards custom lowering, I think you have misinterpreted me. What I was saying was that if the intrinsic is defined as "sum( abs( *a++ - *b++ ) )", non-X86 backends could custom lower it</span><span style="font-size:13.1999998092651px;line-height:22px"> as something like "ABD + ADDV" (absolute difference, sum-of-all-lanes). However, you'd end up with the sum-of-all-lanes unnecessarily being inside the loop! By the time the intrinsic is expa</span><span style="font-size:13.1999998092651px;line-height:22px">nded, it may be difficult to determine that the sum can be moved outside the loop.</span></div><div><span style="font-size:13.1999998092651px;line-height:22px"><br></span></div><div><span style="font-size:13.1999998092651px;line-height:22px">Conversely, if we defined the intrinsic as "abs( *a++ - *b--)", we could still easily generate loop type (1) by adding a sum() around it. As it is easier to match a pattern than split a patte</span><span style="font-size:13.1999998092651px;line-height:22px">rn apart and move it around (ISel is made for pattern matching!) this is the implementation I am suggesting.</span></div><div><span style="font-size:13.1999998092651px;line-height:22px"><br></span></div><div><span style="font-size:13.1999998092651px;line-height:22px">Yes, you're right, this means the name "SAD" for the node may be a misnomer. What I've asked for is the splitting apart of an opaque intrinsic into a smaller opaque intrinsic and generic supp</span><span style="font-size:13.1999998092651px;line-height:22px">ort IR, which is something we do try and do where possible elsewhere in the compiler. I hope I've explained why the node as you've described it may not be useful for any non-X86 target.</span></div><div><span style="font-size:13.1999998092651px;line-height:22px"><br></span></div><div><span style="font-size:13.1999998092651px;line-height:22px">Also, you haven't answered my question about signedness that I've mentioned several times.</span></div><div><span style="font-size:13.1999998092651px;line-height:22px"><br></span></div><div><span style="font-size:13.1999998092651px;line-height:22px">Cheers,</span></div><div style="font-size:13.1999998092651px;line-height:1.5"><br></div></div><div style="font-size:13.1999998092651px;line-height:1.5">[0] <a href="http://www.freescale.com/webapp/sps/download/license.jsp?colCode=AVEC_SAD">http://www.freescale.com/webapp/sps/download/license.jsp?colCode=AVEC_SAD</a></div><br><div class="gmail_quote">On Thu, 16 Apr 2015 at 12:12 Shahid, Asghar-ahmad <<a href="mailto:Asghar-ahmad.Shahid@amd.com" target="_blank">Asghar-ahmad.Shahid@amd.com</a>> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>

<br>

> -----Original Message-----<br>

> From: James Molloy [mailto:<a href="mailto:james.molloy@arm.com" target="_blank">james.molloy@arm.com</a>]<br>

> Sent: Thursday, April 16, 2015 3:00 PM<br>

> To: Shahid, Asghar-ahmad; <a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>; <a href="mailto:spatel@rotateright.com" target="_blank">spatel@rotateright.com</a>;<br>

> <a href="mailto:aschwaighofer@apple.com" target="_blank">aschwaighofer@apple.com</a>; <a href="mailto:ahmed.bougacha@gmail.com" target="_blank">ahmed.bougacha@gmail.com</a>;<br>

> <a href="mailto:james.molloy@arm.com" target="_blank">james.molloy@arm.com</a><br>

> Cc: <a href="mailto:llvm-commits@cs.uiuc.edu" target="_blank">llvm-commits@cs.uiuc.edu</a><br>

> Subject: Re: [PATCH] [PATCH][CodeGen] Adding "llvm.sad" intrinsic and<br>

> corresponding ISD::SAD node for "Sum Of Absolute Differences" operation<br>

><br>

> Hi,<br>

><br>

> Thanks for continuing to work on this!<br>

><br>

> From a high level, I still stand by the comments I made on your<br>

> LoopVectorizer review that this is a bit too Intel-specific. ARM and AArch64's<br>

> SAD instructions do not behave as you model here. You model SAD as a per-<br>

> element subtract, abs, then a cross-lane (horizontal) summation. The VABD<br>

> instruction in ARM does not do the last step - that is assumed to be done in a<br>

> separate instruction.<br>

While modeling SAD we wanted to be more generic than being tilted towards any target.<br>

That is why we abstracted the whole semantic (sub, abs, sum) into this intrinsic. Even the<br>

Result we have modeled to be an scalar value.<br>

<br>

><br>

> Now, we can model your intrinsic semantics in ARM using:<br>

><br>

>   val2 = ABD val0, val1<br>

>   val3 = ADDV val2<br>

This can be custom lowered for ARM. At one point X86 also need this kind of custom lowering for v16i8 operand type.<br>

Which I wrongly put into ExpandSAD() and need to be brought back to x86 lowering.<br>

<br>

><br>

> However that isn't very efficient - the sum-across-lanes is expensive. Ideally,<br>

> in a loop, we'd do something like this:<br>

><br>

>   loop:<br>

>     reduction1 = ABD array0[i], array1[i]<br>

>     br loop<br>

>   val = ADDV reduction1<br>

><br>

> So we'd only do the horizontal step once, not on every iteration. That is why I<br>

> think having the intrinsic represent just the per-element operations, not the<br>

> horizontal part, would be the lowest common denominator between our<br>

> two targets. It is easier to pattern match two nodes than to split one node<br>

> apart and LICM part of it.<br>

><br>

> Similarly, the above construction should the loop vectorizer emit it is not very<br>

> good for you. You would never be able to match your PSAD instruction.<br>

We have already matched "psad" both coming from LV & SLP. So I am think this<br>

generalization is ok.<br>

<br>

So<br>

> my suggestion is this:<br>

><br>

> - The intrinsic only models the per-element operations, not the horizontal<br>

> part.<br>

This is more of modeling "absolute diff" than "sum of absolute diff".<br>

<br>

> - The X86 backend pattern matches the sad intrinsic plus a horizontal<br>

> reduction to its PSAD node.<br>

> - The Loop Vectorizer has code to emit *either* the per-element-only or the<br>

> per-element-plus-horizontal SAD as part of its reduction logic.<br>

> - The TargetTransformInfo hook is extended to return an `enum SADType`, in<br>

> much the same way that `popcnt` type is done currently.<br>

><br>

> How does this sound?<br>

With the current modeling I don't see why custom lowering will not be a good solution for any target<br>

Related issues.<br>

<br>

><br>

> Cheers,<br>

><br>

> James<br>

><br>

><br>

> REPOSITORY<br>

>   rL LLVM<br>

><br>

> <a href="http://reviews.llvm.org/D9029" target="_blank">http://reviews.llvm.org/D9029</a><br>

><br>

> EMAIL PREFERENCES<br>

>   <a href="http://reviews.llvm.org/settings/panel/emailpreferences/" target="_blank">http://reviews.llvm.org/settings/panel/emailpreferences/</a><br>

><br>

<br>

<br>

_______________________________________________<br>

llvm-commits mailing list<br>

<a href="mailto:llvm-commits@cs.uiuc.edu" target="_blank">llvm-commits@cs.uiuc.edu</a><br>

<a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits</a><br>

</blockquote></div></div>