[PATCH] [PATCH][CodeGen] Adding "llvm.sad" intrinsic and corresponding ISD::SAD node for "Sum Of Absolute Differences" operation

Shahid, Asghar-ahmad Asghar-ahmad.Shahid at amd.com
Mon Apr 20 09:53:44 PDT 2015


>  Are you sure about that? I thought your psad had this signature:
>  i32 @psad(i8* %a, i8* %b)
Oops, it was a typo, pls read it as “both the operand source1 and source2”. Below is the link for reference.
https://www.yumpu.com/en/document/view/7897009/ia-32-intelr-architecture-software-developers-manual-volume-2/643

Regards,
Shahid


From: James Molloy [mailto:james at jamesmolloy.co.uk]
Sent: Monday, April 20, 2015 10:05 PM
To: Shahid, Asghar-ahmad; reviews+D9029+public+eee34c83a2c6f996 at reviews.llvm.org; hfinkel at anl.gov; spatel at rotateright.com; aschwaighofer at apple.com; ahmed.bougacha at gmail.com
Cc: llvm-commits at cs.uiuc.edu
Subject: Re: [PATCH] [PATCH][CodeGen] Adding "llvm.sad" intrinsic and corresponding ISD::SAD node for "Sum Of Absolute Differences" operation

Hi,
So you mean to create different idioms suited to different targets such as the (1) for X86, (2) for ARM from your example below.
Yes; although (2) will apply to more than just ARM - there may be examples where such an idiom is profitable for X86 too, such as when you have i16 datatypes.

>I had referred the ARMv8 architecture manual for SABD & UABD, but I could not find any difference in their semantics,
> both are using the unsigned input data. Is it the right doc to refer?
That is the correct document. There's a knack to reading it: both SABD and UABD have a shared decode and semantic, but if you notice the "unsigned" pseudo-variable is set from the "U" bit of the instruction, which is 0 for UABD and 1 for SABD. So they do treat their input as differently signed.

> I don’t think this is correct, PSAD operate on 8 unsigned byte operands of source and destination.
Are you sure about that? I thought your psad had this signature:
    i32 @psad(i8* %a, i8* %b)

Are you saying instead it has:
    i8 @psad(i8* %a, i8* %b)

Because that contradicts what you've said earlier in these threads and also intuition - what is the point in such an instruction as it can overflow during the summation step?

Cheers,

James

On Mon, 20 Apr 2015 at 17:02 Shahid, Asghar-ahmad <Asghar-ahmad.Shahid at amd.com<mailto:Asghar-ahmad.Shahid at amd.com>> wrote:
Hi James,

From: James Molloy [mailto:james at jamesmolloy.co.uk<mailto:james at jamesmolloy.co.uk>]
Sent: Monday, April 20, 2015 2:03 PM

To: Shahid, Asghar-ahmad; reviews+D9029+public+eee34c83a2c6f996 at reviews.llvm.org<mailto:reviews%2BD9029%2Bpublic%2Beee34c83a2c6f996 at reviews.llvm.org>; hfinkel at anl.gov<mailto:hfinkel at anl.gov>; spatel at rotateright.com<mailto:spatel at rotateright.com>; aschwaighofer at apple.com<mailto:aschwaighofer at apple.com>; ahmed.bougacha at gmail.com<mailto:ahmed.bougacha at gmail.com>
Cc: llvm-commits at cs.uiuc.edu<mailto:llvm-commits at cs.uiuc.edu>
Subject: Re: [PATCH] [PATCH][CodeGen] Adding "llvm.sad" intrinsic and corresponding ISD::SAD node for "Sum Of Absolute Differences" operation

Hi Shahid,


No matter how horizontal sums are modelled LICM will always have great difficulty recognizing that it can be hoisted. I think this is beyond LICM's abilities, which is why I suggested we allow the Loop Vectorizer to create several different idioms, one of which has the sum already hoisted. The Loop Vectorizer has all the knowledge to know that this is legal and profitable to do.
So you mean to create different idioms suited to different targets such as the (1) for X86, (2) for ARM from your example below.

I don't think you're right there with regards signedness. The conceptual operation of SAD is this:

  1. Extend inputs to the output size (i8 -> i32 in PSAD's case)
  2. Subtract the inputs
  3. abs() the result
  4. (optionally) sum the abs()s.

I think steps (2) and (4) are signedness-independent. I think steps (1) and (3) are not, and step (1) is where ARM's SABD differs from UABD.
I had referred the ARMv8 architecture manual for SABD & UABD, but I could not find any difference in their semantics,
both are using the unsigned input data. Is it the right doc to refer?

X86's PSAD also extends the inputs from i8 to i32, so I don't think PSAD is signedness independent either.
I don’t think this is correct, PSAD operate on 8 unsigned byte operands of source and destination.

Cheers,

James



On Mon, 20 Apr 2015 at 07:08 Shahid, Asghar-ahmad <Asghar-ahmad.Shahid at amd.com<mailto:Asghar-ahmad.Shahid at amd.com>> wrote:
Hi James,

Thanks for your explanation.

>With regards custom lowering, I think you have misinterpreted me. What I was saying was that if the intrinsic is defined as "sum( abs( *a++ - *b++ ) )", non-X86 backends could custom lower it as something like "ABD + ADDV" (absolute >difference, sum-of-all-lanes). However, you'd end up with the sum-of-all-lanes unnecessarily being inside the loop! By the time the intrinsic is expanded, it may be difficult to determine that the sum can be moved outside the loop.

If the horizontal sum is an intrinsic, how the LICM will happen on it?

>Also, you haven't answered my question about signedness that I've mentioned several times.
Currently the SAD intrinsic is modeled for unsigned data types. AFAIK, signedness matters
only when an arithmetic operation results in a carry or overflow. We have not seen a use case
for signed data types yet. Even in case of ARM, “SABD” (if I am correct) does not
set overflow flag.

Regards,
Shahid


From: James Molloy [mailto:james at jamesmolloy.co.uk<mailto:james at jamesmolloy.co.uk>]
Sent: Thursday, April 16, 2015 9:42 PM
To: Shahid, Asghar-ahmad; reviews+D9029+public+eee34c83a2c6f996 at reviews.llvm.org<mailto:reviews%2BD9029%2Bpublic%2Beee34c83a2c6f996 at reviews.llvm.org>; hfinkel at anl.gov<mailto:hfinkel at anl.gov>; spatel at rotateright.com<mailto:spatel at rotateright.com>; aschwaighofer at apple.com<mailto:aschwaighofer at apple.com>; ahmed.bougacha at gmail.com<mailto:ahmed.bougacha at gmail.com>

Cc: llvm-commits at cs.uiuc.edu<mailto:llvm-commits at cs.uiuc.edu>
Subject: Re: [PATCH] [PATCH][CodeGen] Adding "llvm.sad" intrinsic and corresponding ISD::SAD node for "Sum Of Absolute Differences" operation

Hi Asghar-ahmad,

Thanks for responding. I'll try and explain in more detail what I mean. I agree that we can custom lower things and that we could implement your intrinsic on our and other architecture. That is not in question. What is in question is whether the definition of the intrinsic and behaviour as-is would allow *efficient* implementation on multiple target architectures.

To reiterate the examples from earlier, there are seemingly two different approaches for lowering a sum-of-absolute-differences loop.  Assume 'a' and 'b' are the two input arrays, as some p\
ointer to vector type.

1:
int r = 0;
for (i = ...) {
  r += sum( abs( *a++ - *b++ ) );
}
// r holds the sum-of-absolute-differences

2:
vector int r = {0};
for (i = ...) {
  r += abs( *a++ - *b++ );
}
// r holds partial sums.
int sad = sum(r);
// sad holds the sum-of-absolute-differences

The most efficient form of lowering for X86 may possibly be (1), where a PSAD instruction can be used (although for non-i8 types perhaps not?). For ARM, AArch64 and according to an appnote I found [0] Altivec (I couldn't find anything about MIPS), (2) is going to be better.

So the goal as I see it is to define these intrinsics and IR idioms such that both forms can be generated depending on the target (and/or datatype - you don't have PSAD for floating point types, so if someone does a non-int SAD loop the most efficient form for you would be (2)).

With regards custom lowering, I think you have misinterpreted me. What I was saying was that if the intrinsic is defined as "sum( abs( *a++ - *b++ ) )", non-X86 backends could custom lower it as something like "ABD + ADDV" (absolute difference, sum-of-all-lanes). However, you'd end up with the sum-of-all-lanes unnecessarily being inside the loop! By the time the intrinsic is expanded, it may be difficult to determine that the sum can be moved outside the loop.

Conversely, if we defined the intrinsic as "abs( *a++ - *b--)", we could still easily generate loop type (1) by adding a sum() around it. As it is easier to match a pattern than split a pattern apart and move it around (ISel is made for pattern matching!) this is the implementation I am suggesting.

Yes, you're right, this means the name "SAD" for the node may be a misnomer. What I've asked for is the splitting apart of an opaque intrinsic into a smaller opaque intrinsic and generic support IR, which is something we do try and do where possible elsewhere in the compiler. I hope I've explained why the node as you've described it may not be useful for any non-X86 target.

Also, you haven't answered my question about signedness that I've mentioned several times.

Cheers,

[0] http://www.freescale.com/webapp/sps/download/license.jsp?colCode=AVEC_SAD

On Thu, 16 Apr 2015 at 12:12 Shahid, Asghar-ahmad <Asghar-ahmad.Shahid at amd.com<mailto:Asghar-ahmad.Shahid at amd.com>> wrote:


> -----Original Message-----
> From: James Molloy [mailto:james.molloy at arm.com<mailto:james.molloy at arm.com>]
> Sent: Thursday, April 16, 2015 3:00 PM
> To: Shahid, Asghar-ahmad; hfinkel at anl.gov<mailto:hfinkel at anl.gov>; spatel at rotateright.com<mailto:spatel at rotateright.com>;
> aschwaighofer at apple.com<mailto:aschwaighofer at apple.com>; ahmed.bougacha at gmail.com<mailto:ahmed.bougacha at gmail.com>;
> james.molloy at arm.com<mailto:james.molloy at arm.com>
> Cc: llvm-commits at cs.uiuc.edu<mailto:llvm-commits at cs.uiuc.edu>
> Subject: Re: [PATCH] [PATCH][CodeGen] Adding "llvm.sad" intrinsic and
> corresponding ISD::SAD node for "Sum Of Absolute Differences" operation
>
> Hi,
>
> Thanks for continuing to work on this!
>
> From a high level, I still stand by the comments I made on your
> LoopVectorizer review that this is a bit too Intel-specific. ARM and AArch64's
> SAD instructions do not behave as you model here. You model SAD as a per-
> element subtract, abs, then a cross-lane (horizontal) summation. The VABD
> instruction in ARM does not do the last step - that is assumed to be done in a
> separate instruction.
While modeling SAD we wanted to be more generic than being tilted towards any target.
That is why we abstracted the whole semantic (sub, abs, sum) into this intrinsic. Even the
Result we have modeled to be an scalar value.

>
> Now, we can model your intrinsic semantics in ARM using:
>
>   val2 = ABD val0, val1
>   val3 = ADDV val2
This can be custom lowered for ARM. At one point X86 also need this kind of custom lowering for v16i8 operand type.
Which I wrongly put into ExpandSAD() and need to be brought back to x86 lowering.

>
> However that isn't very efficient - the sum-across-lanes is expensive. Ideally,
> in a loop, we'd do something like this:
>
>   loop:
>     reduction1 = ABD array0[i], array1[i]
>     br loop
>   val = ADDV reduction1
>
> So we'd only do the horizontal step once, not on every iteration. That is why I
> think having the intrinsic represent just the per-element operations, not the
> horizontal part, would be the lowest common denominator between our
> two targets. It is easier to pattern match two nodes than to split one node
> apart and LICM part of it.
>
> Similarly, the above construction should the loop vectorizer emit it is not very
> good for you. You would never be able to match your PSAD instruction.
We have already matched "psad" both coming from LV & SLP. So I am think this
generalization is ok.

So
> my suggestion is this:
>
> - The intrinsic only models the per-element operations, not the horizontal
> part.
This is more of modeling "absolute diff" than "sum of absolute diff".

> - The X86 backend pattern matches the sad intrinsic plus a horizontal
> reduction to its PSAD node.
> - The Loop Vectorizer has code to emit *either* the per-element-only or the
> per-element-plus-horizontal SAD as part of its reduction logic.
> - The TargetTransformInfo hook is extended to return an `enum SADType`, in
> much the same way that `popcnt` type is done currently.
>
> How does this sound?
With the current modeling I don't see why custom lowering will not be a good solution for any target
Related issues.

>
> Cheers,
>
> James
>
>
> REPOSITORY
>   rL LLVM
>
> http://reviews.llvm.org/D9029
>
> EMAIL PREFERENCES
>   http://reviews.llvm.org/settings/panel/emailpreferences/
>


_______________________________________________
llvm-commits mailing list
llvm-commits at cs.uiuc.edu<mailto:llvm-commits at cs.uiuc.edu>
http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150420/bf5b2f90/attachment.html>


More information about the llvm-commits mailing list