[PATCH] [PATCH][CodeGen] Adding "llvm.sad" intrinsic and corresponding ISD::SAD node for "Sum Of Absolute Differences" operation

James Molloy james at jamesmolloy.co.uk
Mon Apr 20 09:34:36 PDT 2015


Hi,

> So you mean to create different idioms suited to different targets, such as
> (1) for X86 and (2) for ARM from your example below.
Yes; although (2) will apply to more than just ARM - there may be examples
where such an idiom is profitable for X86 too, such as when you have i16
datatypes.
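
For concreteness, here is a rough C sketch of the kind of i16 loop I have in
mind, written in form (2) from the example below (all names are illustrative,
not from the patch): the per-"lane" partial sums stay in the loop and the
cross-lane reduction happens exactly once at the end.

  /* Sketch only: the 8-entry 'partial' array stands in for a vector
     register of accumulators. */
  #include <stdint.h>
  #include <stddef.h>

  uint32_t sad_i16(const uint16_t *a, const uint16_t *b, size_t n) {
    uint32_t partial[8] = {0};          /* one accumulator per "lane" */
    for (size_t i = 0; i < n; ++i) {
      int32_t d = (int32_t)a[i] - (int32_t)b[i];
      partial[i % 8] += (uint32_t)(d < 0 ? -d : d);
    }
    uint32_t sum = 0;                   /* the only cross-lane step */
    for (int lane = 0; lane < 8; ++lane)
      sum += partial[lane];
    return sum;
  }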

> I had referred to the ARMv8 architecture manual for SABD & UABD, but I could
> not find any difference in their semantics;
>
> both use the unsigned input data. Is it the right doc to refer to?
That is the correct document. There's a knack to reading it: both SABD and
UABD share a decode and semantic, but if you look closely, the "unsigned"
pseudo-variable is set from the "U" bit of the instruction, which is 1 for
UABD and 0 for SABD. So they do treat their inputs with different signedness.
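
In C terms, the shared per-element semantic is roughly the following (my own
sketch, not the manual's pseudocode; the helper name and flag are made up).
The flag plays the role of the U bit and selects zero- versus sign-extension:

  #include <stdint.h>

  /* Rough C rendering of the shared ABD element semantic. 'is_unsigned'
     stands in for the U bit: 1 behaves like UABD (zero-extend), 0 like
     SABD (sign-extend). */
  uint8_t abd_byte(uint8_t a, uint8_t b, int is_unsigned) {
    int32_t x = is_unsigned ? (int32_t)a : (int32_t)(int8_t)a;
    int32_t y = is_unsigned ? (int32_t)b : (int32_t)(int8_t)b;
    int32_t d = x - y;
    /* e.g. a = 0x80, b = 0x01: UABD gives |128 - 1| = 127, while
       SABD gives |-128 - 1| = 129 - same bits, different result. */
    return (uint8_t)(d < 0 ? -d : d);
  }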

> I don’t think this is correct; PSAD operates on 8 unsigned byte operands
> of source and destination.
Are you sure about that? I thought your psad had this signature:
    i32 @psad(i8* %a, i8* %b)

Are you saying instead it has:
    i8 @psad(i8* %a, i8* %b)

Because that contradicts what you've said earlier in these threads and also
intuition - what would be the point of such an instruction if it could
overflow during the summation step?
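
Just to spell the overflow concern out with a worst case (a throwaway example,
not anything from the patch): each |a[i] - b[i]| can be as large as 255, so
even an 8-byte SAD can reach 8 * 255 = 2040, which does not fit in an i8.

  #include <stdint.h>
  #include <assert.h>

  int main(void) {
    uint8_t a[8] = {255, 255, 255, 255, 255, 255, 255, 255};
    uint8_t b[8] = {0};
    uint32_t sum = 0;
    for (int i = 0; i < 8; ++i)
      sum += (uint32_t)(a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]);
    assert(sum == 2040);  /* needs at least 11 bits, let alone 8 */
    return 0;
  }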

Cheers,

James

On Mon, 20 Apr 2015 at 17:02 Shahid, Asghar-ahmad <
Asghar-ahmad.Shahid at amd.com> wrote:

>  Hi James,
>
>
>
> *From:* James Molloy [mailto:james at jamesmolloy.co.uk]
> *Sent:* Monday, April 20, 2015 2:03 PM
>
>
> *To:* Shahid, Asghar-ahmad;
> reviews+D9029+public+eee34c83a2c6f996 at reviews.llvm.org; hfinkel at anl.gov;
> spatel at rotateright.com; aschwaighofer at apple.com; ahmed.bougacha at gmail.com
> *Cc:* llvm-commits at cs.uiuc.edu
> *Subject:* Re: [PATCH] [PATCH][CodeGen] Adding "llvm.sad" intrinsic and
> corresponding ISD::SAD node for "Sum Of Absolute Differences" operation
>
>
>
> Hi Shahid,
>
>
>
> No matter how horizontal sums are modelled, LICM will always have great
> difficulty recognizing that they can be hoisted. I think this is beyond
> LICM's abilities, which is why I suggested we allow the Loop Vectorizer to
> create several different idioms, one of which has the sum already hoisted.
> The Loop Vectorizer has all the knowledge to know that this is legal and
> profitable to do.
>
> So you mean to create different idioms suited to different targets, such as
> (1) for X86 and (2) for ARM from your example below.
>
>
>
> I don't think you're right there with regard to signedness. The conceptual
> operation of SAD is this:
>
>
>
>   1. Extend inputs to the output size (i8 -> i32 in PSAD's case)
>
>   2. Subtract the inputs
>
>   3. abs() the result
>
>   4. (optionally) sum the abs()s.
>
>
>
> I think steps (2) and (4) are signedness-independent. I think steps (1)
> and (3) are not, and step (1) is where ARM's SABD differs from UABD.
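>
> Spelled out as a rough C sketch (all names mine, purely illustrative), with
> step (1) being where the zero- versus sign-extension choice shows up:
>
>   #include <stdint.h>
>   #include <stddef.h>
>
>   uint32_t sad(const uint8_t *a, const uint8_t *b, size_t n,
>                int is_unsigned) {
>     uint32_t acc = 0;
>     for (size_t i = 0; i < n; ++i) {
>       /* 1. extend inputs to the output size (zero- or sign-extend) */
>       int32_t x = is_unsigned ? (int32_t)a[i] : (int32_t)(int8_t)a[i];
>       int32_t y = is_unsigned ? (int32_t)b[i] : (int32_t)(int8_t)b[i];
>       /* 2. subtract */
>       int32_t d = x - y;
>       /* 3. abs(), then 4. (optionally) sum */
>       acc += (uint32_t)(d < 0 ? -d : d);
>     }
>     return acc;
>   }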
>
> I had referred to the ARMv8 architecture manual for SABD & UABD, but I could
> not find any difference in their semantics;
>
> both use the unsigned input data. Is it the right doc to refer to?
>
>
>
> X86's PSAD also extends the inputs from i8 to i32, so I don't think PSAD
> is signedness independent either.
>
> I don’t think this is correct; PSAD operates on 8 unsigned byte operands of
> source and destination.
>
>
>
> Cheers,
>
>
>
> James
>
>
>
>
>
>
>
> On Mon, 20 Apr 2015 at 07:08 Shahid, Asghar-ahmad <
> Asghar-ahmad.Shahid at amd.com> wrote:
>
> Hi James,
>
>
>
> Thanks for your explanation.
>
>
>
> >With regard to custom lowering, I think you have misinterpreted me. What I
> was saying was that if the intrinsic is defined as "sum( abs( *a++ - *b++ )
> )", non-X86 backends could custom lower it as something like "ABD + ADDV"
> (absolute difference, sum-of-all-lanes). However, you'd end up with the
> sum-of-all-lanes unnecessarily being inside the loop! By the time the
> intrinsic is expanded, it may be difficult to determine that the sum can be
> moved outside the loop.
>
>
>
> If the horizontal sum is an intrinsic, how will LICM happen on it?
>
>
>
> >Also, you haven't answered my question about signedness that I've
> mentioned several times.
>
> Currently the SAD intrinsic is modeled for unsigned data types. AFAIK,
> signedness matters
>
> only when an arithmetic operation results in a carry or overflow. We have
> not seen a use case
>
> for signed data types yet. Even in the case of ARM, “SABD” (if I am correct)
> does not
>
> set the overflow flag.
>
>
>
> Regards,
>
> Shahid
>
>
>
>
>
> *From:* James Molloy [mailto:james at jamesmolloy.co.uk]
> *Sent:* Thursday, April 16, 2015 9:42 PM
> *To:* Shahid, Asghar-ahmad;
> reviews+D9029+public+eee34c83a2c6f996 at reviews.llvm.org; hfinkel at anl.gov;
> spatel at rotateright.com; aschwaighofer at apple.com; ahmed.bougacha at gmail.com
>
>
> *Cc:* llvm-commits at cs.uiuc.edu
> *Subject:* Re: [PATCH] [PATCH][CodeGen] Adding "llvm.sad" intrinsic and
> corresponding ISD::SAD node for "Sum Of Absolute Differences" operation
>
>
>
> Hi Asghar-ahmad,
>
> Thanks for responding. I'll try and explain in more detail what I mean. I
> agree that we can custom lower things and that we could implement your
> intrinsic on our and other architectures. That is not in question. What is
> in question is whether the definition of the intrinsic and behaviour as-is
> would allow *efficient* implementation on multiple target architectures.
>
>
>
> To reiterate the examples from earlier, there are seemingly two different
> approaches for lowering a sum-of-absolute-differences loop.  Assume 'a' and
> 'b' are the two input arrays, as some pointer to vector type.
>
>
>
> 1:
>
> int r = 0;
>
> for (i = ...) {
>
>   r += sum( abs( *a++ - *b++ ) );
>
> }
>
> // r holds the sum-of-absolute-differences
>
>
>
> 2:
>
> vector int r = {0};
>
> for (i = ...) {
>
>   r += abs( *a++ - *b++ );
>
> }
>
> // r holds partial sums.
>
> int sad = sum(r);
>
> // sad holds the sum-of-absolute-differences
>
>
>
> The most efficient form of lowering for X86 may possibly be (1), where a
> PSAD instruction can be used (although for non-i8 types perhaps not?).
> For ARM, AArch64 and, according to an appnote I found [0], Altivec (I
> couldn't find anything about MIPS), (2) is going to be better.
>
>
>
> So the goal as I see it is to define these intrinsics and IR idioms such
> that both forms can be generated depending on the target (and/or datatype
> - you don't have PSAD for floating point types, so if someone does a
> non-int SAD loop the most efficient form for you would be (2)).
>
>
>
> With regard to custom lowering, I think you have misinterpreted me. What I
> was saying was that if the intrinsic is defined as "sum( abs( *a++ - *b++ )
> )", non-X86 backends could custom lower it as something like "ABD + ADDV"
> (absolute difference, sum-of-all-lanes). However, you'd end up with the
> sum-of-all-lanes unnecessarily being inside the loop! By the time the
> intrinsic is expanded, it may be difficult to determine that the sum can be
> moved outside the loop.
>
>
>
> Conversely, if we defined the intrinsic as "abs( *a++ - *b++ )", we could
> still easily generate loop type (1) by adding a sum() around it. As it is
> easier to match a pattern than to split a pattern apart and move it around
> (ISel is made for pattern matching!), this is the implementation I am
> suggesting.
>
>
>
> Yes, you're right, this means the name "SAD" for the node may be a
> misnomer. What I've asked for is the splitting apart of an opaque intrinsic
> into a smaller opaque intrinsic and generic support IR, which is something
> we do try to do where possible elsewhere in the compiler. I hope I've
> explained why the node as you've described it may not be useful for any
> non-X86 target.
>
>
>
> Also, you haven't answered my question about signedness that I've
> mentioned several times.
>
>
>
> Cheers,
>
>
>
> [0]
> http://www.freescale.com/webapp/sps/download/license.jsp?colCode=AVEC_SAD
>
>
>
> On Thu, 16 Apr 2015 at 12:12 Shahid, Asghar-ahmad <
> Asghar-ahmad.Shahid at amd.com> wrote:
>
>
>
> > -----Original Message-----
> > From: James Molloy [mailto:james.molloy at arm.com]
> > Sent: Thursday, April 16, 2015 3:00 PM
> > To: Shahid, Asghar-ahmad; hfinkel at anl.gov; spatel at rotateright.com;
> > aschwaighofer at apple.com; ahmed.bougacha at gmail.com;
> > james.molloy at arm.com
> > Cc: llvm-commits at cs.uiuc.edu
> > Subject: Re: [PATCH] [PATCH][CodeGen] Adding "llvm.sad" intrinsic and
> > corresponding ISD::SAD node for "Sum Of Absolute Differences" operation
> >
> > Hi,
> >
> > Thanks for continuing to work on this!
> >
> > From a high level, I still stand by the comments I made on your
> > LoopVectorizer review that this is a bit too Intel-specific. ARM and
> AArch64's
> > SAD instructions do not behave as you model here. You model SAD as a per-
> > element subtract, abs, then a cross-lane (horizontal) summation. The VABD
> > instruction in ARM does not do the last step - that is assumed to be
> done in a
> > separate instruction.
> While modeling SAD we wanted to be generic rather than tilted towards
> any target.
> That is why we abstracted the whole semantic (sub, abs, sum) into this
> intrinsic. We have even
> modeled the result as a scalar value.
>
> >
> > Now, we can model your intrinsic semantics in ARM using:
> >
> >   val2 = ABD val0, val1
> >   val3 = ADDV val2
> This can be custom lowered for ARM. At one point X86 also needed this kind
> of custom lowering for the v16i8 operand type,
> which I wrongly put into ExpandSAD() and which needs to be brought back to
> X86 lowering.
>
> >
> > However that isn't very efficient - the sum-across-lanes is expensive.
> Ideally,
> > in a loop, we'd do something like this:
> >
> >   loop:
> >     reduction1 = ABD array0[i], array1[i]
> >     br loop
> >   val = ADDV reduction1
> >
> > So we'd only do the horizontal step once, not on every iteration. That
> is why I
> > think having the intrinsic represent just the per-element operations,
> not the
> > horizontal part, would be the lowest common denominator between our
> > two targets. It is easier to pattern match two nodes than to split one
> node
> > apart and LICM part of it.
> >
> > Similarly, the above construction, should the loop vectorizer emit it, is
> not very
> > good for you. You would never be able to match your PSAD instruction.
> We have already matched "psad" coming from both LV & SLP. So I think
> this
> generalization is ok.
>
> > So my suggestion is this:
> >
> > - The intrinsic only models the per-element operations, not the
> horizontal
> > part.
> This is more a model of "absolute diff" than of "sum of absolute diff".
>
> > - The X86 backend pattern matches the sad intrinsic plus a horizontal
> > reduction to its PSAD node.
> > - The Loop Vectorizer has code to emit *either* the per-element-only or
> the
> > per-element-plus-horizontal SAD as part of its reduction logic.
> > - The TargetTransformInfo hook is extended to return an `enum SADType`,
> in
> > much the same way that `popcnt` type is done currently.
> >
> > How does this sound?
> With the current modeling I don't see why custom lowering will not be a
> good solution for any target-related issues.
>
> >
> > Cheers,
> >
> > James
> >
> >
> > REPOSITORY
> >   rL LLVM
> >
> > http://reviews.llvm.org/D9029
> >
> > EMAIL PREFERENCES
> >   http://reviews.llvm.org/settings/panel/emailpreferences/
> >
>
>
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>

