[PATCH] [PATCH][CodeGen] Adding "llvm.sad" intrinsic and corresponding ISD::SAD node for "Sum Of Absolute Differences" operation

Mon Apr 20 01:32:56 PDT 2015

Hi Shahid,

No matter how horizontal sums are modelled LICM will always have great
difficulty recognizing that it can be hoisted. I think this is beyond
LICM's abilities, which is why I suggested we allow the Loop Vectorizer to
create several different idioms, one of which has the sum already hoisted.
The Loop Vectorizer has all the knowledge to know that this is legal and
profitable to do.

I don't think you're right there with regards signedness. The conceptual
operation of SAD is this:

  1. Extend inputs to the output size (i8 -> i32 in PSAD's case)
  2. Subtract the inputs
  3. abs() the result
  4. (optionally) sum the abs()s.

I think steps (2) and (4) are signedness-independent. I think steps (1) and
(3) are not, and step (1) is where ARM's SABD differs from UABD. X86's PSAD
also extends the inputs from i8 to i32, so I don't think PSAD is signedness
independent either.

Cheers,

James

On Mon, 20 Apr 2015 at 07:08 Shahid, Asghar-ahmad <
Asghar-ahmad.Shahid at amd.com> wrote:

>  Hi James,
>
>
>
> Thanks for your explanation.
>
>
>
> >With regards custom lowering, I think you have misinterpreted me. What I
> was saying was that if the intrinsic is defined as "sum( abs( *a++ - *b++ )
> )", non-X86 backends could custom lower it as something like "ABD + ADDV"
> (absolute >difference, sum-of-all-lanes). However, you'd end up with the
> sum-of-all-lanes unnecessarily being inside the loop! By the time the
> intrinsic is expanded, it may be difficult to determine that the sum can be
> moved outside the loop.
>
>
>
> If the horizontal sum is an intrinsic, how the LICM will happen on it?
>
>
>
> >Also, you haven't answered my question about signedness that I've
> mentioned several times.
>
> Currently the SAD intrinsic is modeled for unsigned data types. AFAIK,
> signedness matters
>
> only when an arithmetic operation results in a carry or overflow. We have
> not seen a use case
>
> for signed data types yet. Even in case of ARM, “SABD” (if I am correct)
> does not
>
> set overflow flag.
>
>
>
> Regards,
>
> Shahid
>
>
>
>
>
> *From:* James Molloy [mailto:james at jamesmolloy.co.uk]
> *Sent:* Thursday, April 16, 2015 9:42 PM
> *To:* Shahid, Asghar-ahmad;
> reviews+D9029+public+eee34c83a2c6f996 at reviews.llvm.org; hfinkel at anl.gov;
> spatel at rotateright.com; aschwaighofer at apple.com; ahmed.bougacha at gmail.com
>
>
> *Cc:* llvm-commits at cs.uiuc.edu
> *Subject:* Re: [PATCH] [PATCH][CodeGen] Adding "llvm.sad" intrinsic and
> corresponding ISD::SAD node for "Sum Of Absolute Differences" operation
>
>
>
> Hi Asghar-ahmad,
>
> Thanks for responding. I'll try and explain in more detail what I mean. I
> agree that we can custom lower things and that we could implement your
> intrinsic on our and other architecture. That is not in question. What is
> in question is whether the definition of the intrinsic and behaviour as-is
> would allow *efficient* implementation on multiple target architectures.
>
>
>
> To reiterate the examples from earlier, there are seemingly two different
> approaches for lowering a sum-of-absolute-differences loop.  Assume 'a' and
> 'b' are the two input arrays, as some p\
>
> ointer to vector type.
>
>
>
> 1:
>
> int r = 0;
>
> for (i = ...) {
>
>   r += sum( abs( *a++ - *b++ ) );
>
> }
>
> // r holds the sum-of-absolute-differences
>
>
>
> 2:
>
> vector int r = {0};
>
> for (i = ...) {
>
>   r += abs( *a++ - *b++ );
>
> }
>
> // r holds partial sums.
>
> int sad = sum(r);
>
> // sad holds the sum-of-absolute-differences
>
>
>
> The most efficient form of lowering for X86 may possibly be (1), where a
> PSAD instruction can be used (although for non-i8 types perhaps not?).
> For ARM, AArch64 and according to an appnote I found [0] Altivec (I
> couldn't find anything about MIPS), (2) is going to be better.
>
>
>
> So the goal as I see it is to define these intrinsics and IR idioms such
> that both forms can be generated depending on the target (and/or datatype
> - you don't have PSAD for floating point types, so if someone does a
> non-int SAD loop the most efficient form for you would be (2)).
>
>
>
> With regards custom lowering, I think you have misinterpreted me. What I
> was saying was that if the intrinsic is defined as "sum( abs( *a++ - *b++ )
> )", non-X86 backends could custom lower it as something like "ABD + ADDV"
> (absolute difference, sum-of-all-lanes). However, you'd end up with the
> sum-of-all-lanes unnecessarily being inside the loop! By the time the
> intrinsic is expanded, it may be difficult to determine that the sum can be
> moved outside the loop.
>
>
>
> Conversely, if we defined the intrinsic as "abs( *a++ - *b--)", we could
> still easily generate loop type (1) by adding a sum() around it. As it is
> easier to match a pattern than split a pattern apart and move it around
> (ISel is made for pattern matching!) this is the implementation I am
> suggesting.
>
>
>
> Yes, you're right, this means the name "SAD" for the node may be a
> misnomer. What I've asked for is the splitting apart of an opaque intrinsic
> into a smaller opaque intrinsic and generic support IR, which is something
> we do try and do where possible elsewhere in the compiler. I hope I've
> explained why the node as you've described it may not be useful for any
> non-X86 target.
>
>
>
> Also, you haven't answered my question about signedness that I've
> mentioned several times.
>
>
>
> Cheers,
>
>
>
> [0]
> http://www.freescale.com/webapp/sps/download/license.jsp?colCode=AVEC_SAD
>
>
>
> On Thu, 16 Apr 2015 at 12:12 Shahid, Asghar-ahmad <
> Asghar-ahmad.Shahid at amd.com> wrote:
>
>
>
> > -----Original Message-----
> > From: James Molloy [mailto:james.molloy at arm.com]
> > Sent: Thursday, April 16, 2015 3:00 PM
> > To: Shahid, Asghar-ahmad; hfinkel at anl.gov; spatel at rotateright.com;
> > aschwaighofer at apple.com; ahmed.bougacha at gmail.com;
> > james.molloy at arm.com
> > Cc: llvm-commits at cs.uiuc.edu
> > Subject: Re: [PATCH] [PATCH][CodeGen] Adding "llvm.sad" intrinsic and
> > corresponding ISD::SAD node for "Sum Of Absolute Differences" operation
> >
> > Hi,
> >
> > Thanks for continuing to work on this!
> >
> > From a high level, I still stand by the comments I made on your
> > LoopVectorizer review that this is a bit too Intel-specific. ARM and
> AArch64's
> > SAD instructions do not behave as you model here. You model SAD as a per-
> > element subtract, abs, then a cross-lane (horizontal) summation. The VABD
> > instruction in ARM does not do the last step - that is assumed to be
> done in a
> > separate instruction.
> While modeling SAD we wanted to be more generic than being tilted towards
> any target.
> That is why we abstracted the whole semantic (sub, abs, sum) into this
> intrinsic. Even the
> Result we have modeled to be an scalar value.
>
> >
> > Now, we can model your intrinsic semantics in ARM using:
> >
> >   val2 = ABD val0, val1
> >   val3 = ADDV val2
> This can be custom lowered for ARM. At one point X86 also need this kind
> of custom lowering for v16i8 operand type.
> Which I wrongly put into ExpandSAD() and need to be brought back to x86
> lowering.
>
> >
> > However that isn't very efficient - the sum-across-lanes is expensive.
> Ideally,
> > in a loop, we'd do something like this:
> >
> >   loop:
> >     reduction1 = ABD array0[i], array1[i]
> >     br loop
> >   val = ADDV reduction1
> >
> > So we'd only do the horizontal step once, not on every iteration. That
> is why I
> > think having the intrinsic represent just the per-element operations,
> not the
> > horizontal part, would be the lowest common denominator between our
> > two targets. It is easier to pattern match two nodes than to split one
> node
> > apart and LICM part of it.
> >
> > Similarly, the above construction should the loop vectorizer emit it is
> not very
> > good for you. You would never be able to match your PSAD instruction.
> We have already matched "psad" both coming from LV & SLP. So I am think
> this
> generalization is ok.
>
> So
> > my suggestion is this:
> >
> > - The intrinsic only models the per-element operations, not the
> horizontal
> > part.
> This is more of modeling "absolute diff" than "sum of absolute diff".
>
> > - The X86 backend pattern matches the sad intrinsic plus a horizontal
> > reduction to its PSAD node.
> > - The Loop Vectorizer has code to emit *either* the per-element-only or
> the
> > per-element-plus-horizontal SAD as part of its reduction logic.
> > - The TargetTransformInfo hook is extended to return an `enum SADType`,
> in
> > much the same way that `popcnt` type is done currently.
> >
> > How does this sound?
> With the current modeling I don't see why custom lowering will not be a
> good solution for any target
> Related issues.
>
> >
> > Cheers,
> >
> > James
> >
> >
> > REPOSITORY
> >   rL LLVM
> >
> > http://reviews.llvm.org/D9029
> >
> > EMAIL PREFERENCES
> >   http://reviews.llvm.org/settings/panel/emailpreferences/
> >
>
>
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150420/2d9b7c24/attachment.html>