[PATCH] [PATCH][CodeGen] Adding "llvm.sad" intrinsic and corresponding ISD::SAD node for "Sum Of Absolute Differences" operation

Thu Apr 16 10:48:55 PDT 2015

FWIW I agree with James' concerns: I thought the intrinsic had already
been agreed upon, but that doesn't seem to be the case.

This got me thinking, and perhaps that's what you're saying James:  if
the SAD intrinsic is too X86-specific, why do we even need an
intrinsic?   We already express horizontal reductions using the
expanded form (if that's not ideal, then that's a separate topic);
the SUB is just a "sub <16 x i8>".

The tricky part is the ABS, which AFAIK we express using the expanded
form as well (InstCombine seems to agree).

If the only reason for the intrinsic is making it easier on the
vectorizer (I haven't really followed that discussion), it sounds like
there's something to be done for the more general problem (e.g. MAX,
or SAT, which I have not gotten back to yet).

Am I making sense?

-Ahmed

On Thu, Apr 16, 2015 at 2:30 AM, James Molloy <james.molloy at arm.com> wrote:
> Hi,
>
> Thanks for continuing to work on this!
>
> From a high level, I still stand by the comments I made on your LoopVectorizer review that this is a bit too Intel-specific. ARM and AArch64's SAD instructions do not behave as you model here. You model SAD as a per-element subtract, abs, then a cross-lane (horizontal) summation. The VABD instruction in ARM does not do the last step - that is assumed to be done in a separate instruction.
>
> Now, we can model your intrinsic semantics in ARM using:
>
>   val2 = ABD val0, val1
>   val3 = ADDV val2
>
> However that isn't very efficient - the sum-across-lanes is expensive. Ideally, in a loop, we'd do something like this:
>
>   loop:
>     reduction1 = ABD array0[i], array1[i]
>     br loop
>   val = ADDV reduction1
>
> So we'd only do the horizontal step once, not on every iteration. That is why I think having the intrinsic represent just the per-element operations, not the horizontal part, would be the lowest common denominator between our two targets. It is easier to pattern match two nodes than to split one node apart and LICM part of it.
>
> Similarly, the above construction should the loop vectorizer emit it is not very good for you. You would never be able to match your PSAD instruction. So my suggestion is this:
>
> - The intrinsic only models the per-element operations, not the horizontal part.
> - The X86 backend pattern matches the sad intrinsic plus a horizontal reduction to its PSAD node.
> - The Loop Vectorizer has code to emit *either* the per-element-only or the per-element-plus-horizontal SAD as part of its reduction logic.
> - The TargetTransformInfo hook is extended to return an `enum SADType`, in much the same way that `popcnt` type is done currently.
>
> How does this sound?
>
> Cheers,
>
> James
>
>
> REPOSITORY
>   rL LLVM
>
> http://reviews.llvm.org/D9029
>
> EMAIL PREFERENCES
>   http://reviews.llvm.org/settings/panel/emailpreferences/
>
>