[PATCH] [PATCH][CodeGen] Adding "llvm.sad" intrinsic and corresponding ISD::SAD node for "Sum Of Absolute Differences" operation

Thu Apr 16 04:08:27 PDT 2015


> -----Original Message-----
> From: James Molloy [mailto:james.molloy at arm.com]
> Sent: Thursday, April 16, 2015 3:00 PM
> To: Shahid, Asghar-ahmad; hfinkel at anl.gov; spatel at rotateright.com;
> aschwaighofer at apple.com; ahmed.bougacha at gmail.com;
> james.molloy at arm.com
> Cc: llvm-commits at cs.uiuc.edu
> Subject: Re: [PATCH] [PATCH][CodeGen] Adding "llvm.sad" intrinsic and
> corresponding ISD::SAD node for "Sum Of Absolute Differences" operation
> 
> Hi,
> 
> Thanks for continuing to work on this!
> 
> From a high level, I still stand by the comments I made on your
> LoopVectorizer review that this is a bit too Intel-specific. ARM and AArch64's
> SAD instructions do not behave as you model here. You model SAD as a per-
> element subtract, abs, then a cross-lane (horizontal) summation. The VABD
> instruction in ARM does not do the last step - that is assumed to be done in a
> separate instruction.
While modeling SAD we wanted to be more generic than being tilted towards any target.
That is why we abstracted the whole semantic (sub, abs, sum) into this intrinsic. Even the
Result we have modeled to be an scalar value.

> 
> Now, we can model your intrinsic semantics in ARM using:
> 
>   val2 = ABD val0, val1
>   val3 = ADDV val2
This can be custom lowered for ARM. At one point X86 also need this kind of custom lowering for v16i8 operand type.
Which I wrongly put into ExpandSAD() and need to be brought back to x86 lowering.

> 
> However that isn't very efficient - the sum-across-lanes is expensive. Ideally,
> in a loop, we'd do something like this:
> 
>   loop:
>     reduction1 = ABD array0[i], array1[i]
>     br loop
>   val = ADDV reduction1
> 
> So we'd only do the horizontal step once, not on every iteration. That is why I
> think having the intrinsic represent just the per-element operations, not the
> horizontal part, would be the lowest common denominator between our
> two targets. It is easier to pattern match two nodes than to split one node
> apart and LICM part of it.
> 
> Similarly, the above construction should the loop vectorizer emit it is not very
> good for you. You would never be able to match your PSAD instruction. 
We have already matched "psad" both coming from LV & SLP. So I am think this
generalization is ok.

So
> my suggestion is this:
> 
> - The intrinsic only models the per-element operations, not the horizontal
> part.
This is more of modeling "absolute diff" than "sum of absolute diff".

> - The X86 backend pattern matches the sad intrinsic plus a horizontal
> reduction to its PSAD node.
> - The Loop Vectorizer has code to emit *either* the per-element-only or the
> per-element-plus-horizontal SAD as part of its reduction logic.
> - The TargetTransformInfo hook is extended to return an `enum SADType`, in
> much the same way that `popcnt` type is done currently.
> 
> How does this sound?
With the current modeling I don't see why custom lowering will not be a good solution for any target
Related issues.

> 
> Cheers,
> 
> James
> 
> 
> REPOSITORY
>   rL LLVM
> 
> http://reviews.llvm.org/D9029
> 
> EMAIL PREFERENCES
>   http://reviews.llvm.org/settings/panel/emailpreferences/
>