[LLVMdev] [RFC] Integer Saturation Intrinsics

Thu Jan 15 10:26:08 PST 2015

On 01/14/2015 04:16 PM, Ahmed Bougacha wrote:
> On Thu, Jan 15, 2015 at 12:42 AM, Philip Reames
> <listmail at philipreames.com> wrote:
>> At a very high level, why do we need these intrinsics?
> In short, to catch sequences you can't catch in the SelectionDAG.
>
>> What is the use case?  What are typical values for N?
> Typically, you get this from compression, DSP, or pixel-handling code
> (categories that overlap a little).
> Off the top of my head, this occurs in paq8p in the test-suite, as
> well as a few other tests.
>
> You'd have something like:
>      a = x + y;
>      if (a < -128)
>        a = -128;
>      if (a > 127)
>        a = 127;
>
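For readers who want to experiment, here is a minimal C sketch contrasting the branchy form in the snippet above with the equivalent straight-line compare+select form that canonicalization would aim for. It assumes `x` and `y` are signed 8-bit inputs (the thread does not say), and the function names are hypothetical. Both versions clamp the widened sum to [-128, 127].

```c
#include <stdint.h>

/* Branchy form, as in the snippet: written with control flow, so the
 * unoptimized IR spans several basic blocks. */
int32_t clamp_add_branchy(int8_t x, int8_t y) {
    int32_t a = (int32_t)x + (int32_t)y;  /* widened: the sum cannot overflow */
    if (a < -128)
        a = -128;
    if (a > 127)
        a = 127;
    return a;
}

/* Straight-line form: two compare+select pairs, the shape a late
 * (CGP or SelectionDAG) pattern match would look for. */
int32_t clamp_add_select(int8_t x, int8_t y) {
    int32_t a = (int32_t)x + (int32_t)y;
    int32_t lo = a < -128 ? -128 : a;  /* icmp slt + select, in IR terms */
    return lo > 127 ? 127 : lo;        /* icmp sgt + select, in IR terms */
}
```

Both return 127 for (100, 100) and -128 for (-100, -100); only the shape of the control flow differs.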
>> Have you looked at just generating the conditional IR and then pattern
>> matching late?  What's the performance trade off involved?
> That's a valid concern.  The original problem is, we can't catch this
> kind of thing in the SelectionDAG, because we're limited by a single
> basic block.  I guess we could (and I gather that's the alternative
> you're presenting?) canonicalize the control flow to the 2-icmp +
> 2-select sequence, but I wasn't sure that was "workable".  Truth be
> told, I didn't investigate this very thoroughly, as I didn't expect
> any reluctance about adding intrinsics!  I'll look into it some more:
> avoid adding the intrinsic, keep the codegen additions as is, and match
> the pattern in CGP instead of in InstCombine.
Just to be clear, I'm not saying "don't add an intrinsic".  I am saying 
"make sure the costs of the intrinsic are worth it".  In particular, I 
think you're going to give up a lot of optimization benefit in practice 
by using intrinsics unless you put a *lot* of effort into making it work 
everywhere.

One middle ground might be to have an intrinsic which is immediately 
lowered to the select form, let that run through most of the IR 
optimization passes, then pattern match in either SelectionDAG or CGP 
back to the intrinsic.  Not saying this is necessarily the right 
approach, but it might be worth thinking about.
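The middle-ground approach depends on recognizing the select form again late, in CGP or SelectionDAG. One piece of that is checking that the clamp constants actually describe a signed saturation. The following is a minimal C sketch of just that check, assuming an i32 clamp min(max(x, lo), hi) with constant bounds; the function name is hypothetical, and it deliberately skips matching the icmp/select shape itself, which real matching code would also have to do.

```c
#include <stdint.h>

/* Hypothetical helper for the "pattern match back to the intrinsic" step:
 * given the constant bounds of an i32 clamp min(max(x, lo), hi), return the
 * width n for which lo == -2^(n-1) and hi == 2^(n-1)-1, or 0 if the clamp
 * is not a signed saturation to any width 1..32. */
unsigned ssat_width_from_bounds(int64_t lo, int64_t hi) {
    for (unsigned n = 1; n <= 32; ++n) {
        int64_t max_n = ((int64_t)1 << (n - 1)) - 1;  /*  2^(n-1) - 1 */
        int64_t min_n = -max_n - 1;                   /* -2^(n-1)     */
        if (lo == min_n && hi == max_n)
            return n;
    }
    return 0;
}
```

A clamp to [-128, 127] yields n = 8, while something like [-100, 100] yields 0 and would stay in select form.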
>
>> My default position is that we shouldn't add these unless there is a
>> compelling use case.  We *still* haven't gotten the *.with.overflow
>> intrinsics piped through everywhere in the optimizer.  I'm reluctant to keep
>> adding new ones.
> I get your point, and agree with the general idea, but I think
> saturation is easier to reason about so wouldn't be as big a problem.
> At least it was pretty fast to implement the usual suspects:
> vectorizability, known bits, etc.  Do you have a specific part in
> mind?
Known bits, InstSimplify, InstCombine, the SLP and loop vectorizers, LICM, 
GVN, LVI, EarlyCSE, etc.
>
> Anyway, thanks for the feedback!
>
> -Ahmed
>
>> Philip
>>
>>
>> On 01/14/2015 02:08 PM, Ahmed Bougacha wrote:
>>> Hi all,
>>>
>>> The patches linked below introduce a new family of intrinsics, for
>>> integer saturation: @llvm.usat, and @llvm.ssat (unsigned/signed).
>>> Quoting the added documentation:
>>>
>>>         %r = call i32 @llvm.ssat.i32(i32 %x, i32 %n)
>>>
>>> is equivalent to the expression min(max(x, -2^(n-1)), 2^(n-1)-1), itself
>>> implementable as the following IR:
>>>
>>>         %min_sint_n = i32 ... ; the min. signed integer of bitwidth n, -2^(n-1)
>>>         %max_sint_n = i32 ... ; the max. signed integer of bitwidth n, 2^(n-1)-1
>>>         %0 = icmp slt i32 %x, %min_sint_n
>>>         %1 = select i1 %0, i32 %min_sint_n, i32 %x
>>>         %2 = icmp sgt i32 %1, %max_sint_n
>>>         %r = select i1 %2, i32 %max_sint_n, i32 %1
>>>
>>>
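As a cross-check of the documented semantics, here is a small C reference model of `@llvm.ssat.i32`, written to mirror the four-instruction IR sequence above. The name `ssat_i32` is hypothetical, and the 1 <= n <= 32 precondition is an assumption: the quoted documentation does not say what happens for other values of n. The bounds are computed in 64-bit so that n = 32 does not overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Reference model of min(max(x, -2^(n-1)), 2^(n-1)-1) for i32.
 * Assumption: 1 <= n <= 32. */
int32_t ssat_i32(int32_t x, uint32_t n) {
    assert(n >= 1 && n <= 32);
    int64_t max_sint_n = ((int64_t)1 << (n - 1)) - 1;  /*  2^(n-1) - 1 */
    int64_t min_sint_n = -max_sint_n - 1;              /* -2^(n-1)     */
    /* %0 = icmp slt x, min ; %1 = select %0, min, x */
    int64_t r = x < min_sint_n ? min_sint_n : x;
    /* %2 = icmp sgt %1, max ; %r = select %2, max, %1 */
    r = r > max_sint_n ? max_sint_n : r;
    return (int32_t)r;
}
```

For example, `ssat_i32(200, 8)` gives 127 and `ssat_i32(-200, 8)` gives -128, matching the i8 clamp in the snippet earlier in the thread.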
>>> As a starting point, here are two patches:
>>> - http://reviews.llvm.org/D6976  Add Integer Saturation Intrinsics.
>>> - http://reviews.llvm.org/D6977  [CodeGen] Add legalization for Integer Saturation Intrinsics.
>>>
>>> From there, we can generate several new instructions that are more
>>> efficient than their expanded counterparts.  Locally, I have worked on:
>>> - ARM: the SSAT/USAT instructions (scalar)
>>> - AArch64: the SQADD/UQADD/SQSUB/UQSUB instructions (vector/scalar saturating arithmetic)
>>> - X86: PACKSS/PACKUS (vector, saturate+truncate)
>>> - X86: PADDS/PADDUS/PSUBS/PSUBUS (vector, saturating arithmetic)
>>>
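To make the target list concrete, these scalar C models show how a saturate+truncate pack and a saturating byte add both reduce to ssat with n = 8, which is presumably what lets one intrinsic feed both kinds of instruction selection. The names are hypothetical, and the models cover per-lane behaviour only; they ignore how the vector instructions lay out their source and destination registers.

```c
#include <stddef.h>
#include <stdint.h>

/* ssat(x, 8): clamp a widened value to the i8 range. */
int32_t ssat8(int32_t x) {
    return x < -128 ? -128 : (x > 127 ? 127 : x);
}

/* One lane of a saturate+truncate pack (cf. x86 PACKSSWB):
 * clamp the i16 value to [-128, 127], then narrow to i8. */
int8_t sat_trunc_i16_to_i8(int16_t x) {
    return (int8_t)ssat8(x);
}

/* One lane of a signed saturating byte add (cf. x86 PADDSB, AArch64 SQADD):
 * add in a wider type, then apply ssat(., 8). */
int8_t sat_add_i8(int8_t a, int8_t b) {
    return (int8_t)ssat8((int32_t)a + (int32_t)b);
}

/* A loop of the kind a vectorizer might map onto a vector saturating add. */
void sat_add_i8_loop(const int8_t *a, const int8_t *b, int8_t *dst, size_t len) {
    for (size_t i = 0; i < len; ++i)
        dst[i] = sat_add_i8(a[i], b[i]);
}
```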
>>> Anyway, let's first agree on the intrinsics, so that further
>>> development is done on trunk.
>>>
>>> Thanks!
>>> -Ahmed
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>