[llvm] r250560 - [x86] promote 'add nsw' to a wider type to allow more combines

Sean Silva via llvm-commits llvm-commits at lists.llvm.org
Tue Oct 20 12:57:26 PDT 2015


Thanks!

On Tue, Oct 20, 2015 at 9:02 AM, Sanjay Patel <spatel at rotateright.com>
wrote:

> I don't know if it's worth chasing, but I've listed the trade-offs that I
> see and filed it here:
> https://llvm.org/bugs/show_bug.cgi?id=25261
>
> On Mon, Oct 19, 2015 at 11:07 PM, Sean Silva <chisophugis at gmail.com>
> wrote:
>
>>
>>
>> On Fri, Oct 16, 2015 at 3:14 PM, Sanjay Patel via llvm-commits <
>> llvm-commits at lists.llvm.org> wrote:
>>
>>> Author: spatel
>>> Date: Fri Oct 16 17:14:12 2015
>>> New Revision: 250560
>>>
>>> URL: http://llvm.org/viewvc/llvm-project?rev=250560&view=rev
>>> Log:
>>> [x86] promote 'add nsw' to a wider type to allow more combines
>>>
>>> The motivation for this patch starts with PR20134:
>>> https://llvm.org/bugs/show_bug.cgi?id=20134
>>>
>>> void foo(int *a, int i) {
>>>   a[i] = a[i+1] + a[i+2];
>>> }
>>>
>>> It seems better to produce this (14 bytes):
>>>
>>> movslq  %esi, %rsi
>>> movl    0x4(%rdi,%rsi,4), %eax
>>> addl    0x8(%rdi,%rsi,4), %eax
>>> movl    %eax, (%rdi,%rsi,4)
>>>
>>
>> Can we CSE the (%rdi,%rsi,4) out of here? On Jaguar, each one is going
>> to introduce a 2-cycle latency into the memory pipeline for the
>> load/store (see http://www.realworldtech.com/jaguar/6/), and they are on
>> the critical path here. I believe that (%reg1,%reg2,4) is one of the
>> fast cases too, so both the LEA introduced by the CSE and the address
>> computations themselves are single-cycle. In other words, I predict that
>> if you microbenchmark this, the CSE'd version will be 2 cycles faster
>> (+1 - 3 = -2).
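>>
>> A minimal sketch of the CSE'd version I have in mind, written by hand
>> (the choice of %rcx as the scratch register is hypothetical, not
>> compiler output):
>>
>> movslq  %esi, %rsi
>> leaq    (%rdi,%rsi,4), %rcx    # hoist the shared address computation
>> movl    0x4(%rcx), %eax        # load a[i+1] via simple base+disp
>> addl    0x8(%rcx), %eax        # add a[i+2]
>> movl    %eax, (%rcx)           # store to a[i]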
>>
>> Of course, all of this depends on whether we can afford the scratch
>> register. I'm not sure of the answer to that.
>>
>> Even on Haswell and Sandy Bridge, I believe the 2 loads will each take
>> an extra cycle in the load units, but because there are enough AGUs for
>> the ILP here (2 AGUs available for loads), the CSE only saves you a
>> cycle on the critical path, on the store.
>>
>> -- Sean Silva
>>
>>
>>>
>>> Rather than this (22 bytes):
>>>
>>> leal    0x1(%rsi), %eax
>>> cltq
>>> leal    0x2(%rsi), %ecx
>>> movslq  %ecx, %rcx
>>> movl    (%rdi,%rcx,4), %ecx
>>> addl    (%rdi,%rax,4), %ecx
>>> movslq  %esi, %rax
>>> movl    %ecx, (%rdi,%rax,4)
>>>
>>> The most basic problem (the first test case in the patch combines
>>> constants) should also be fixed in InstCombine, but it gets more
>>> complicated after that because we need to consider architecture and
>>> micro-architecture. For example, AArch64 may not see any benefit from
>>> the more general transform because the ISA handles the sign extension
>>> in hardware. Some x86 chips may not want to replace 2 ADD insts with 1
>>> LEA, and there's an attribute for that: FeatureSlowLEA. But I suspect
>>> that doesn't go far enough, or maybe it's not getting used when it
>>> should; I'm also not sure whether FeatureSlowLEA should also mean
>>> "slow complex addressing mode".
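>>>
>>> As a hand-written illustration of that ADD-vs-LEA trade-off (the leaq
>>> form matches this patch's add_nsw_sext_add output below):
>>>
>>> addq    %rsi, %rax             # two simple ADDs...
>>> addq    $5, %rax
>>>
>>> leaq    5(%rax,%rsi), %rax     # ...vs. one 3-component LEA
>>>                                # (base + index + displacement)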
>>>
>>> I see no perf differences running test-suite on AMD Jaguar with this
>>> change, but I do see small code-size improvements when building clang
>>> and the LLVM tools with the patched compiler.
>>>
>>> A more general solution to the sext(add nsw(x, C)) problem that works
>>> for multiple targets is available
>>> in CodeGenPrepare, but it may take quite a bit more work to get that to
>>> fire on all of the test cases that
>>> this patch takes care of.
>>>
>>> Differential Revision: http://reviews.llvm.org/D13757
>>>
>>> Modified:
>>>     llvm/trunk/lib/Target/X86/X86ISelLowering.cpp
>>>     llvm/trunk/test/CodeGen/X86/add-nsw-sext.ll
>>>
>>> Modified: llvm/trunk/lib/Target/X86/X86ISelLowering.cpp
>>> URL:
>>> http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86ISelLowering.cpp?rev=250560&r1=250559&r2=250560&view=diff
>>>
>>> ==============================================================================
>>> --- llvm/trunk/lib/Target/X86/X86ISelLowering.cpp (original)
>>> +++ llvm/trunk/lib/Target/X86/X86ISelLowering.cpp Fri Oct 16 17:14:12
>>> 2015
>>> @@ -25767,6 +25767,57 @@ static SDValue PerformSIGN_EXTEND_INREGC
>>>    return SDValue();
>>>  }
>>>
>>> +/// sext(add_nsw(x, C)) --> add(sext(x), C_sext)
>>> +/// Promoting a sign extension ahead of an 'add nsw' exposes opportunities
>>> +/// to combine math ops, use an LEA, or use a complex addressing mode. This
>>> +/// can eliminate extend, add, and shift instructions.
>>> +static SDValue promoteSextBeforeAddNSW(SDNode *Sext, SelectionDAG &DAG,
>>> +                                       const X86Subtarget *Subtarget) {
>>> +  // TODO: This should be valid for other integer types.
>>> +  EVT VT = Sext->getValueType(0);
>>> +  if (VT != MVT::i64)
>>> +    return SDValue();
>>> +
>>> +  // We need an 'add nsw' feeding into the 'sext'.
>>> +  SDValue Add = Sext->getOperand(0);
>>> +  if (Add.getOpcode() != ISD::ADD || !Add->getFlags()->hasNoSignedWrap())
>>> +    return SDValue();
>>> +
>>> +  // Having a constant operand to the 'add' ensures that we are not
>>> +  // increasing the instruction count because the constant is extended for
>>> +  // free below. A constant operand can also become the displacement field
>>> +  // of an LEA.
>>> +  auto *AddOp1 = dyn_cast<ConstantSDNode>(Add.getOperand(1));
>>> +  if (!AddOp1)
>>> +    return SDValue();
>>> +
>>> +  // Don't make the 'add' bigger if there's no hope of combining it with
>>> +  // some other 'add' or 'shl' instruction.
>>> +  // TODO: It may be profitable to generate simpler LEA instructions in
>>> +  // place of single 'add' instructions, but the cost model for selecting
>>> +  // an LEA currently has a high threshold.
>>> +  bool HasLEAPotential = false;
>>> +  for (auto *User : Sext->uses()) {
>>> +    if (User->getOpcode() == ISD::ADD || User->getOpcode() == ISD::SHL) {
>>> +      HasLEAPotential = true;
>>> +      break;
>>> +    }
>>> +  }
>>> +  if (!HasLEAPotential)
>>> +    return SDValue();
>>> +
>>> +  // Everything looks good, so pull the 'sext' ahead of the 'add'.
>>> +  int64_t AddConstant = AddOp1->getSExtValue();
>>> +  SDValue AddOp0 = Add.getOperand(0);
>>> +  SDValue NewSext = DAG.getNode(ISD::SIGN_EXTEND, SDLoc(Sext), VT, AddOp0);
>>> +  SDValue NewConstant = DAG.getConstant(AddConstant, SDLoc(Add), VT);
>>> +
>>> +  // The wider add is guaranteed to not wrap because both operands are
>>> +  // sign-extended.
>>> +  SDNodeFlags Flags;
>>> +  Flags.setNoSignedWrap(true);
>>> +  return DAG.getNode(ISD::ADD, SDLoc(Add), VT, NewSext, NewConstant, &Flags);
>>> +}
>>> +
>>>  static SDValue PerformSExtCombine(SDNode *N, SelectionDAG &DAG,
>>>                                    TargetLowering::DAGCombinerInfo &DCI,
>>>                                    const X86Subtarget *Subtarget) {
>>> @@ -25861,6 +25912,9 @@ static SDValue PerformSExtCombine(SDNode
>>>      if (SDValue R = WidenMaskArithmetic(N, DAG, DCI, Subtarget))
>>>        return R;
>>>
>>> +  if (SDValue NewAdd = promoteSextBeforeAddNSW(N, DAG, Subtarget))
>>> +    return NewAdd;
>>> +
>>>    return SDValue();
>>>  }
>>>
>>>
>>> Modified: llvm/trunk/test/CodeGen/X86/add-nsw-sext.ll
>>> URL:
>>> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/add-nsw-sext.ll?rev=250560&r1=250559&r2=250560&view=diff
>>>
>>> ==============================================================================
>>> --- llvm/trunk/test/CodeGen/X86/add-nsw-sext.ll (original)
>>> +++ llvm/trunk/test/CodeGen/X86/add-nsw-sext.ll Fri Oct 16 17:14:12 2015
>>> @@ -1,15 +1,14 @@
>>>  ; RUN: llc < %s -mtriple=x86_64-unknown-unknown | FileCheck %s
>>>
>>>  ; The fundamental problem: an add separated from other arithmetic by a sext can't
>>> -; be combined with the later instructions. However, if the first add is 'nsw',
>>> +; be combined with the later instructions. However, if the first add is 'nsw',
>>>  ; then we can promote the sext ahead of that add to allow optimizations.
>>>
>>>  define i64 @add_nsw_consts(i32 %i) {
>>>  ; CHECK-LABEL: add_nsw_consts:
>>>  ; CHECK:       # BB#0:
>>> -; CHECK-NEXT:    addl $5, %edi
>>>  ; CHECK-NEXT:    movslq %edi, %rax
>>> -; CHECK-NEXT:    addq $7, %rax
>>> +; CHECK-NEXT:    addq $12, %rax
>>>  ; CHECK-NEXT:    retq
>>>
>>>    %add = add nsw i32 %i, 5
>>> @@ -24,9 +23,8 @@ define i64 @add_nsw_consts(i32 %i) {
>>>  define i64 @add_nsw_sext_add(i32 %i, i64 %x) {
>>>  ; CHECK-LABEL: add_nsw_sext_add:
>>>  ; CHECK:       # BB#0:
>>> -; CHECK-NEXT:    addl $5, %edi
>>>  ; CHECK-NEXT:    movslq %edi, %rax
>>> -; CHECK-NEXT:    addq %rsi, %rax
>>> +; CHECK-NEXT:    leaq 5(%rax,%rsi), %rax
>>>  ; CHECK-NEXT:    retq
>>>
>>>    %add = add nsw i32 %i, 5
>>> @@ -41,9 +39,8 @@ define i64 @add_nsw_sext_add(i32 %i, i64
>>>  define i64 @add_nsw_sext_lsh_add(i32 %i, i64 %x) {
>>>  ; CHECK-LABEL: add_nsw_sext_lsh_add:
>>>  ; CHECK:       # BB#0:
>>> -; CHECK-NEXT:    addl $-5, %edi
>>>  ; CHECK-NEXT:    movslq %edi, %rax
>>> -; CHECK-NEXT:    leaq (%rsi,%rax,8), %rax
>>> +; CHECK-NEXT:    leaq -40(%rsi,%rax,8), %rax
>>>  ; CHECK-NEXT:    retq
>>>
>>>    %add = add nsw i32 %i, -5
>>> @@ -73,9 +70,8 @@ define i64 @add_nsw_sext(i32 %i, i64 %x)
>>>  define i8* @gep8(i32 %i, i8* %x) {
>>>  ; CHECK-LABEL: gep8:
>>>  ; CHECK:       # BB#0:
>>> -; CHECK-NEXT:    addl $5, %edi
>>>  ; CHECK-NEXT:    movslq %edi, %rax
>>> -; CHECK-NEXT:    addq %rsi, %rax
>>> +; CHECK-NEXT:    leaq 5(%rax,%rsi), %rax
>>>  ; CHECK-NEXT:    retq
>>>
>>>    %add = add nsw i32 %i, 5
>>> @@ -87,9 +83,8 @@ define i8* @gep8(i32 %i, i8* %x) {
>>>  define i16* @gep16(i32 %i, i16* %x) {
>>>  ; CHECK-LABEL: gep16:
>>>  ; CHECK:       # BB#0:
>>> -; CHECK-NEXT:    addl $-5, %edi
>>>  ; CHECK-NEXT:    movslq %edi, %rax
>>> -; CHECK-NEXT:    leaq (%rsi,%rax,2), %rax
>>> +; CHECK-NEXT:    leaq -10(%rsi,%rax,2), %rax
>>>  ; CHECK-NEXT:    retq
>>>
>>>    %add = add nsw i32 %i, -5
>>> @@ -101,9 +96,8 @@ define i16* @gep16(i32 %i, i16* %x) {
>>>  define i32* @gep32(i32 %i, i32* %x) {
>>>  ; CHECK-LABEL: gep32:
>>>  ; CHECK:       # BB#0:
>>> -; CHECK-NEXT:    addl $5, %edi
>>>  ; CHECK-NEXT:    movslq %edi, %rax
>>> -; CHECK-NEXT:    leaq (%rsi,%rax,4), %rax
>>> +; CHECK-NEXT:    leaq 20(%rsi,%rax,4), %rax
>>>  ; CHECK-NEXT:    retq
>>>
>>>    %add = add nsw i32 %i, 5
>>> @@ -115,9 +109,8 @@ define i32* @gep32(i32 %i, i32* %x) {
>>>  define i64* @gep64(i32 %i, i64* %x) {
>>>  ; CHECK-LABEL: gep64:
>>>  ; CHECK:       # BB#0:
>>> -; CHECK-NEXT:    addl $-5, %edi
>>>  ; CHECK-NEXT:    movslq %edi, %rax
>>> -; CHECK-NEXT:    leaq (%rsi,%rax,8), %rax
>>> +; CHECK-NEXT:    leaq -40(%rsi,%rax,8), %rax
>>>  ; CHECK-NEXT:    retq
>>>
>>>    %add = add nsw i32 %i, -5
>>> @@ -131,10 +124,9 @@ define i64* @gep64(i32 %i, i64* %x) {
>>>  define i128* @gep128(i32 %i, i128* %x) {
>>>  ; CHECK-LABEL: gep128:
>>>  ; CHECK:       # BB#0:
>>> -; CHECK-NEXT:    addl $5, %edi
>>>  ; CHECK-NEXT:    movslq %edi, %rax
>>>  ; CHECK-NEXT:    shlq $4, %rax
>>> -; CHECK-NEXT:    addq %rsi, %rax
>>> +; CHECK-NEXT:    leaq 80(%rax,%rsi), %rax
>>>  ; CHECK-NEXT:    retq
>>>
>>>    %add = add nsw i32 %i, 5
>>> @@ -150,14 +142,10 @@ define i128* @gep128(i32 %i, i128* %x) {
>>>  define void @PR20134(i32* %a, i32 %i) {
>>>  ; CHECK-LABEL: PR20134:
>>>  ; CHECK:       # BB#0:
>>> -; CHECK-NEXT:    leal 1(%rsi), %eax
>>> -; CHECK-NEXT:    cltq
>>> -; CHECK-NEXT:    movl (%rdi,%rax,4), %eax
>>> -; CHECK-NEXT:    leal 2(%rsi), %ecx
>>> -; CHECK-NEXT:    movslq %ecx, %rcx
>>> -; CHECK-NEXT:    addl (%rdi,%rcx,4), %eax
>>> -; CHECK-NEXT:    movslq %esi, %rcx
>>> -; CHECK-NEXT:    movl %eax, (%rdi,%rcx,4)
>>> +; CHECK-NEXT:    movslq %esi, %rax
>>> +; CHECK-NEXT:    movl 4(%rdi,%rax,4), %ecx
>>> +; CHECK-NEXT:    addl 8(%rdi,%rax,4), %ecx
>>> +; CHECK-NEXT:    movl %ecx, (%rdi,%rax,4)
>>>  ; CHECK-NEXT:    retq
>>>
>>>    %add1 = add nsw i32 %i, 1