[llvm] r250560 - [x86] promote 'add nsw' to a wider type to allow more combines

Sean Silva via llvm-commits llvm-commits at lists.llvm.org
Mon Oct 19 22:07:54 PDT 2015


On Fri, Oct 16, 2015 at 3:14 PM, Sanjay Patel via llvm-commits <
llvm-commits at lists.llvm.org> wrote:

> Author: spatel
> Date: Fri Oct 16 17:14:12 2015
> New Revision: 250560
>
> URL: http://llvm.org/viewvc/llvm-project?rev=250560&view=rev
> Log:
> [x86] promote 'add nsw' to a wider type to allow more combines
>
> The motivation for this patch starts with PR20134:
> https://llvm.org/bugs/show_bug.cgi?id=20134
>
> void foo(int *a, int i) {
>   a[i] = a[i+1] + a[i+2];
> }
>
> It seems better to produce this (14 bytes):
>
> movslq  %esi, %rsi
> movl    0x4(%rdi,%rsi,4), %eax
> addl    0x8(%rdi,%rsi,4), %eax
> movl    %eax, (%rdi,%rsi,4)
>

Can we CSE the (%rdi,%rsi,4) out of here? On Jaguar, each one of them is
going to introduce a 2-cycle latency into the memory pipeline for the
load/store (see http://www.realworldtech.com/jaguar/6/), and they are on
the critical path here. I believe that (%reg1,%reg2,4) is one of the fast
cases too, so both the LEA introduced by the CSE and the address
computations themselves are single-cycle. In other words, I predict that
if you microbenchmark this, the CSE'd version will be 2 cycles faster
(+1 - 3 = -2).

Of course, all of this depends on whether we can afford the scratch
register. I'm not sure of the answer to that.
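
Concretely, the CSE'd sequence I have in mind would look something like
this (using %rcx as the scratch register purely for illustration):

  movslq  %esi, %rsi           # sign-extend the index once
  leaq    (%rdi,%rsi,4), %rcx  # CSE the common address: a + 4*i
  movl    0x4(%rcx), %eax      # load a[i+1] with a simple base+disp address
  addl    0x8(%rcx), %eax      # add a[i+2]
  movl    %eax, (%rcx)         # store to a[i]

That's one extra instruction, but it takes the complex (%rdi,%rsi,4)
addressing mode off of all three memory ops, which is where the extra
latency comes from.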

Even on Haswell and Sandy Bridge, I believe the 2 loads will each take an
extra cycle in the load units, but because there are enough AGUs for the
ILP here (2 AGUs available for loads), the CSE only saves you a cycle on
the critical path, on the store.

-- Sean Silva


>
> Rather than this (22 bytes):
>
> leal    0x1(%rsi), %eax
> cltq
> leal    0x2(%rsi), %ecx
> movslq  %ecx, %rcx
> movl    (%rdi,%rcx,4), %ecx
> addl    (%rdi,%rax,4), %ecx
> movslq  %esi, %rax
> movl    %ecx, (%rdi,%rax,4)
>
> The most basic problem (the first test case in the patch combines
> constants) should also be fixed in InstCombine,
> but it gets more complicated after that because we need to consider
> architecture and micro-architecture. For
> example, AArch64 may not see any benefit from the more general transform
> because the ISA handles the sign extension in
> hardware. Some x86 chips may not want to replace 2 ADD insts with 1 LEA,
> and there's an attribute for that:
> FeatureSlowLEA. But I suspect that doesn't go far enough or maybe it's not
> getting used when it should; I'm
> also not sure if FeatureSlowLEA should also mean "slow complex addressing
> mode".
>
> I see no perf differences on test-suite with this change running on AMD
> Jaguar, but I see small code size
> improvements when building clang and the LLVM tools with the patched
> compiler.
>
> A more general solution to the sext(add nsw(x, C)) problem that works for
> multiple targets is available
> in CodeGenPrepare, but it may take quite a bit more work to get that to
> fire on all of the test cases that
> this patch takes care of.
>
> Differential Revision: http://reviews.llvm.org/D13757
>
> Modified:
>     llvm/trunk/lib/Target/X86/X86ISelLowering.cpp
>     llvm/trunk/test/CodeGen/X86/add-nsw-sext.ll
>
> Modified: llvm/trunk/lib/Target/X86/X86ISelLowering.cpp
> URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86ISelLowering.cpp?rev=250560&r1=250559&r2=250560&view=diff
>
> ==============================================================================
> --- llvm/trunk/lib/Target/X86/X86ISelLowering.cpp (original)
> +++ llvm/trunk/lib/Target/X86/X86ISelLowering.cpp Fri Oct 16 17:14:12 2015
> @@ -25767,6 +25767,57 @@ static SDValue PerformSIGN_EXTEND_INREGC
>    return SDValue();
>  }
>
> +/// sext(add_nsw(x, C)) --> add(sext(x), C_sext)
> +/// Promoting a sign extension ahead of an 'add nsw' exposes opportunities
> +/// to combine math ops, use an LEA, or use a complex addressing mode. This can
> +/// eliminate extend, add, and shift instructions.
> +static SDValue promoteSextBeforeAddNSW(SDNode *Sext, SelectionDAG &DAG,
> +                                       const X86Subtarget *Subtarget) {
> +  // TODO: This should be valid for other integer types.
> +  EVT VT = Sext->getValueType(0);
> +  if (VT != MVT::i64)
> +    return SDValue();
> +
> +  // We need an 'add nsw' feeding into the 'sext'.
> +  SDValue Add = Sext->getOperand(0);
> +  if (Add.getOpcode() != ISD::ADD || !Add->getFlags()->hasNoSignedWrap())
> +    return SDValue();
> +
> +  // Having a constant operand to the 'add' ensures that we are not increasing
> +  // the instruction count because the constant is extended for free below.
> +  // A constant operand can also become the displacement field of an LEA.
> +  auto *AddOp1 = dyn_cast<ConstantSDNode>(Add.getOperand(1));
> +  if (!AddOp1)
> +    return SDValue();
> +
> +  // Don't make the 'add' bigger if there's no hope of combining it with some
> +  // other 'add' or 'shl' instruction.
> +  // TODO: It may be profitable to generate simpler LEA instructions in place
> +  // of single 'add' instructions, but the cost model for selecting an LEA
> +  // currently has a high threshold.
> +  bool HasLEAPotential = false;
> +  for (auto *User : Sext->uses()) {
> +    if (User->getOpcode() == ISD::ADD || User->getOpcode() == ISD::SHL) {
> +      HasLEAPotential = true;
> +      break;
> +    }
> +  }
> +  if (!HasLEAPotential)
> +    return SDValue();
> +
> +  // Everything looks good, so pull the 'sext' ahead of the 'add'.
> +  int64_t AddConstant = AddOp1->getSExtValue();
> +  SDValue AddOp0 = Add.getOperand(0);
> +  SDValue NewSext = DAG.getNode(ISD::SIGN_EXTEND, SDLoc(Sext), VT, AddOp0);
> +  SDValue NewConstant = DAG.getConstant(AddConstant, SDLoc(Add), VT);
> +
> +  // The wider add is guaranteed to not wrap because both operands are
> +  // sign-extended.
> +  SDNodeFlags Flags;
> +  Flags.setNoSignedWrap(true);
> +  return DAG.getNode(ISD::ADD, SDLoc(Add), VT, NewSext, NewConstant, &Flags);
> +}
> +
>  static SDValue PerformSExtCombine(SDNode *N, SelectionDAG &DAG,
>                                    TargetLowering::DAGCombinerInfo &DCI,
>                                    const X86Subtarget *Subtarget) {
> @@ -25861,6 +25912,9 @@ static SDValue PerformSExtCombine(SDNode
>      if (SDValue R = WidenMaskArithmetic(N, DAG, DCI, Subtarget))
>        return R;
>
> +  if (SDValue NewAdd = promoteSextBeforeAddNSW(N, DAG, Subtarget))
> +    return NewAdd;
> +
>    return SDValue();
>  }
>
>
> Modified: llvm/trunk/test/CodeGen/X86/add-nsw-sext.ll
> URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/add-nsw-sext.ll?rev=250560&r1=250559&r2=250560&view=diff
>
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/add-nsw-sext.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/add-nsw-sext.ll Fri Oct 16 17:14:12 2015
> @@ -1,15 +1,14 @@
>  ; RUN: llc < %s -mtriple=x86_64-unknown-unknown | FileCheck %s
>
>  ; The fundamental problem: an add separated from other arithmetic by a sext can't
> -; be combined with the later instructions. However, if the first add is  'nsw',
> +; be combined with the later instructions. However, if the first add is 'nsw',
>  ; then we can promote the sext ahead of that add to allow optimizations.
>
>  define i64 @add_nsw_consts(i32 %i) {
>  ; CHECK-LABEL: add_nsw_consts:
>  ; CHECK:       # BB#0:
> -; CHECK-NEXT:    addl $5, %edi
>  ; CHECK-NEXT:    movslq %edi, %rax
> -; CHECK-NEXT:    addq $7, %rax
> +; CHECK-NEXT:    addq $12, %rax
>  ; CHECK-NEXT:    retq
>
>    %add = add nsw i32 %i, 5
> @@ -24,9 +23,8 @@ define i64 @add_nsw_consts(i32 %i) {
>  define i64 @add_nsw_sext_add(i32 %i, i64 %x) {
>  ; CHECK-LABEL: add_nsw_sext_add:
>  ; CHECK:       # BB#0:
> -; CHECK-NEXT:    addl $5, %edi
>  ; CHECK-NEXT:    movslq %edi, %rax
> -; CHECK-NEXT:    addq %rsi, %rax
> +; CHECK-NEXT:    leaq 5(%rax,%rsi), %rax
>  ; CHECK-NEXT:    retq
>
>    %add = add nsw i32 %i, 5
> @@ -41,9 +39,8 @@ define i64 @add_nsw_sext_add(i32 %i, i64
>  define i64 @add_nsw_sext_lsh_add(i32 %i, i64 %x) {
>  ; CHECK-LABEL: add_nsw_sext_lsh_add:
>  ; CHECK:       # BB#0:
> -; CHECK-NEXT:    addl $-5, %edi
>  ; CHECK-NEXT:    movslq %edi, %rax
> -; CHECK-NEXT:    leaq (%rsi,%rax,8), %rax
> +; CHECK-NEXT:    leaq -40(%rsi,%rax,8), %rax
>  ; CHECK-NEXT:    retq
>
>    %add = add nsw i32 %i, -5
> @@ -73,9 +70,8 @@ define i64 @add_nsw_sext(i32 %i, i64 %x)
>  define i8* @gep8(i32 %i, i8* %x) {
>  ; CHECK-LABEL: gep8:
>  ; CHECK:       # BB#0:
> -; CHECK-NEXT:    addl $5, %edi
>  ; CHECK-NEXT:    movslq %edi, %rax
> -; CHECK-NEXT:    addq %rsi, %rax
> +; CHECK-NEXT:    leaq 5(%rax,%rsi), %rax
>  ; CHECK-NEXT:    retq
>
>    %add = add nsw i32 %i, 5
> @@ -87,9 +83,8 @@ define i8* @gep8(i32 %i, i8* %x) {
>  define i16* @gep16(i32 %i, i16* %x) {
>  ; CHECK-LABEL: gep16:
>  ; CHECK:       # BB#0:
> -; CHECK-NEXT:    addl $-5, %edi
>  ; CHECK-NEXT:    movslq %edi, %rax
> -; CHECK-NEXT:    leaq (%rsi,%rax,2), %rax
> +; CHECK-NEXT:    leaq -10(%rsi,%rax,2), %rax
>  ; CHECK-NEXT:    retq
>
>    %add = add nsw i32 %i, -5
> @@ -101,9 +96,8 @@ define i16* @gep16(i32 %i, i16* %x) {
>  define i32* @gep32(i32 %i, i32* %x) {
>  ; CHECK-LABEL: gep32:
>  ; CHECK:       # BB#0:
> -; CHECK-NEXT:    addl $5, %edi
>  ; CHECK-NEXT:    movslq %edi, %rax
> -; CHECK-NEXT:    leaq (%rsi,%rax,4), %rax
> +; CHECK-NEXT:    leaq 20(%rsi,%rax,4), %rax
>  ; CHECK-NEXT:    retq
>
>    %add = add nsw i32 %i, 5
> @@ -115,9 +109,8 @@ define i32* @gep32(i32 %i, i32* %x) {
>  define i64* @gep64(i32 %i, i64* %x) {
>  ; CHECK-LABEL: gep64:
>  ; CHECK:       # BB#0:
> -; CHECK-NEXT:    addl $-5, %edi
>  ; CHECK-NEXT:    movslq %edi, %rax
> -; CHECK-NEXT:    leaq (%rsi,%rax,8), %rax
> +; CHECK-NEXT:    leaq -40(%rsi,%rax,8), %rax
>  ; CHECK-NEXT:    retq
>
>    %add = add nsw i32 %i, -5
> @@ -131,10 +124,9 @@ define i64* @gep64(i32 %i, i64* %x) {
>  define i128* @gep128(i32 %i, i128* %x) {
>  ; CHECK-LABEL: gep128:
>  ; CHECK:       # BB#0:
> -; CHECK-NEXT:    addl $5, %edi
>  ; CHECK-NEXT:    movslq %edi, %rax
>  ; CHECK-NEXT:    shlq $4, %rax
> -; CHECK-NEXT:    addq %rsi, %rax
> +; CHECK-NEXT:    leaq 80(%rax,%rsi), %rax
>  ; CHECK-NEXT:    retq
>
>    %add = add nsw i32 %i, 5
> @@ -150,14 +142,10 @@ define i128* @gep128(i32 %i, i128* %x) {
>  define void @PR20134(i32* %a, i32 %i) {
>  ; CHECK-LABEL: PR20134:
>  ; CHECK:       # BB#0:
> -; CHECK-NEXT:    leal 1(%rsi), %eax
> -; CHECK-NEXT:    cltq
> -; CHECK-NEXT:    movl (%rdi,%rax,4), %eax
> -; CHECK-NEXT:    leal 2(%rsi), %ecx
> -; CHECK-NEXT:    movslq %ecx, %rcx
> -; CHECK-NEXT:    addl (%rdi,%rcx,4), %eax
> -; CHECK-NEXT:    movslq %esi, %rcx
> -; CHECK-NEXT:    movl %eax, (%rdi,%rcx,4)
> +; CHECK-NEXT:    movslq %esi, %rax
> +; CHECK-NEXT:    movl 4(%rdi,%rax,4), %ecx
> +; CHECK-NEXT:    addl 8(%rdi,%rax,4), %ecx
> +; CHECK-NEXT:    movl %ecx, (%rdi,%rax,4)
>  ; CHECK-NEXT:    retq
>
>    %add1 = add nsw i32 %i, 1
>
>