<html>

  <head>

    <meta content="text/html; charset=windows-1252"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <br>

    <div class="moz-cite-prefix">On 09/17/2014 11:21 AM, Tom Stellard

      wrote:<br>

    </div>

    <blockquote cite="mid:20140917152131.GA22471@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">Hi,

The attached series adds a pass for lowering 64-bit division in the R600

backend and also fixes some bugs uncovered along the way.

This new pass replaces the old 64-bit div lowering used for Evergreen/NI

subtargets, which was found to have some bugs.

-Tom

</pre>

      </div>

      <br>

      <fieldset class="mimeAttachmentHeader"><legend

          class="mimeAttachmentHeaderName">0001-R600-SI-Use-ISD-MUL-instead-of-ISD-UMULO-when-loweri.patch</legend></fieldset>

      <br>

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">From ede048f49e8e550176c567c0bfa1bd3679189c10 Mon Sep 17 00:00:00 2001

From: Tom Stellard <a moz-do-not-send="true" class="moz-txt-link-rfc2396E" href="mailto:thomas.stellard@amd.com"><thomas.stellard@amd.com></a>

Date: Tue, 16 Sep 2014 10:35:23 -0400

Subject: [PATCH 1/8] R600/SI: Use ISD::MUL instead of ISD::UMULO when lowering

 division

ISD::MUL and ISD:UMULO are the same except that UMULO sets an overflow

bit.  Since we aren't using the overflow bit, we should use ISD::MUL.

---

 lib/Target/R600/AMDGPUISelLowering.cpp | 6 +++---

 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/Target/R600/AMDGPUISelLowering.cpp b/lib/Target/R600/AMDGPUISelLowering.cpp

index d7f12ef..293a89d 100644

--- a/lib/Target/R600/AMDGPUISelLowering.cpp

+++ b/lib/Target/R600/AMDGPUISelLowering.cpp

@@ -1510,8 +1510,8 @@ SDValue AMDGPUTargetLowering::LowerUDIVREM(SDValue Op,

   // e is rounding error.

   SDValue RCP = DAG.getNode(AMDGPUISD::URECIP, DL, VT, Den);

-  // RCP_LO = umulo(RCP, Den) */

-  SDValue RCP_LO = DAG.getNode(ISD::UMULO, DL, VT, RCP, Den);

+  // RCP_LO = mul(RCP, Den) */

+  SDValue RCP_LO = DAG.getNode(ISD::MUL, DL, VT, RCP, Den);

   // RCP_HI = mulhu (RCP, Den) */

   SDValue RCP_HI = DAG.getNode(ISD::MULHU, DL, VT, RCP, Den);

@@ -1542,7 +1542,7 @@ SDValue AMDGPUTargetLowering::LowerUDIVREM(SDValue Op,

   SDValue Quotient = DAG.getNode(ISD::MULHU, DL, VT, Tmp0, Num);

   // Num_S_Remainder = Quotient * Den

-  SDValue Num_S_Remainder = DAG.getNode(ISD::UMULO, DL, VT, Quotient, Den);

+  SDValue Num_S_Remainder = DAG.getNode(ISD::MUL, DL, VT, Quotient, Den);

   // Remainder = Num - Num_S_Remainder

   SDValue Remainder = DAG.getNode(ISD::SUB, DL, VT, Num, Num_S_Remainder);

<div class="moz-txt-sig">-- 

1.8.5.5

</div></pre>

      </div>

      <br>

      <fieldset class="mimeAttachmentHeader"><legend

          class="mimeAttachmentHeaderName">0002-R600-Don-t-set-BypassSlowDiv-for-64-bit-division.patch</legend></fieldset>

      <br>

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">From 74a83f33e1df72b942fdbe12701cad75750f50f7 Mon Sep 17 00:00:00 2001

From: Tom Stellard <a moz-do-not-send="true" class="moz-txt-link-rfc2396E" href="mailto:thomas.stellard@amd.com"><thomas.stellard@amd.com></a>

Date: Mon, 15 Sep 2014 12:21:35 -0400

Subject: [PATCH 2/8] R600: Don't set BypassSlowDiv for 64-bit division

BypassSlowDiv is used by codegen prepare to insert a run-time

check to see if the operands to a 64-bit division are really 32-bit

values and if they are it will do 32-bit division instead.

This is not useful for R600, which has predicated control flow since

both the 32-bit and 64-bit paths will be executed in most cases.  It

also increases code size which can lead to more instruction cache

misses.

---

 lib/Target/R600/AMDGPUISelLowering.cpp | 3 ---

 1 file changed, 3 deletions(-)

diff --git a/lib/Target/R600/AMDGPUISelLowering.cpp b/lib/Target/R600/AMDGPUISelLowering.cpp

index 293a89d..f353c94 100644

--- a/lib/Target/R600/AMDGPUISelLowering.cpp

+++ b/lib/Target/R600/AMDGPUISelLowering.cpp

@@ -388,9 +388,6 @@ AMDGPUTargetLowering::AMDGPUTargetLowering(TargetMachine &TM) :

   setIntDivIsCheap(false);

   setPow2SDivIsCheap(false);

-  // TODO: Investigate this when 64-bit divides are implemented.

-  addBypassSlowDiv(64, 32);

-

   // FIXME: Need to really handle these.

   MaxStoresPerMemcpy  = 4096;

   MaxStoresPerMemmove = 4096;

<div class="moz-txt-sig">-- 

1.8.5.5

</div></pre>

      </div>

      <br>

    </blockquote>

    <br>

    LGTM<br>

    <br>

    <blockquote cite="mid:20140917152131.GA22471@freedesktop.org"

      type="cite">

      <fieldset class="mimeAttachmentHeader"><legend

          class="mimeAttachmentHeaderName">0003-R600-SI-Use-isOperandLegal-to-simplify-legalization-.patch</legend></fieldset>

      <br>

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">From 835d8f79d0491611566ac124fe072f8f85723cb3 Mon Sep 17 00:00:00 2001

From: Tom Stellard <a moz-do-not-send="true" class="moz-txt-link-rfc2396E" href="mailto:thomas.stellard@amd.com"><thomas.stellard@amd.com></a>

Date: Tue, 16 Sep 2014 09:19:11 -0400

Subject: [PATCH 3/8] R600/SI: Use isOperandLegal() to simplify legalization of

 VOP3 instructions

---

 lib/Target/R600/SIInstrInfo.cpp | 27 +++------------------------

 1 file changed, 3 insertions(+), 24 deletions(-)

diff --git a/lib/Target/R600/SIInstrInfo.cpp b/lib/Target/R600/SIInstrInfo.cpp

index 294aa70..1b90d41 100644

--- a/lib/Target/R600/SIInstrInfo.cpp

+++ b/lib/Target/R600/SIInstrInfo.cpp

@@ -1174,33 +1174,12 @@ void SIInstrInfo::legalizeOperands(MachineInstr *MI) const {

   // Legalize VOP3

   if (isVOP3(MI->getOpcode())) {

     int VOP3Idx[3] = {Src0Idx, Src1Idx, Src2Idx};

-    unsigned SGPRReg = AMDGPU::NoRegister;

     for (unsigned i = 0; i < 3; ++i) {

       int Idx = VOP3Idx[i];

-      if (Idx == -1)

-        continue;

-      MachineOperand &MO = MI->getOperand(Idx);

-

-      if (MO.isReg()) {

-        if (!RI.isSGPRClass(MRI.getRegClass(MO.getReg())))

-          continue; // VGPRs are legal

-

-        assert(MO.getReg() != AMDGPU::SCC && "SCC operand to VOP3 instruction");

-

-        if (SGPRReg == AMDGPU::NoRegister || SGPRReg == MO.getReg()) {

-          SGPRReg = MO.getReg();

-          // We can use one SGPR in each VOP3 instruction.

-          continue;

-        }</pre>

      </div>

    </blockquote>

    This looks like it checks that only one SGPR is used, but I don't

    think isOperandLegal can do that only looking at one operand. This

    loop could also be improved to find the operand that requires the

    fewest moves (e.g. inst s0, s1, s1 I think would end up finding s0

    first and inserting moves for both copies of s1)<br>

    <br>

    <blockquote cite="mid:20140917152131.GA22471@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

-      } else if (!isLiteralConstant(MO)) {

-        // If it is not a register and not a literal constant, then it must be

-        // an inline constant which is always legal.

-        continue;

-      }

-      // If we make it this far, then the operand is not legal and we must

-      // legalize it.

-      legalizeOpWithMove(MI, Idx);

+      if (Idx != -1 && !isOperandLegal(MI, Idx))

+        legalizeOpWithMove(MI, Idx);

     }

+    return;

   }

   // Legalize REG_SEQUENCE and PHI

<div class="moz-txt-sig">-- 

1.8.5.5

</div></pre>

      </div>

      <br>

      <fieldset class="mimeAttachmentHeader"><legend

          class="mimeAttachmentHeaderName">0004-R600-SI-Remove-modifier-operands-from-V_CNDMASK_B32_.patch</legend></fieldset>

      <br>

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">From e3ee19a2997e7634991ee5950e8d4eb9e3448d97 Mon Sep 17 00:00:00 2001

From: Tom Stellard <a moz-do-not-send="true" class="moz-txt-link-rfc2396E" href="mailto:thomas.stellard@amd.com"><thomas.stellard@amd.com></a>

Date: Tue, 16 Sep 2014 09:19:50 -0400

Subject: [PATCH 4/8] R600/SI: Remove modifier operands from V_CNDMASK_B32_e64

Modifiers don't work for this instruction.

---

 lib/Target/R600/SIInstructions.td   | 5 ++---

 lib/Target/R600/SILowerI1Copies.cpp | 6 +-----

 2 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/lib/Target/R600/SIInstructions.td b/lib/Target/R600/SIInstructions.td

index d27ddf3..8cbdc55 100644

--- a/lib/Target/R600/SIInstructions.td

+++ b/lib/Target/R600/SIInstructions.td

@@ -1239,9 +1239,8 @@ def V_CNDMASK_B32_e32 : VOP2 <0x00000000, (outs VReg_32:$dst),

 }

 def V_CNDMASK_B32_e64 : VOP3 <0x00000100, (outs VReg_32:$dst),

-  (ins VSrc_32:$src0, VSrc_32:$src1, SSrc_64:$src2,

-   InstFlag:$abs, InstFlag:$clamp, InstFlag:$omod, InstFlag:$neg),

-  "V_CNDMASK_B32_e64 $dst, $src0, $src1, $src2, $abs, $clamp, $omod, $neg",

+  (ins VSrc_32:$src0, VSrc_32:$src1, SSrc_64:$src2),

+  "V_CNDMASK_B32_e64 $dst, $src0, $src1, $src2",

   [(set i32:$dst, (select i1:$src2, i32:$src1, i32:$src0))]

 > {

   let src0_modifiers = 0;

diff --git a/lib/Target/R600/SILowerI1Copies.cpp b/lib/Target/R600/SILowerI1Copies.cpp

index 1f0f24b..3ab0c2a 100644

--- a/lib/Target/R600/SILowerI1Copies.cpp

+++ b/lib/Target/R600/SILowerI1Copies.cpp

@@ -127,11 +127,7 @@ bool SILowerI1Copies::runOnMachineFunction(MachineFunction &MF) {

                 .addOperand(MI.getOperand(0))

                 .addImm(0)

                 .addImm(-1)

-                .addOperand(MI.getOperand(1))

-                .addImm(0)

-                .addImm(0)

-                .addImm(0)

-                .addImm(0);

+                .addOperand(MI.getOperand(1));

         MI.eraseFromParent();

       } else if (TRI->getCommonSubClass(DstRC, &AMDGPU::SGPR_64RegClass) &&

                  SrcRC == &AMDGPU::VReg_1RegClass) {

<div class="moz-txt-sig">-- 

1.8.5.5

</div></pre>

      </div>

    </blockquote>

    LGTM<br>

    <br>

    <blockquote cite="mid:20140917152131.GA22471@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap=""><div class="moz-txt-sig">

</div></pre>

      </div>

      <br>

      <fieldset class="mimeAttachmentHeader"><legend

          class="mimeAttachmentHeaderName">0005-R600-SI-Add-pattern-for-i64-ctlz_zero_undef.patch</legend></fieldset>

      <br>

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">From 5600b629eb48753641ffac3cc73279b42e547e46 Mon Sep 17 00:00:00 2001

From: Tom Stellard <a moz-do-not-send="true" class="moz-txt-link-rfc2396E" href="mailto:thomas.stellard@amd.com"><thomas.stellard@amd.com></a>

Date: Tue, 16 Sep 2014 09:18:31 -0400

Subject: [PATCH 5/8] R600/SI: Add pattern for i64 ctlz_zero_undef

---

 lib/Target/R600/SIInstrInfo.cpp      | 127 +++++++++++++++++++++++++++++++++--

 lib/Target/R600/SIInstrInfo.h        |   9 +++

 lib/Target/R600/SIInstructions.td    |  24 +++++--

 test/CodeGen/R600/ctlz_zero_undef.ll |  33 +++++++++

 4 files changed, 182 insertions(+), 11 deletions(-)

diff --git a/lib/Target/R600/SIInstrInfo.cpp b/lib/Target/R600/SIInstrInfo.cpp

index 1b90d41..ae9cbe9 100644

--- a/lib/Target/R600/SIInstrInfo.cpp

+++ b/lib/Target/R600/SIInstrInfo.cpp

@@ -510,7 +510,12 @@ bool SIInstrInfo::expandPostRAPseudo(MachineBasicBlock::iterator MI) const {

     // This is just a placeholder for register allocation.

     MI->eraseFromParent();

     break;

+

+  case AMDGPU::S_CTLZ_ZERO_UNDEF_B32_B64:

+    MI->setDesc(get(AMDGPU::S_FLBIT_I32_B64));

+    return false;

   }

+

   return true;

 }

@@ -1556,6 +1561,19 @@ void SIInstrInfo::moveSMRDToVALU(MachineInstr *MI, MachineRegisterInfo &MRI) con

   }

 }

+void SIInstrInfo::getUsesToMoveToVALU(unsigned Reg,

+               const MachineRegisterInfo &MRI,

+               SmallVectorImpl<MachineInstr *> &Worklist) const {

+

+  for (MachineRegisterInfo::use_iterator I = MRI.use_begin(Reg),

+       E = MRI.use_end(); I != E; ++I) {</pre>

      </div>

    </blockquote>

    <br>

    I think you can use range loop with MRI.use_operands(Reg) instead<br>

    <blockquote cite="mid:20140917152131.GA22471@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

+    MachineInstr &UseMI = *I->getParent();

+    if (!canReadVGPR(UseMI, I.getOperandNo())) {

+      Worklist.push_back(&UseMI);

+    }

+  }

+}

+

 void SIInstrInfo::moveToVALU(MachineInstr &TopInst) const {

   SmallVector<MachineInstr *, 128> Worklist;

   Worklist.push_back(&TopInst);

@@ -1624,6 +1642,16 @@ void SIInstrInfo::moveToVALU(MachineInstr &TopInst) const {

       Inst->eraseFromParent();

       continue;

+    case AMDGPU::S_CTLZ_ZERO_UNDEF_B32_B64:

+      splitScalar64BitFLBIT(Worklist, Inst, true);

+      Inst->eraseFromParent();

+      continue;

+

+    case AMDGPU::S_FLBIT_I32_B64:

+      splitScalar64BitFLBIT(Worklist, Inst);

+      Inst->eraseFromParent();

+      continue;

+

     case AMDGPU::S_BFE_U64:

     case AMDGPU::S_BFE_I64:

     case AMDGPU::S_BFM_B64:

@@ -1710,13 +1738,7 @@ void SIInstrInfo::moveToVALU(MachineInstr &TopInst) const {

     // Legalize the operands

     legalizeOperands(Inst);

-    for (MachineRegisterInfo::use_iterator I = MRI.use_begin(NewDstReg),

-           E = MRI.use_end(); I != E; ++I) {

-      MachineInstr &UseMI = *I->getParent();

-      if (!canReadVGPR(UseMI, I.getOperandNo())) {

-        Worklist.push_back(&UseMI);

-      }

-    }

+    getUsesToMoveToVALU(NewDstReg, MRI, Worklist);

   }

 }

@@ -1890,6 +1912,97 @@ void SIInstrInfo::splitScalar64BitBCNT(SmallVectorImpl<MachineInstr *> &Worklist

   Worklist.push_back(Second);

 }

+void SIInstrInfo::splitScalar64BitFLBIT(SmallVectorImpl<MachineInstr*> &Worklist,

+                                        MachineInstr *Inst,

+                                        bool IsZeroUndef) const {

+  MachineBasicBlock &MBB = *Inst->getParent();

+  MachineRegisterInfo &MRI = MBB.getParent()->getRegInfo();

+

+  MachineBasicBlock::iterator MII = Inst;

+  DebugLoc DL = Inst->getDebugLoc();

+

+  MachineOperand &Dest = Inst->getOperand(0);

+  MachineOperand &Src = Inst->getOperand(1);

+

+  const TargetRegisterClass *SrcRC = Src.isReg() ?

+    MRI.getRegClass(Src.getReg()) :

+    &AMDGPU::SGPR_64RegClass;

+

+  const TargetRegisterClass *SrcSubRC = RI.getSubRegClass(SrcRC, AMDGPU::sub0);

+

+  MachineOperand SrcRegSub0 = buildExtractSubRegOrImm(MII, MRI, Src, SrcRC,

+                                                      AMDGPU::sub0, SrcSubRC);

+  MachineOperand SrcRegSub1 = buildExtractSubRegOrImm(MII, MRI, Src, SrcRC,

+                                                      AMDGPU::sub1, SrcSubRC);

+

+

+  unsigned HiResultReg = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);

+  unsigned LoResultReg = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);

+

+  unsigned IsHiZeroReg= MRI.createVirtualRegister(&AMDGPU::SGPR_64RegClass);

+  unsigned LHSReg = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);

+  unsigned RHSReg = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);

+

+  unsigned DstReg = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);

+  unsigned DstFinalReg = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);

+

+  // S_FLBIT_I32_B64 src0

+  //

+  // if (src0.hi == 0) {

+  //   dst = V_FFBH_U32 src0.lo + 32

+  // } else {

+  //   dst = V_FFBH_U32 src0.hi + 0;

+  // }

+  //

+  // if (src0 == 0) {

+  //   dst = -1;

+  // } else {

+  //   dst = dst;

+  // }

+

+  BuildMI(MBB, MII, DL, get(AMDGPU::V_FFBH_U32_e32), HiResultReg)

+          .addReg(SrcRegSub1.getReg());

+

+  BuildMI(MBB, MII, DL, get(AMDGPU::V_FFBH_U32_e32), LoResultReg)

+          .addReg(SrcRegSub0.getReg());

+

+  BuildMI(MBB, MII, DL, get(AMDGPU::V_CMP_EQ_U32_e64), IsHiZeroReg)

+          .addImm(0)

+          .addReg(SrcRegSub1.getReg());

+

+  BuildMI(MBB, MII, DL, get(AMDGPU::V_CNDMASK_B32_e64), LHSReg)

+          .addReg(HiResultReg)

+          .addReg(LoResultReg)

+          .addReg(IsHiZeroReg);

+

+  BuildMI(MBB, MII, DL, get(AMDGPU::V_CNDMASK_B32_e64), RHSReg)

+          .addImm(0)

+          .addImm(32)

+          .addReg(IsHiZeroReg);

+

+  BuildMI(MBB, MII, DL, get(AMDGPU::V_ADD_I32_e32), DstReg)

+          .addReg(LHSReg)

+          .addReg(RHSReg);

+

+  if (!IsZeroUndef) {

+    unsigned IsSrcZeroReg = MRI.createVirtualRegister(&AMDGPU::SGPR_64RegClass);

+    BuildMI(MBB, MII, DL, get(AMDGPU::V_CMP_EQ_U64_e64), IsSrcZeroReg)

+            .addImm(0)

+            .addReg(Src.getReg());

+

+    BuildMI(MBB, MII, DL, get(AMDGPU::V_CNDMASK_B32_e64), DstFinalReg)

+            .addReg(DstReg)

+            .addImm(-1)

+            .addReg(IsSrcZeroReg);

+  } else {

+    DstFinalReg = DstReg;

+  }

+

+  MRI.replaceRegWith(Dest.getReg(), DstFinalReg);

+

+  getUsesToMoveToVALU(DstFinalReg, MRI, Worklist);

+}

+

 void SIInstrInfo::addDescImplicitUseDef(const MCInstrDesc &NewDesc,

                                         MachineInstr *Inst) const {

   // Add the implict and explicit register definitions.

diff --git a/lib/Target/R600/SIInstrInfo.h b/lib/Target/R600/SIInstrInfo.h

index a32318a..be8776d 100644

--- a/lib/Target/R600/SIInstrInfo.h

+++ b/lib/Target/R600/SIInstrInfo.h

@@ -53,6 +53,10 @@ private:

   void splitScalar64BitBCNT(SmallVectorImpl<MachineInstr *> &Worklist,

                             MachineInstr *Inst) const;

+  void splitScalar64BitFLBIT(SmallVectorImpl<MachineInstr *> &Worklist,

+                             MachineInstr *Inst,

+                             bool IsZeroUndef = false) const;

+

   void addDescImplicitUseDef(const MCInstrDesc &Desc, MachineInstr *MI) const;

 public:

@@ -182,6 +186,11 @@ public:

   void moveSMRDToVALU(MachineInstr *MI, MachineRegisterInfo &MRI) const;

+  /// \brief Look at all the uses of \p Reg and add use instructions that need

+  /// to be moved to the VALU to \p Worklist.

+  void getUsesToMoveToVALU(unsigned Reg, const MachineRegisterInfo &MRI,

+                           SmallVectorImpl<MachineInstr *> &Worklist) const;

+

   /// \brief Replace this instruction's opcode with the equivalent VALU

   /// opcode.  This function will also move the users of \p MI to the

   /// VALU if necessary.

diff --git a/lib/Target/R600/SIInstructions.td b/lib/Target/R600/SIInstructions.td

index 8cbdc55..88147bf 100644

--- a/lib/Target/R600/SIInstructions.td

+++ b/lib/Target/R600/SIInstructions.td

@@ -119,7 +119,7 @@ def S_FLBIT_I32_B32 : SOP1_32 <0x00000015, "S_FLBIT_I32_B32",

   [(set i32:$dst, (ctlz_zero_undef i32:$src0))]

 >;

-//def S_FLBIT_I32_B64 : SOP1_32 <0x00000016, "S_FLBIT_I32_B64", []>;

+def S_FLBIT_I32_B64 : SOP1_32_64 <0x00000016, "S_FLBIT_I32_B64", []>;

 def S_FLBIT_I32 : SOP1_32 <0x00000017, "S_FLBIT_I32", []>;

 //def S_FLBIT_I32_I64 : SOP1_32 <0x00000018, "S_FLBIT_I32_I64", []>;

 def S_SEXT_I32_I8 : SOP1_32 <0x00000019, "S_SEXT_I32_I8",

@@ -287,6 +287,19 @@ def S_BFE_I64 : SOP2_64 <0x0000002a, "S_BFE_I64", []>;

 //def S_CBRANCH_G_FORK : SOP2_ <0x0000002b, "S_CBRANCH_G_FORK", []>;

 def S_ABSDIFF_I32 : SOP2_32 <0x0000002c, "S_ABSDIFF_I32", []>;

+

+let isPseudo = 1 in {

+

+// We select ctlz_zero_undef to this pseudo instruction rather

+// than S_FLBIT_I32_B64, so that in the event we need to move it to

+// the VGPR, we can produce a more optimized VALU version since we</pre>

      </div>

    </blockquote>

    Typo, VGPR -> VALU<br>

    <blockquote cite="mid:20140917152131.GA22471@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

+// know that zero inputs are undefined.

+def S_CTLZ_ZERO_UNDEF_B32_B64 : SOP2 <0,

+  (outs SReg_32:$dst), (ins SReg_64:$src0), "", []

+>;

+

+}

+

 //===----------------------------------------------------------------------===//

 // SOPC Instructions

 //===----------------------------------------------------------------------===//

@@ -1845,13 +1858,16 @@ def : Pat <

 // SOP1 Patterns

 //===----------------------------------------------------------------------===//

-def : Pat <

-  (i64 (ctpop i64:$src)),

+class Sop3264Pat <SDNode node, Instruction inst> : Pat <</pre>

      </div>

    </blockquote>

    The naming convention other places seems to be to add _s between

    numbers<br>

    <blockquote cite="mid:20140917152131.GA22471@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

+  (i64 (node i64:$src)),

   (INSERT_SUBREG (INSERT_SUBREG (i64 (IMPLICIT_DEF)),

-    (S_BCNT1_I32_B64 $src), sub0),

+    (inst $src), sub0),

     (S_MOV_B32 0), sub1)

 >;

+def : Sop3264Pat <ctpop, S_BCNT1_I32_B64>;

+def : Sop3264Pat <ctlz_zero_undef, S_CTLZ_ZERO_UNDEF_B32_B64>;

+

 //===----------------------------------------------------------------------===//

 // SOP2 Patterns

 //===----------------------------------------------------------------------===//

diff --git a/test/CodeGen/R600/ctlz_zero_undef.ll b/test/CodeGen/R600/ctlz_zero_undef.ll

index 1340ef9..1740bd9 100644

--- a/test/CodeGen/R600/ctlz_zero_undef.ll

+++ b/test/CodeGen/R600/ctlz_zero_undef.ll

@@ -4,6 +4,9 @@

 declare i32 @llvm.ctlz.i32(i32, i1) nounwind readnone

 declare <2 x i32> @llvm.ctlz.v2i32(<2 x i32>, i1) nounwind readnone

 declare <4 x i32> @llvm.ctlz.v4i32(<4 x i32>, i1) nounwind readnone

+declare i64 @llvm.ctlz.i64(i64, i1) nounwind readnone

+declare i32 @llvm.r600.read.tidig.x() nounwind readnone

+

 ; FUNC-LABEL: @s_ctlz_zero_undef_i32:

 ; SI: S_LOAD_DWORD [[VAL:s[0-9]+]],

@@ -68,3 +71,33 @@ define void @v_ctlz_zero_undef_v4i32(<4 x i32> addrspace(1)* noalias %out, <4 x

   store <4 x i32> %ctlz, <4 x i32> addrspace(1)* %out, align 16

   ret void

 }

+

+; FUNC-LABEL: @v_ctlz_zero_undef_i64:

+; SI: S_FLBIT_I32_B64

+; EG: FFBH_UINT

+; EG: FFBH_UINT

+define void @v_ctlz_zero_undef_i64(i64 addrspace(1)* noalias %out, i64 %val) nounwind {

+  %ctlz = call i64 @llvm.ctlz.i64(i64 %val, i1 true) nounwind readnone

+  store i64 %ctlz, i64 addrspace(1)* %out, align 4

+  ret void

+}

+

+; FUNC-LABEL: @v_ctlz_zero_undef_i64_vgpr:

+; SI-DAG: V_FFBH_U32_e32

+; SI-DAG: V_FFBH_U32_e32

+; SI-DAG: V_CMP_EQ_U32_e64

+; SI: V_CNDMASK_B32_e64

+; SI: V_CNDMASK_B32_e64

+; SI: V_ADD_I32_e32

+; SI-NOT: V_CNDMASK_B32_e64

+; SI: S_ENDPGM

+; EG: FFBH_UINT

+; EG: FFBH_UINT

+define void @v_ctlz_zero_undef_i64_vgpr(i64 addrspace(1)* noalias %out, i64 %val) nounwind {

+  %tidig = call i32 @llvm.r600.read.tidig.x()

+  %zext = zext i32 %tidig to i64

+  %input = add i64 %val, %zext

+  %ctlz = call i64 @llvm.ctlz.i64(i64 %input, i1 true) nounwind readnone

+  store i64 %ctlz, i64 addrspace(1)* %out, align 4

+  ret void

+}

<div class="moz-txt-sig">-- 

1.8.5.5

</div></pre>

      </div>

      <br>

      <fieldset class="mimeAttachmentHeader"><legend

          class="mimeAttachmentHeaderName">0006-IntegerDivision-Handle-vectors-in-expandDivision-and.patch</legend></fieldset>

      <br>

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">From 6956e902b0aef46d6c71a6b58995aa22d62acd1f Mon Sep 17 00:00:00 2001

From: Tom Stellard <a moz-do-not-send="true" class="moz-txt-link-rfc2396E" href="mailto:thomas.stellard@amd.com"><thomas.stellard@amd.com></a>

Date: Tue, 16 Sep 2014 14:23:42 -0400

Subject: [PATCH 6/8] IntegerDivision: Handle vectors in expandDivision() and

 expandRemainder()

This will be used and tested in a future commit to the R600 backend.

---

 lib/Transforms/Utils/IntegerDivision.cpp | 45 +++++++++++++++++++++++++++++---

 1 file changed, 41 insertions(+), 4 deletions(-)

diff --git a/lib/Transforms/Utils/IntegerDivision.cpp b/lib/Transforms/Utils/IntegerDivision.cpp

index 9f91eeb..66133f6 100644

--- a/lib/Transforms/Utils/IntegerDivision.cpp

+++ b/lib/Transforms/Utils/IntegerDivision.cpp

@@ -366,6 +366,31 @@ static Value *generateUnsignedDivisionCode(Value *Dividend, Value *Divisor,

   return Q_5;

 }

+static void splitVector(BinaryOperator *I,

+                        SmallVectorImpl<BinaryOperator*> &Scalars) {

+

+  Type *Ty = I->getType();

+  unsigned NumElements = Ty->getVectorNumElements();

+

+  IRBuilder<> Builder(I);

+

+  Value *Op0 = I->getOperand(0);

+  Value *Op1 = I->getOperand(1);

+  Type *I32Ty = Type::getInt32Ty(I->getContext());

+  Value *Vec = UndefValue::get(Ty);

+  for (unsigned i = 0, e = NumElements; i != e; ++i) {

+    Value *Idx = Constant::getIntegerValue(I32Ty, APInt(32, i));</pre>

      </div>

    </blockquote>

    ConstantInt::get(I32Ty, i) is shorter<br>

    <blockquote cite="mid:20140917152131.GA22471@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

+    Value *LHS = Builder.CreateExtractElement(Op0, Idx);

+    Value *RHS = Builder.CreateExtractElement(Op1, Idx);

+    Value *Scalar = Builder.CreateBinOp(I->getOpcode(), LHS, RHS);

+    Vec = Builder.CreateInsertElement(Vec, Scalar, Idx);

+    Scalars.push_back(cast<BinaryOperator>(Scalar));

+  }

+  I->replaceAllUsesWith(Vec);

+  I->dropAllReferences();

+  I->eraseFromParent();

+}

+

 /// Generate code to calculate the remainder of two integers, replacing Rem with

 /// the generated code. This currently generates code using the udiv expansion,

 /// but future work includes generating more specialized code, e.g. when more

@@ -381,8 +406,14 @@ bool llvm::expandRemainder(BinaryOperator *Rem) {

   IRBuilder<> Builder(Rem);

   Type *RemTy = Rem->getType();

-  if (RemTy->isVectorTy())

-    llvm_unreachable("Div over vectors not supported");

+  if (RemTy->isVectorTy()) {

+    SmallVector<BinaryOperator*, 8> Scalars;

+    splitVector(Rem, Scalars);

+    for (BinaryOperator *ScalarRem : Scalars) {

+      expandRemainder(ScalarRem);

+    }

+    return true;

+  }

   unsigned RemTyBitWidth = RemTy->getIntegerBitWidth();

@@ -439,8 +470,14 @@ bool llvm::expandDivision(BinaryOperator *Div) {

   IRBuilder<> Builder(Div);

   Type *DivTy = Div->getType();

-  if (DivTy->isVectorTy())

-    llvm_unreachable("Div over vectors not supported");

+  if (DivTy->isVectorTy()) {

+    SmallVector<BinaryOperator*, 8> Scalars;

+    splitVector(Div, Scalars);

+    for (BinaryOperator *ScalarDiv : Scalars) {

+      expandDivision(ScalarDiv);

+    }

+    return true;

+  }

   unsigned DivTyBitWidth = DivTy->getIntegerBitWidth();

<div class="moz-txt-sig">-- 

1.8.5.5

</div></pre>

      </div>

      <br>

      <fieldset class="mimeAttachmentHeader"><legend

          class="mimeAttachmentHeaderName">0007-R600-Add-a-pass-for-expanding-64-bit-division.patch</legend></fieldset>

      <br>

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">From 742ec95149d7c7d00438b80254b81ea0b0c1613f Mon Sep 17 00:00:00 2001

From: Tom Stellard <a moz-do-not-send="true" class="moz-txt-link-rfc2396E" href="mailto:thomas.stellard@amd.com"><thomas.stellard@amd.com></a>

Date: Tue, 16 Sep 2014 09:21:56 -0400

Subject: [PATCH 7/8] R600: Add a pass for expanding 64-bit division</pre>

      </div>

    </blockquote>

    Should this just be moved to the generic IR passes? The other

    utilities have a corresponding pass version already, it's just weird

    that IntegerDivision doesn't and there's nothing really AMDGPU

    specific here (other than a new target provided check for if a type

    should be expanded).<br>

    <br>

    <blockquote cite="mid:20140917152131.GA22471@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

---

 lib/Target/R600/AMDGPU.h                |   1 +

 lib/Target/R600/AMDGPUExpandDIVMOD.cpp  | 107 ++++++++++++++++++++++++++++++++

 lib/Target/R600/AMDGPUTargetMachine.cpp |   1 +

 lib/Target/R600/CMakeLists.txt          |   1 +

 test/CodeGen/R600/sdiv.ll               |  43 +++++++------

 test/CodeGen/R600/sdiv_vec.ll           |  46 ++++++++++++++

 test/CodeGen/R600/udiv.ll               |  55 ++++++++++++----

 test/CodeGen/R600/udiv_vec.ll           |  47 ++++++++++++++

 test/CodeGen/R600/udivrem64.ll          |  82 ------------------------

 9 files changed, 270 insertions(+), 113 deletions(-)

 create mode 100644 lib/Target/R600/AMDGPUExpandDIVMOD.cpp

 create mode 100644 test/CodeGen/R600/sdiv_vec.ll

 create mode 100644 test/CodeGen/R600/udiv_vec.ll

 delete mode 100644 test/CodeGen/R600/udivrem64.ll

diff --git a/lib/Target/R600/AMDGPU.h b/lib/Target/R600/AMDGPU.h

index ff4d6b4..e968eba 100644

--- a/lib/Target/R600/AMDGPU.h

+++ b/lib/Target/R600/AMDGPU.h

@@ -50,6 +50,7 @@ void initializeSILowerI1CopiesPass(PassRegistry &);

 extern char &SILowerI1CopiesID;

 // Passes common to R600 and SI

+FunctionPass *createAMDGPUExpandDIVMODPass();

 FunctionPass *createAMDGPUPromoteAlloca(const AMDGPUSubtarget &ST);

 Pass *createAMDGPUStructurizeCFGPass();

 FunctionPass *createAMDGPUISelDag(TargetMachine &tm);

diff --git a/lib/Target/R600/AMDGPUExpandDIVMOD.cpp b/lib/Target/R600/AMDGPUExpandDIVMOD.cpp

new file mode 100644

index 0000000..98d997d

--- /dev/null

+++ b/lib/Target/R600/AMDGPUExpandDIVMOD.cpp

@@ -0,0 +1,107 @@

+//===-- AMDGPUExpandDIVMOD.cpp - Expand div/mod instructions --------------===//

+//

+//                     The LLVM Compiler Infrastructure

+//

+// This file is distributed under the University of Illinois Open Source

+// License. See LICENSE.TXT for details.

+//

+//===----------------------------------------------------------------------===//

+//

+/// \file

+//===----------------------------------------------------------------------===//

+

+#include "AMDGPU.h"

+#include "llvm/IR/IRBuilder.h"

+#include "llvm/IR/InstVisitor.h"

+#include "llvm/Transforms/Utils/IntegerDivision.h"

+

+#include "llvm/Support/Debug.h"

+using namespace llvm;

+

+namespace {

+

+class AMDGPUExpandDIVMOD : public FunctionPass,

+                           public InstVisitor<AMDGPUExpandDIVMOD, bool> {

+

+  static char ID;

+  std::vector<BinaryOperator *> Divs;

+  std::vector<BinaryOperator *> Rems;

+

+public:

+  AMDGPUExpandDIVMOD() : FunctionPass(ID) { }

+  bool doInitialization(Module &M) override;

+  bool runOnFunction(Function &F) override;

+  const char *getPassName() const override {

+    return "AMDGPU Expand div/mod";

+  }

+  bool visitInstruction(Instruction &I) { return false; }

+  bool visitSDiv(BinaryOperator &I);

+  bool visitUDiv(BinaryOperator &I);

+  bool visitSRem(BinaryOperator &I);

+  bool visitURem(BinaryOperator &I);

+

+};

+

+} // End anonymous namespace

+

+char AMDGPUExpandDIVMOD::ID = 0;

+

+bool AMDGPUExpandDIVMOD::doInitialization(Module &M) {

+  return false;

+}

+

+bool AMDGPUExpandDIVMOD::runOnFunction(Function &F) {

+

+  for (Function::iterator BBI = F.begin(), BBE = F.end(); BBI != BBE; ++BBI) {

+       BasicBlock *BB = BBI;

+    for (BasicBlock::iterator II = BB->begin(), IE = BB->end(); II != IE; ++II) {

+        Instruction *I = II;

+        if (visit(*I)) {

+          BBI = F.begin();

+          break;

+        }

+    }

+  }

+

+  return false;

+}

+

+static bool shouldExpandDivMod(const BinaryOperator &I) {

+  return I.getType()->getScalarType() == Type::getInt64Ty(I.getContext());

+}

+

+bool AMDGPUExpandDIVMOD::visitSDiv(BinaryOperator &I) {

+  if (shouldExpandDivMod(I)) {

+    expandDivision(&I);

+    return true;

+  }

+  return false;

+}

+

+bool AMDGPUExpandDIVMOD::visitUDiv(BinaryOperator &I) {

+  if (shouldExpandDivMod(I)) {

+    expandDivision(&I);

+    return true;

+  }

+  return false;

+}

+

+bool AMDGPUExpandDIVMOD::visitSRem(BinaryOperator &I) {

+  if (shouldExpandDivMod(I)) {

+    expandRemainder(&I);

+    return true;

+  }

+  return false;

+}

+

+bool AMDGPUExpandDIVMOD::visitURem(BinaryOperator &I) {

+  if (shouldExpandDivMod(I)) {

+    expandRemainder(&I);

+    return true;

+  }

+  return false;

+}

+

+FunctionPass *llvm::createAMDGPUExpandDIVMODPass() {

+  return new AMDGPUExpandDIVMOD();

+}

diff --git a/lib/Target/R600/AMDGPUTargetMachine.cpp b/lib/Target/R600/AMDGPUTargetMachine.cpp

index c95a941..1a95d86 100644

--- a/lib/Target/R600/AMDGPUTargetMachine.cpp

+++ b/lib/Target/R600/AMDGPUTargetMachine.cpp

@@ -119,6 +119,7 @@ void AMDGPUPassConfig::addCodeGenPrepare() {

 bool

 AMDGPUPassConfig::addPreISel() {

   const AMDGPUSubtarget &ST = TM->getSubtarget<AMDGPUSubtarget>();

+  addPass(createAMDGPUExpandDIVMODPass());

   addPass(createFlattenCFGPass());

   if (ST.IsIRStructurizerEnabled())

     addPass(createStructurizeCFGPass());

diff --git a/lib/Target/R600/CMakeLists.txt b/lib/Target/R600/CMakeLists.txt

index c5f4680..c94b4e3 100644

--- a/lib/Target/R600/CMakeLists.txt

+++ b/lib/Target/R600/CMakeLists.txt

@@ -14,6 +14,7 @@ add_public_tablegen_target(AMDGPUCommonTableGen)

 add_llvm_target(R600CodeGen

   AMDILCFGStructurizer.cpp

   AMDGPUAsmPrinter.cpp

+  AMDGPUExpandDIVMOD.cpp

   AMDGPUFrameLowering.cpp

   AMDGPUIntrinsicInfo.cpp

   AMDGPUISelDAGToDAG.cpp

diff --git a/test/CodeGen/R600/sdiv.ll b/test/CodeGen/R600/sdiv.ll

index e922d5c..3d74e90 100644

--- a/test/CodeGen/R600/sdiv.ll

+++ b/test/CodeGen/R600/sdiv.ll

@@ -81,23 +81,30 @@ define void @sdiv_v4i32_4(<4 x i32> addrspace(1)* %out, <4 x i32> addrspace(1)*

   ret void

 }

-; Tests for 64-bit divide bypass.

-; define void @test_get_quotient(i64 addrspace(1)* %out, i64 %a, i64 %b) nounwind {

-;   %result = sdiv i64 %a, %b

-;   store i64 %result, i64 addrspace(1)* %out, align 8

-;   ret void

-; }

+; For the 64-bit division, just make sure we don't crash with a 'cannot select'

+; error.

+; FUNC-LABEL: @test_get_quotient

+; SI:S_ENDPGM</pre>

      </div>

    </blockquote>

    missing space after :, and the same for the rest of these tests<br>

    <blockquote cite="mid:20140917152131.GA22471@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

+define void @test_get_quotient(i64 addrspace(1)* %out, i64 %a, i64 %b) nounwind {

+  %result = sdiv i64 %a, %b

+  store i64 %result, i64 addrspace(1)* %out, align 8

+  ret void

+}

-; define void @test_get_remainder(i64 addrspace(1)* %out, i64 %a, i64 %b) nounwind {

-;   %result = srem i64 %a, %b

-;   store i64 %result, i64 addrspace(1)* %out, align 8

-;   ret void

-; }

+; FUNC-LABEL: @test_get_remainder

+; SI:S_ENDPGM</pre>

      </div>

    </blockquote>

    <blockquote cite="mid:20140917152131.GA22471@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

+define void @test_get_remainder(i64 addrspace(1)* %out, i64 %a, i64 %b) nounwind {

+  %result = srem i64 %a, %b

+  store i64 %result, i64 addrspace(1)* %out, align 8

+  ret void

+}

-; define void @test_get_quotient_and_remainder(i64 addrspace(1)* %out, i64 %a, i64 %b) nounwind {

-;   %resultdiv = sdiv i64 %a, %b

-;   %resultrem = srem i64 %a, %b

-;   %result = add i64 %resultdiv, %resultrem

-;   store i64 %result, i64 addrspace(1)* %out, align 8

-;   ret void

-; }

+; FUNC-LABEL: @test_get_quotient_and_remainder

+; SI:S_ENDPGM</pre>

      </div>

    </blockquote>

    <blockquote cite="mid:20140917152131.GA22471@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

+define void @test_get_quotient_and_remainder(i64 addrspace(1)* %out, i64 %a, i64 %b) nounwind {

+  %resultdiv = sdiv i64 %a, %b

+  %resultrem = srem i64 %a, %b

+  %result = add i64 %resultdiv, %resultrem

+  store i64 %result, i64 addrspace(1)* %out, align 8

+  ret void

+}

diff --git a/test/CodeGen/R600/sdiv_vec.ll b/test/CodeGen/R600/sdiv_vec.ll

new file mode 100644

index 0000000..4e8ace4

--- /dev/null

+++ b/test/CodeGen/R600/sdiv_vec.ll

@@ -0,0 +1,46 @@

+;FIXME: llc < %s -march=r600 -mcpu=redwood | FileCheck --check-prefix=EG --check-prefix=FUNC %s

+;RUN: llc < %s -march=r600 -mcpu=verde -verify-machineinstrs | FileCheck --check-prefix=SI-CHECK --check-prefix=FUNC %s

+

+; FIXME: i64 vector kernel args don't work on Evergeen/NI.

+

+; FUNC-LABEL: @test_get_quotient_and_remainder_v2

+; SI:S_ENDPGM</pre>

      </div>

    </blockquote>

    <blockquote cite="mid:20140917152131.GA22471@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

+define void @test_get_quotient_and_remainder_v2(<2 x i64> addrspace(1)* %out, <2 x i64> %a, <2 x i64> %b) nounwind {

+  %resultdiv = sdiv <2 x i64> %a, %b

+  %resultrem = srem <2 x i64> %a, %b

+  %result = add <2 x i64> %resultdiv, %resultrem

+  store <2 x i64> %result, <2 x i64> addrspace(1)* %out, align 8

+  ret void

+}

+

+; FUNC-LABEL: @test_get_quotient_and_remainder_v4

+; SI:S_ENDPGM

+define void @test_get_quotient_and_remainder_v4(<4 x i64> addrspace(1)* %out, <4 x i64> %a, <4 x i64> %b) nounwind {

+  %resultdiv = sdiv <4 x i64> %a, %b

+  %resultrem = srem <4 x i64> %a, %b

+  %result = add <4 x i64> %resultdiv, %resultrem

+  store <4 x i64> %result, <4 x i64> addrspace(1)* %out, align 8

+  ret void

+}

+

+; FUNC-LABEL: @test_get_quotient_and_remainder_v8

+; SI:S_ENDPGM

+define void @test_get_quotient_and_remainder_v8( <8 x i64> addrspace(1)* %out, <8 x i64> %a, <8 x i64> %b) nounwind {

+  %resultdiv = sdiv <8 x i64> %a, %b

+  %resultrem = srem <8 x i64> %a, %b

+  %result = add <8 x i64> %resultdiv, %resultrem

+  store <8 x i64> %result, <8 x i64> addrspace(1)* %out, align 8

+  ret void

+}

+

+; FIXME: The v16 case causes machine verifier errors.  I think this is related

+; to register spilling.

+; FIXME-FUNC-LABEL: @test_get_quotient_and_remainder_v16

+; FIXME-SI:S_ENDPGM

+;define void @test_get_quotient_and_remainder_v16(<16 x i64> addrspace(1)* %out, <16 x i64> %a, <16 x i64> %b) nounwind {

+;  %resultdiv = sdiv <16 x i64> %a, %b

+;  %resultrem = srem <16 x i64> %a, %b

+;  %result = add <16 x i64> %resultdiv, %resultrem

+;  store <16 x i64> %result, <16 x i64> addrspace(1)* %out, align 8

+;  ret void

+;}

diff --git a/test/CodeGen/R600/udiv.ll b/test/CodeGen/R600/udiv.ll

index 5371321..b483e76 100644

--- a/test/CodeGen/R600/udiv.ll

+++ b/test/CodeGen/R600/udiv.ll

@@ -1,9 +1,9 @@

-;RUN: llc < %s -march=r600 -mcpu=redwood | FileCheck --check-prefix=EG-CHECK %s

-;RUN: llc < %s -march=r600 -mcpu=verde -verify-machineinstrs | FileCheck --check-prefix=SI-CHECK %s

+;FIXME: llc < %s -march=r600 -mcpu=redwood | FileCheck --check-prefix=EG --check-prefix=FUNC %s

+;RUN: llc < %s -march=r600 -mcpu=verde -verify-machineinstrs | FileCheck --check-prefix=SI --check-prefix=FUNC %s

-;EG-CHECK-LABEL: @test

-;EG-CHECK-NOT: SETGE_INT

-;EG-CHECK: CF_END

+;EG-LABEL: @test

+;EG-NOT: SETGE_INT

+;EG: CF_END

 define void @test(i32 addrspace(1)* %out, i32 addrspace(1)* %in) {

   %b_ptr = getelementptr i32 addrspace(1)* %in, i32 1

@@ -18,10 +18,10 @@ define void @test(i32 addrspace(1)* %out, i32 addrspace(1)* %in) {

 ;The goal of this test is to make sure the ISel doesn't fail when it gets

 ;a v4i32 udiv

-;EG-CHECK-LABEL: @test2

-;EG-CHECK: CF_END

-;SI-CHECK-LABEL: @test2

-;SI-CHECK: S_ENDPGM

+;EG-LABEL: @test2

+;EG: CF_END

+;SI-LABEL: @test2

+;SI: S_ENDPGM

 define void @test2(<2 x i32> addrspace(1)* %out, <2 x i32> addrspace(1)* %in) {

   %b_ptr = getelementptr <2 x i32> addrspace(1)* %in, i32 1

@@ -32,10 +32,10 @@ define void @test2(<2 x i32> addrspace(1)* %out, <2 x i32> addrspace(1)* %in) {

   ret void

 }

-;EG-CHECK-LABEL: @test4

-;EG-CHECK: CF_END

-;SI-CHECK-LABEL: @test4

-;SI-CHECK: S_ENDPGM

+;EG-LABEL: @test4

+;EG: CF_END

+;SI-LABEL: @test4

+;SI: S_ENDPGM

 define void @test4(<4 x i32> addrspace(1)* %out, <4 x i32> addrspace(1)* %in) {

   %b_ptr = getelementptr <4 x i32> addrspace(1)* %in, i32 1

@@ -45,3 +45,32 @@ define void @test4(<4 x i32> addrspace(1)* %out, <4 x i32> addrspace(1)* %in) {

   store <4 x i32> %result, <4 x i32> addrspace(1)* %out

   ret void

 }

+

+; For the 64-bit division, just make sure we don't crash with a 'cannot select'

+; error.

+; FUNC-LABEL: @test_get_quotient

+; SI: S_ENDPGM

+define void @test_get_quotient(i64 addrspace(1)* %out, i64 %a, i64 %b) nounwind {

+  %result = udiv i64 %a, %b

+  store i64 %result, i64 addrspace(1)* %out, align 8

+  ret void

+}

+

+; FIXME: The AMDILCFGStructurizer crashes on this function for redwood.

+; FUNC-LABEL: @test_get_remainder

+; SI: S_ENDPGM

+define void @test_get_remainder(i64 addrspace(1)* %out, i64 %a, i64 %b) nounwind {

+  %result = urem i64 %a, %b

+  store i64 %result, i64 addrspace(1)* %out, align 8

+  ret void

+}

+

+; FUNC-LABEL: @test_get_quotient_and_remainder

+; SI: S_ENDPGM

+define void @test_get_quotient_and_remainder(i64 addrspace(1)* %out, i64 %a, i64 %b) nounwind {

+  %resultdiv = udiv i64 %a, %b

+  %resultrem = urem i64 %a, %b

+  %result = add i64 %resultdiv, %resultrem

+  store i64 %result, i64 addrspace(1)* %out, align 8

+  ret void

+}

diff --git a/test/CodeGen/R600/udiv_vec.ll b/test/CodeGen/R600/udiv_vec.ll

new file mode 100644

index 0000000..942c323

--- /dev/null

+++ b/test/CodeGen/R600/udiv_vec.ll

@@ -0,0 +1,47 @@

+;FIXME: llc < %s -march=r600 -mcpu=redwood | FileCheck --check-prefix=EG %s

+;RUN: llc < %s -march=r600 -mcpu=verde -verify-machineinstrs | FileCheck --check-prefix=SI %s

+

+; FIXME: i64 vector kernel args don't work on Evergeen/NI.

+

+; FUNC-LABEL: @test_get_quotient_and_remainder_v2

+; SI:S_ENDPGM

+define void @test_get_quotient_and_remainder_v2(<2 x i64> addrspace(1)* %out, <2 x i64> %a, <2 x i64> %b) nounwind {

+  %resultdiv = udiv <2 x i64> %a, %b

+  %resultrem = urem <2 x i64> %a, %b

+  %result = add <2 x i64> %resultdiv, %resultrem

+  store <2 x i64> %result, <2 x i64> addrspace(1)* %out, align 8

+  ret void

+}

+

+; FUNC-LABEL: @test_get_quotient_and_remainder_v4

+; SI:S_ENDPGM

+define void @test_get_quotient_and_remainder_v4(<4 x i64> addrspace(1)* %out, <4 x i64> %a, <4 x i64> %b) nounwind {

+  %resultdiv = udiv <4 x i64> %a, %b

+  %resultrem = urem <4 x i64> %a, %b

+  %result = add <4 x i64> %resultdiv, %resultrem

+  store <4 x i64> %result, <4 x i64> addrspace(1)* %out, align 8

+  ret void

+}

+

+; FUNC-LABEL: @test_get_quotient_and_remainder_v8

+; SI:S_ENDPGM

+define void @test_get_quotient_and_remainder_v8( <8 x i64> addrspace(1)* %out, <8 x i64> %a, <8 x i64> %b) nounwind {

+  %resultdiv = udiv <8 x i64> %a, %b

+  %resultrem = urem <8 x i64> %a, %b

+  %result = add <8 x i64> %resultdiv, %resultrem

+  store <8 x i64> %result, <8 x i64> addrspace(1)* %out, align 8

+  ret void

+}

+

+; FIXME: The v16 case causes machine verifier errors.  I think this is related

+; to register spilling.

+

+; FIXME-FUNC-LABEL: @test_get_quotient_and_remainder_v16

+; FIXME-SI:S_ENDPGM

+;define void @test_get_quotient_and_remainder_v16(<16 x i64> addrspace(1)* %out, <16 x i64> %a, <16 x i64> %b) nounwind {

+;  %resultdiv = udiv <16 x i64> %a, %b

+;  %resultrem = urem <16 x i64> %a, %b

+;  %result = add <16 x i64> %resultdiv, %resultrem

+;  store <16 x i64> %result, <16 x i64> addrspace(1)* %out, align 8

+;  ret void

+;}

diff --git a/test/CodeGen/R600/udivrem64.ll b/test/CodeGen/R600/udivrem64.ll

deleted file mode 100644

index a71315a..0000000

--- a/test/CodeGen/R600/udivrem64.ll

+++ /dev/null

@@ -1,82 +0,0 @@

-;XUN: llc < %s -march=r600 -mcpu=SI -verify-machineinstrs| FileCheck --check-prefix=SI --check-prefix=FUNC %s

-;RUN: llc < %s -march=r600 -mcpu=redwood | FileCheck --check-prefix=EG --check-prefix=FUNC %s

-

-;FUNC-LABEL: @test_udiv

-;EG: RECIP_UINT

-;EG: LSHL {{.*}}, 1,

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;SI: S_ENDPGM

-define void @test_udiv(i64 addrspace(1)* %out, i64 %x, i64 %y) {

-  %result = udiv i64 %x, %y

-  store i64 %result, i64 addrspace(1)* %out

-  ret void

-}

-

-;FUNC-LABEL: @test_urem

-;EG: RECIP_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: BFE_UINT

-;EG: AND_INT {{.*}}, 1,

-;SI: S_ENDPGM

-define void @test_urem(i64 addrspace(1)* %out, i64 %x, i64 %y) {

-  %result = urem i64 %x, %y

-  store i64 %result, i64 addrspace(1)* %out

-  ret void

-}

<div class="moz-txt-sig">-- 

1.8.5.5

</div></pre>

      </div>

      <br>

      <fieldset class="mimeAttachmentHeader"><legend

          class="mimeAttachmentHeaderName">0008-R600-Factor-i64-UDIVREM-lowering-into-its-own-fuctio.patch</legend></fieldset>

      <br>

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">From 1fbfb38377036de2a1e32ebcda911dadea252994 Mon Sep 17 00:00:00 2001

From: Tom Stellard <a moz-do-not-send="true" class="moz-txt-link-rfc2396E" href="mailto:thomas.stellard@amd.com"><thomas.stellard@amd.com></a>

Date: Tue, 16 Sep 2014 14:59:51 -0400

Subject: [PATCH 8/8] R600: Factor i64 UDIVREM lowering into its own fuction

This is so it could potentially be used by SI.  Howerver, the current

implemtation does not always produce correct results, so the</pre>

      </div>

    </blockquote>

    Typo "implemtation"<br>

    <blockquote cite="mid:20140917152131.GA22471@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

AMDGPUExpandDIVMOD pass is being used instead.</pre>

      </div>

    </blockquote>

    Is the output from this better than what you get from the expand

    divmod pass? I would expect so since I thought the pass inserts

    branching.<br>

    <br>

    What kind of bugs? Does it not work for a certain range of values?<br>

    <br>

    <blockquote cite="mid:20140917152131.GA22471@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

---

 lib/Target/R600/AMDGPUISelLowering.cpp | 84 ++++++++++++++++++++++++++++++++++

 lib/Target/R600/AMDGPUISelLowering.h   |  2 +

 lib/Target/R600/R600ISelLowering.cpp   |  1 +

 3 files changed, 87 insertions(+)

diff --git a/lib/Target/R600/AMDGPUISelLowering.cpp b/lib/Target/R600/AMDGPUISelLowering.cpp

index f353c94..c66bb7e 100644

--- a/lib/Target/R600/AMDGPUISelLowering.cpp

+++ b/lib/Target/R600/AMDGPUISelLowering.cpp

@@ -1485,11 +1485,95 @@ SDValue AMDGPUTargetLowering::LowerDIVREM24(SDValue Op, SelectionDAG &DAG, bool

   return DAG.getMergeValues(Res, DL);

 }

+/// XXX: FIXME This function appears to have some bugs and does not always

+/// produce correct results.  It is currently superseded by the

+/// AMDGPUExpandDIVREM pass.

+void AMDGPUTargetLowering::LowerUDIVREM64(SDValue Op,

+                                      SelectionDAG &DAG,

+                                      SmallVectorImpl<SDValue> &Results) const {

+  assert(Op.getValueType() == MVT::i64);

+

+  SDLoc DL(Op);

+  EVT VT = Op.getValueType();

+  EVT HalfVT = VT.getHalfSizedIntegerVT(*DAG.getContext());

+

+  SDValue one = DAG.getConstant(1, HalfVT);

+  SDValue zero = DAG.getConstant(0, HalfVT);

+

+  //HiLo split

+  SDValue LHS = Op.getOperand(0);

+  SDValue LHS_Lo = DAG.getNode(ISD::EXTRACT_ELEMENT, DL, HalfVT, LHS, zero);

+  SDValue LHS_Hi = DAG.getNode(ISD::EXTRACT_ELEMENT, DL, HalfVT, LHS, one);

+

+  SDValue RHS = Op.getOperand(1);

+  SDValue RHS_Lo = DAG.getNode(ISD::EXTRACT_ELEMENT, DL, HalfVT, RHS, zero);

+  SDValue RHS_Hi = DAG.getNode(ISD::EXTRACT_ELEMENT, DL, HalfVT, RHS, one);

+

+  // Get Speculative values

+  SDValue DIV_Part = DAG.getNode(ISD::UDIV, DL, HalfVT, LHS_Hi, RHS_Lo);

+  SDValue REM_Part = DAG.getNode(ISD::UREM, DL, HalfVT, LHS_Hi, RHS_Lo);

+

+  SDValue REM_Hi = zero;

+  SDValue REM_Lo = DAG.getSelectCC(DL, RHS_Hi, zero, REM_Part, LHS_Hi, ISD::SETEQ);

+

+  SDValue DIV_Hi = DAG.getSelectCC(DL, RHS_Hi, zero, DIV_Part, zero, ISD::SETEQ);

+  SDValue DIV_Lo = zero;

+

+  const unsigned halfBitWidth = HalfVT.getSizeInBits();

+

+  for (unsigned i = 0; i < halfBitWidth; ++i) {

+    SDValue POS = DAG.getConstant(halfBitWidth - i - 1, HalfVT);

+    // Get Value of high bit

+    SDValue HBit;

+    if (halfBitWidth == 32 && Subtarget->hasBFE()) {

+      HBit = DAG.getNode(AMDGPUISD::BFE_U32, DL, HalfVT, LHS_Lo, POS, one);

+    } else {

+      HBit = DAG.getNode(ISD::SRL, DL, HalfVT, LHS_Lo, POS);

+      HBit = DAG.getNode(ISD::AND, DL, HalfVT, HBit, one);

+    }

+

+    SDValue Carry = DAG.getNode(ISD::SRL, DL, HalfVT, REM_Lo,

+      DAG.getConstant(halfBitWidth - 1, HalfVT));

+    REM_Hi = DAG.getNode(ISD::SHL, DL, HalfVT, REM_Hi, one);

+    REM_Hi = DAG.getNode(ISD::OR, DL, HalfVT, REM_Hi, Carry);

+

+    REM_Lo = DAG.getNode(ISD::SHL, DL, HalfVT, REM_Lo, one);

+    REM_Lo = DAG.getNode(ISD::OR, DL, HalfVT, REM_Lo, HBit);

+

+

+    SDValue REM = DAG.getNode(ISD::BUILD_PAIR, DL, VT, REM_Lo, REM_Hi);

+

+    SDValue BIT = DAG.getConstant(1 << (halfBitWidth - i - 1), HalfVT);

+    SDValue realBIT = DAG.getSelectCC(DL, REM, RHS, BIT, zero, ISD::SETGE);

+

+    DIV_Lo = DAG.getNode(ISD::OR, DL, HalfVT, DIV_Lo, realBIT);

+

+    // Update REM

+

+    SDValue REM_sub = DAG.getNode(ISD::SUB, DL, VT, REM, RHS);

+

+    REM = DAG.getSelectCC(DL, REM, RHS, REM_sub, REM, ISD::SETGE);

+    REM_Lo = DAG.getNode(ISD::EXTRACT_ELEMENT, DL, HalfVT, REM, zero);

+    REM_Hi = DAG.getNode(ISD::EXTRACT_ELEMENT, DL, HalfVT, REM, one);

+  }

+

+  SDValue REM = DAG.getNode(ISD::BUILD_PAIR, DL, VT, REM_Lo, REM_Hi);

+  SDValue DIV = DAG.getNode(ISD::BUILD_PAIR, DL, VT, DIV_Lo, DIV_Hi);

+  Results.push_back(DIV);

+  Results.push_back(REM);

+}

+

 SDValue AMDGPUTargetLowering::LowerUDIVREM(SDValue Op,

                                            SelectionDAG &DAG) const {

   SDLoc DL(Op);

   EVT VT = Op.getValueType();

+  if (VT == MVT::i64) {

+    SmallVector<SDValue, 2> Results;

+    LowerUDIVREM64(Op, DAG, Results);

+    return DAG.getMergeValues(Results, DL);

+  }

+

   SDValue Num = Op.getOperand(0);

   SDValue Den = Op.getOperand(1);

diff --git a/lib/Target/R600/AMDGPUISelLowering.h b/lib/Target/R600/AMDGPUISelLowering.h

index fc4c006..e94c333 100644

--- a/lib/Target/R600/AMDGPUISelLowering.h

+++ b/lib/Target/R600/AMDGPUISelLowering.h

@@ -83,6 +83,8 @@ protected:

   SDValue LowerSTORE(SDValue Op, SelectionDAG &DAG) const;

   SDValue LowerSDIVREM(SDValue Op, SelectionDAG &DAG) const;

   SDValue LowerDIVREM24(SDValue Op, SelectionDAG &DAG, bool sign) const;

+  void LowerUDIVREM64(SDValue Op, SelectionDAG &DAG,

+                                    SmallVectorImpl<SDValue> &Results) const;

   bool isHWTrueValue(SDValue Op) const;

   bool isHWFalseValue(SDValue Op) const;

diff --git a/lib/Target/R600/R600ISelLowering.cpp b/lib/Target/R600/R600ISelLowering.cpp

index 3bc8cb9..04ba910 100644

--- a/lib/Target/R600/R600ISelLowering.cpp

+++ b/lib/Target/R600/R600ISelLowering.cpp

@@ -901,6 +901,7 @@ void R600TargetLowering::ReplaceNodeResults(SDNode *N,

   }

   case ISD::UDIVREM: {

     SDValue Op = SDValue(N, 0);

+    LowerUDIVREM64(Op, DAG, Results);

     SDLoc DL(Op);

     EVT VT = Op.getValueType();

     EVT HalfVT = VT.getHalfSizedIntegerVT(*DAG.getContext());

<div class="moz-txt-sig">-- 

1.8.5.5

</div></pre>

      </div>

      <br>

      <fieldset class="mimeAttachmentHeader"></fieldset>

      <br>

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">_______________________________________________

llvm-commits mailing list

<a moz-do-not-send="true" class="moz-txt-link-abbreviated" href="mailto:llvm-commits@cs.uiuc.edu">llvm-commits@cs.uiuc.edu</a>

<a moz-do-not-send="true" class="moz-txt-link-freetext" href="http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits">http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits</a>

</pre>

      </div>

    </blockquote>

    <br>

  </body>

</html>