<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Aug 12, 2015 at 10:41 AM, Sanjay Patel <span dir="ltr"><<a href="mailto:spatel@rotateright.com" target="_blank">spatel@rotateright.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div><div><div><div><div><div><div><div><div><div><div><div><div>Hi Sean -<br></div><br>I think your example shows 3 possible improvements. Let me know if this looks right, and I'll file some bugs to track them:<br></div><br>1. When addressing multiple places beyond an imm8, load a register with a base and use SIB addressing to bring the offsets within 8-bits. So the example would end up looking something like this:<br></div>   movl   $376, %ebx<br></div>   movq   $0, (%rax, %rbx)<br></div>   movq   $0, 8(%rax, %rbx)<br></div>   movq   $0, 16(%rax, %rbx)<br>...<br><br></div></div></div></div></div></div></div></div></div></blockquote><div><br></div><div>Here's a quick overview of alternatives:</div><div><br></div><div><span style="font-family:Menlo;font-size:12px">33 bytes (each mov to mem pays imm32 for $0 and imm32 for the offset)</span><br></div><div><p style="margin:0px;font-size:12px;font-family:Menlo">        movq    $0, 376(%rax)           ## encoding: [0x48,0xc7,0x80,0x78,0x01,0x00,0x00,0x00,0x00,0x00,0x00]</p>

<p style="margin:0px;font-size:12px;font-family:Menlo">        movq    $0, 384(%rax)           ## encoding: [0x48,0xc7,0x80,0x80,0x01,0x00,0x00,0x00,0x00,0x00,0x00]</p>

<p style="margin:0px;font-size:12px;font-family:Menlo">        movq    $0, 392(%rax)           ## encoding: [0x48,0xc7,0x80,0x88,0x01,0x00,0x00,0x00,0x00,0x00,0x00]</p>

<p style="margin:0px;font-size:12px;font-family:Menlo;min-height:14px">31 bytes</p>

<p style="margin:0px;font-size:12px;font-family:Menlo">        movl    $376, %ebx              ## encoding: [0xbb,0x78,0x01,0x00,0x00]</p>

<p style="margin:0px;font-size:12px;font-family:Menlo">        movq    $0, (%rax,%rbx)         ## encoding: [0x48,0xc7,0x04,0x18,0x00,0x00,0x00,0x00]</p>

<p style="margin:0px;font-size:12px;font-family:Menlo">        movq    $0, 8(%rax,%rbx)        ## encoding: [0x48,0xc7,0x44,0x18,0x08,0x00,0x00,0x00,0x00]</p>

<p style="margin:0px;font-size:12px;font-family:Menlo">        movq    $0, 16(%rax,%rbx)       ## encoding: [0x48,0xc7,0x44,0x18,0x10,0x00,0x00,0x00,0x00]</p>

<p style="margin:0px;font-size:12px;font-family:Menlo;min-height:14px">23 bytes<br></p>

<p style="margin:0px;font-size:12px;font-family:Menlo">        xorl    %ebx, %ebx              ## encoding: [0x31,0xdb]</p>

<p style="margin:0px;font-size:12px;font-family:Menlo">        movq    %rbx, 376(%rax)         ## encoding: [0x48,0x89,0x98,0x78,0x01,0x00,0x00]</p>

<p style="margin:0px;font-size:12px;font-family:Menlo">        movq    %rbx, 384(%rax)         ## encoding: [0x48,0x89,0x98,0x80,0x01,0x00,0x00]</p>

<p style="margin:0px;font-size:12px;font-family:Menlo">        movq    %rbx, 392(%rax)         ## encoding: [0x48,0x89,0x98,0x88,0x01,0x00,0x00]</p>

<p style="margin:0px;font-size:12px;font-family:Menlo;min-height:14px">21 bytes<br></p>

<p style="margin:0px;font-size:12px;font-family:Menlo">        xorl    %ebx, %ebx              ## encoding: [0x31,0xdb]</p>

<p style="margin:0px;font-size:12px;font-family:Menlo">        movl    $376, %ecx              ## encoding: [0xb9,0x78,0x01,0x00,0x00]</p>

<p style="margin:0px;font-size:12px;font-family:Menlo">        movq    %rbx, (%rax,%rcx)       ## encoding: [0x48,0x89,0x1c,0x08]</p>

<p style="margin:0px;font-size:12px;font-family:Menlo">        movq    %rbx, 8(%rax,%rcx)      ## encoding: [0x48,0x89,0x5c,0x08,0x08]</p>

<p style="margin:0px;font-size:12px;font-family:Menlo">        movq    %rbx, 16(%rax,%rcx)     ## encoding: [0x48,0x89,0x5c,0x08,0x10]</p></div><div><br></div><div>As you can see, getting rid of the immediate $0 is always a win for size: a 2-byte xor reg,reg pays for itself as soon as it replaces a single imm32 (4 bytes). Shrinking the offset from imm32 to imm8 is a closer call: with just 3 of the evil huge movs, it is hard to amortize the cost of a mov r32,imm32 (5 bytes) for the index of the SIB, but we barely squeeze by (2 bytes saved).</div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div><div><div><div><div><div><div></div><div>Not sure we can do this unless optimizing for size. See below.<br></div><div><br></div>2. When storing the same immediate multiple times, use a reg to hold the immediate even when not optimizing specifically for size (this was mentioned as a follow-on in D11363):<br></div>   xorl   %ebx, %ebx            <--- xor is only for zero; in the general case, this would be a mov (load immediate)<br></div>   movq   %rbx, 376(%rax)<br>   movq   %rbx, 384(%rax)<br>   movq   %rbx, 392(%rax)    <br>...<br><br></div>I don't think this qualifies as a no-brainer when not optimizing for size. If it's an xor to make a zero, then it's *almost* free (handled by the renamer) on any recent OOO x86, but that xor still requires some decode/uop resources.</div></div></div></div></div></blockquote><div><br></div><div>I would double-check this on Jaguar. It may use a mov-imm micro-op.</div><div><br></div><div>Jaguar can decode 2 instructions per cycle (but see below), and `mov [mem],imm` has throughput 1 (see Agner), so we should not be decode-bottlenecked.</div><div>For reference, `mov [mem],imm` is decoded into 2 micro-ops (see "Table 1. 
Typical Instruction Mappings" in [SOG]) whereas `mov [mem],reg` is only 1 micro-op, so it is *preferable* to use a reg since it amortizes the cost of the `mov-imm` micro-op across the stores.</div><div><br></div><div>[SOG] <a href="http://support.amd.com/TechDocs/52128_16h_Software_Opt_Guide.zip">http://support.amd.com/TechDocs/52128_16h_Software_Opt_Guide.zip</a></div><div><br></div><div>On Jaguar at least, I believe that >=4 11-byte instructions in a row like this will cause Jaguar's poor little decoder to fail to decode 2 instructions per cycle on at least one of them, so that should amortize the decode cost of the xor anyway. Intel chips are usually not decode-limited like this, though. Simon Whittaker has more information about exactly which circumstances cause Jaguar to fail to decode, but it would be nice if we could run a peephole across the text that checks every pair of adjacent commutable instructions to avoid the situations where we would fail to decode (IIRC, basically if the opcode byte of the second instruction falls more than 5 bytes into the second 16-byte window).</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div><div><div> In the general case where we use a mov to load the reg with the immediate, this argument gets tougher. We need a perf heuristic that says 'saving X bytes of instructions is worth adding Y extra instructions of type Z'?<br></div></div></div></div></div></blockquote><div><br></div><div>Thankfully, in almost all the cases I've seen the immediate is 0, so I think we can get the lion's share of cases by handling only 0.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div><div><div><br></div>3. 
If we really want to make this example smaller and faster, we should merge stores like we do in DAGCombiner's MergeConsecutiveStores:<br></div>   vxorps   %ymm0, %ymm0   <--- assumes we have AVX for 32-byte ops; if not, SSE for 16-byte<br></div>   vmovups   %ymm0, 376(%rax)<br>...<br><br></div></div></blockquote><div><br></div><div>This seems like it is always a win.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div></div>Are the stores in the example created too late for the DAGCombiner? Do we need to repeat some subset of that merge functionality in a machine pass?<br></div></blockquote><div><br></div><div>I'm not an expert in this backend stuff, but the decisions we need to make here are pretty low-level dependent (e.g. exploiting proximity to a call to know certain registers are available), so it probably makes sense to do this at the MI layer.</div><div><br></div><div>-- Sean Silva</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div><div><div><div><div><div><br><div><div><div><div><div><br></div></div></div></div></div></div></div></div></div></div></div></div></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Aug 12, 2015 at 2:00 AM, Sean Silva <span dir="ltr"><<a href="mailto:chisophugis@gmail.com" target="_blank">chisophugis@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">This may be interesting even outside of optimizing for size. 
For example, I see wonderful fragments like the following all over my binaries:<div><br><div><p style="margin:0px;font-size:12px;font-family:Menlo">   20b7f:<span style="color:#5330e1">      </span>48 c7 80 78 01 00 00 00 00 00 00 <span style="color:#5330e1">      </span>movq<span style="color:#5330e1">   </span>$0, 376(%rax)</p>

<p style="margin:0px;font-size:12px;font-family:Menlo">   20b8a:<span style="color:#5330e1">      </span>48 c7 80 80 01 00 00 00 00 00 00 <span style="color:#5330e1">      </span>movq<span style="color:#5330e1">   </span>$0, 384(%rax)</p>

<p style="margin:0px;font-size:12px;font-family:Menlo">   20b95:<span style="color:#5330e1">      </span>48 c7 80 88 01 00 00 00 00 00 00 <span style="color:#5330e1">      </span>movq<span style="color:#5330e1">   </span>$0, 392(%rax)</p>

<p style="margin:0px;font-size:12px;font-family:Menlo">   20ba0:<span style="color:#5330e1">      </span>48 c7 80 90 01 00 00 00 00 00 00 <span style="color:#5330e1">      </span>movq<span style="color:#5330e1">   </span>$0, 400(%rax)</p>

<p style="margin:0px;font-size:12px;font-family:Menlo">   20bab:<span style="color:#5330e1">      </span>48 c7 80 a0 01 00 00 00 00 00 00 <span style="color:#5330e1">      </span>movq<span style="color:#5330e1">   </span>$0, 416(%rax)</p>

<p style="margin:0px;font-size:12px;font-family:Menlo">   20bb6:<span style="color:#5330e1">      </span>48 c7 80 a8 09 00 00 00 00 00 00 <span style="color:#5330e1">      </span>movq<span style="color:#5330e1">   </span>$0, 2472(%rax)</p></div><div><br></div><div>Yes, 11-byte instructions just to zero out 8 bytes of memory :)</div><div>If you are omitting the frame pointer and using stack slots beyond an imm8 offset, you pay an extra byte for the SIB.</div><div><br></div><div>As observed in this patch, this pattern often occurs in close proximity to calls (the register is usually rax for a return value or rbp/rsp when preparing stack-passed arguments for a call), so there is often a no-brainer choice of register to xor and then use the r,m forms.</div><div><br></div><div>I have seen functions for which more than half of the text size is due to these sorts of instructions, which gets into pretty serious icache threat territory and is worth looking at even outside optsize.</div><span><font color="#888888"><div><br></div><div>-- Sean Silva</div><div><br></div><div><br></div><div><br></div></font></span></div></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Aug 11, 2015 at 7:10 AM, Michael Kuperstein via llvm-commits <span dir="ltr"><<a href="mailto:llvm-commits@lists.llvm.org" target="_blank">llvm-commits@lists.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Author: mkuper<br>

Date: Tue Aug 11 09:10:58 2015<br>

New Revision: 244601<br>

<br>

URL: <a href="http://llvm.org/viewvc/llvm-project?rev=244601&view=rev" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-project?rev=244601&view=rev</a><br>

Log:<br>

[X86] Allow merging of immediates within a basic block for code size savings<br>

<br>

First step in preventing immediates that occur more than once within a single<br>

basic block from being pulled into their users, in order to prevent unnecessarily<br>

large instruction encodings. Currently enabled only when optimizing for size.<br>

<br>

Patch by: <a href="mailto:zia.ansari@intel.com" target="_blank">zia.ansari@intel.com</a><br>

Differential Revision: <a href="http://reviews.llvm.org/D11363" rel="noreferrer" target="_blank">http://reviews.llvm.org/D11363</a><br>

<br>

Added:<br>

    llvm/trunk/test/CodeGen/X86/immediate_merging.ll<br>

Removed:<br>

    llvm/trunk/test/CodeGen/X86/remat-invalid-liveness.ll<br>

Modified:<br>

    llvm/trunk/lib/Target/X86/X86ISelDAGToDAG.cpp<br>

    llvm/trunk/lib/Target/X86/X86InstrArithmetic.td<br>

    llvm/trunk/lib/Target/X86/X86InstrInfo.td<br>

<br>

Modified: llvm/trunk/lib/Target/X86/X86ISelDAGToDAG.cpp<br>

URL: <a href="http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86ISelDAGToDAG.cpp?rev=244601&r1=244600&r2=244601&view=diff" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86ISelDAGToDAG.cpp?rev=244601&r1=244600&r2=244601&view=diff</a><br>

==============================================================================<br>

--- llvm/trunk/lib/Target/X86/X86ISelDAGToDAG.cpp (original)<br>

+++ llvm/trunk/lib/Target/X86/X86ISelDAGToDAG.cpp Tue Aug 11 09:10:58 2015<br>

@@ -283,6 +283,82 @@ namespace {<br>

         Segment = CurDAG->getRegister(0, MVT::i32);<br>

     }<br>

<br>

+    // Utility function to determine whether we should avoid selecting<br>

+    // immediate forms of instructions for better code size or not.<br>

+    // At a high level, we'd like to avoid such instructions when<br>

+    // we have similar constants used within the same basic block<br>

+    // that can be kept in a register.<br>

+    //<br>

+    bool shouldAvoidImmediateInstFormsForSize(SDNode *N) const {<br>

+      uint32_t UseCount = 0;<br>

+<br>

+      // Do not want to hoist if we're not optimizing for size.<br>

+      // TODO: We'd like to remove this restriction.<br>

+      // See the comment in X86InstrInfo.td for more info.<br>

+      if (!OptForSize)<br>

+        return false;<br>

+<br>

+      // Walk all the users of the immediate.<br>

+      for (SDNode::use_iterator UI = N->use_begin(),<br>

+           UE = N->use_end(); (UI != UE) && (UseCount < 2); ++UI) {<br>

+<br>

+        SDNode *User = *UI;<br>

+<br>

+        // This user is already selected. Count it as a legitimate use and<br>

+        // move on.<br>

+        if (User->isMachineOpcode()) {<br>

+          UseCount++;<br>

+          continue;<br>

+        }<br>

+<br>

+        // We want to count stores of immediates as real uses.<br>

+        if (User->getOpcode() == ISD::STORE &&<br>

+            User->getOperand(1).getNode() == N) {<br>

+          UseCount++;<br>

+          continue;<br>

+        }<br>

+<br>

+        // We don't currently match users that have > 2 operands (except<br>

+        // for stores, which are handled above)<br>

+        // Those instruction won't match in ISEL, for now, and would<br>

+        // be counted incorrectly.<br>

+        // This may change in the future as we add additional instruction<br>

+        // types.<br>

+        if (User->getNumOperands() != 2)<br>

+          continue;<br>

+<br>

+        // Immediates that are used for offsets as part of stack<br>

+        // manipulation should be left alone. These are typically<br>

+        // used to indicate SP offsets for argument passing and<br>

+        // will get pulled into stores/pushes (implicitly).<br>

+        if (User->getOpcode() == X86ISD::ADD ||<br>

+            User->getOpcode() == ISD::ADD    ||<br>

+            User->getOpcode() == X86ISD::SUB ||<br>

+            User->getOpcode() == ISD::SUB) {<br>

+<br>

+          // Find the other operand of the add/sub.<br>

+          SDValue OtherOp = User->getOperand(0);<br>

+          if (OtherOp.getNode() == N)<br>

+            OtherOp = User->getOperand(1);<br>

+<br>

+          // Don't count if the other operand is SP.<br>

+          RegisterSDNode *RegNode;<br>

+          if (OtherOp->getOpcode() == ISD::CopyFromReg &&<br>

+              (RegNode = dyn_cast_or_null<RegisterSDNode>(<br>

+                 OtherOp->getOperand(1).getNode())))<br>

+            if ((RegNode->getReg() == X86::ESP) ||<br>

+                (RegNode->getReg() == X86::RSP))<br>

+              continue;<br>

+        }<br>

+<br>

+        // ... otherwise, count this and move on.<br>

+        UseCount++;<br>

+      }<br>

+<br>

+      // If we have more than 1 use, then recommend for hoisting.<br>

+      return (UseCount > 1);<br>

+    }<br>

+<br>

     /// getI8Imm - Return a target constant with the specified value, of type<br>

     /// i8.<br>

     inline SDValue getI8Imm(unsigned Imm, SDLoc DL) {<br>

<br>

Modified: llvm/trunk/lib/Target/X86/X86InstrArithmetic.td<br>

URL: <a href="http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86InstrArithmetic.td?rev=244601&r1=244600&r2=244601&view=diff" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86InstrArithmetic.td?rev=244601&r1=244600&r2=244601&view=diff</a><br>

==============================================================================<br>

--- llvm/trunk/lib/Target/X86/X86InstrArithmetic.td (original)<br>

+++ llvm/trunk/lib/Target/X86/X86InstrArithmetic.td Tue Aug 11 09:10:58 2015<br>

@@ -615,14 +615,14 @@ class X86TypeInfo<ValueType vt, string i<br>

 def invalid_node : SDNode<"<<invalid_node>>", SDTIntLeaf,[],"<<invalid_node>>">;<br>

<br>

<br>

-def Xi8  : X86TypeInfo<i8 , "b", GR8 , loadi8 , i8mem ,<br>

-                       Imm8 , i8imm ,    imm,          i8imm   , invalid_node,<br>

+def Xi8  : X86TypeInfo<i8, "b", GR8, loadi8, i8mem,<br>

+                       Imm8, i8imm, imm8_su, i8imm, invalid_node,<br>

                        0, OpSizeFixed, 0>;<br>

 def Xi16 : X86TypeInfo<i16, "w", GR16, loadi16, i16mem,<br>

-                       Imm16, i16imm,    imm,          i16i8imm, i16immSExt8,<br>

+                       Imm16, i16imm, imm16_su, i16i8imm, i16immSExt8_su,<br>

                        1, OpSize16, 0>;<br>

 def Xi32 : X86TypeInfo<i32, "l", GR32, loadi32, i32mem,<br>

-                       Imm32, i32imm,    imm,          i32i8imm, i32immSExt8,<br>

+                       Imm32, i32imm, imm32_su, i32i8imm, i32immSExt8_su,<br>

                        1, OpSize32, 0>;<br>

 def Xi64 : X86TypeInfo<i64, "q", GR64, loadi64, i64mem,<br>

                        Imm32S, i64i32imm, i64immSExt32, i64i8imm, i64immSExt8,<br>

<br>

Modified: llvm/trunk/lib/Target/X86/X86InstrInfo.td<br>

URL: <a href="http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86InstrInfo.td?rev=244601&r1=244600&r2=244601&view=diff" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86InstrInfo.td?rev=244601&r1=244600&r2=244601&view=diff</a><br>

==============================================================================<br>

--- llvm/trunk/lib/Target/X86/X86InstrInfo.td (original)<br>

+++ llvm/trunk/lib/Target/X86/X86InstrInfo.td Tue Aug 11 09:10:58 2015<br>

@@ -873,6 +873,40 @@ def i16immSExt8  : ImmLeaf<i16, [{ retur<br>

 def i32immSExt8  : ImmLeaf<i32, [{ return Imm == (int8_t)Imm; }]>;<br>

 def i64immSExt8  : ImmLeaf<i64, [{ return Imm == (int8_t)Imm; }]>;<br>

<br>

+// If we have multiple users of an immediate, it's much smaller to reuse<br>

+// the register, rather than encode the immediate in every instruction.<br>

+// This has the risk of increasing register pressure from stretched live<br>

+// ranges, however, the immediates should be trivial to rematerialize by<br>

+// the RA in the event of high register pressure.<br>

+// TODO : This is currently enabled for stores and binary ops. There are more<br>

+// cases for which this can be enabled, though this catches the bulk of the<br>

+// issues.<br>

+// TODO2 : This should really also be enabled under O2, but there's currently<br>

+// an issue with RA where we don't pull the constants into their users<br>

+// when we rematerialize them. I'll follow-up on enabling O2 after we fix that<br>

+// issue.<br>

+// TODO3 : This is currently limited to single basic blocks (DAG creation<br>

+// pulls block immediates to the top and merges them if necessary).<br>

+// Eventually, it would be nice to allow ConstantHoisting to merge constants<br>

+// globally for potentially added savings.<br>

+//<br>

+def imm8_su : PatLeaf<(i8 imm), [{<br>

+    return !shouldAvoidImmediateInstFormsForSize(N);<br>

+}]>;<br>

+def imm16_su : PatLeaf<(i16 imm), [{<br>

+    return !shouldAvoidImmediateInstFormsForSize(N);<br>

+}]>;<br>

+def imm32_su : PatLeaf<(i32 imm), [{<br>

+    return !shouldAvoidImmediateInstFormsForSize(N);<br>

+}]>;<br>

+<br>

+def i16immSExt8_su : PatLeaf<(i16immSExt8), [{<br>

+    return !shouldAvoidImmediateInstFormsForSize(N);<br>

+}]>;<br>

+def i32immSExt8_su : PatLeaf<(i32immSExt8), [{<br>

+    return !shouldAvoidImmediateInstFormsForSize(N);<br>

+}]>;<br>

+<br>

<br>

 def i64immSExt32 : ImmLeaf<i64, [{ return Imm == (int32_t)Imm; }]>;<br>

<br>

@@ -1283,13 +1317,13 @@ def MOV32ri_alt : Ii32<0xC7, MRM0r, (out<br>

 let SchedRW = [WriteStore] in {<br>

 def MOV8mi  : Ii8 <0xC6, MRM0m, (outs), (ins i8mem :$dst, i8imm :$src),<br>

                    "mov{b}\t{$src, $dst|$dst, $src}",<br>

-                   [(store (i8 imm:$src), addr:$dst)], IIC_MOV_MEM>;<br>

+                   [(store (i8 imm8_su:$src), addr:$dst)], IIC_MOV_MEM>;<br>

 def MOV16mi : Ii16<0xC7, MRM0m, (outs), (ins i16mem:$dst, i16imm:$src),<br>

                    "mov{w}\t{$src, $dst|$dst, $src}",<br>

-                   [(store (i16 imm:$src), addr:$dst)], IIC_MOV_MEM>, OpSize16;<br>

+                   [(store (i16 imm16_su:$src), addr:$dst)], IIC_MOV_MEM>, OpSize16;<br>

 def MOV32mi : Ii32<0xC7, MRM0m, (outs), (ins i32mem:$dst, i32imm:$src),<br>

                    "mov{l}\t{$src, $dst|$dst, $src}",<br>

-                   [(store (i32 imm:$src), addr:$dst)], IIC_MOV_MEM>, OpSize32;<br>

+                   [(store (i32 imm32_su:$src), addr:$dst)], IIC_MOV_MEM>, OpSize32;<br>

 def MOV64mi32 : RIi32S<0xC7, MRM0m, (outs), (ins i64mem:$dst, i64i32imm:$src),<br>

                        "mov{q}\t{$src, $dst|$dst, $src}",<br>

                        [(store i64immSExt32:$src, addr:$dst)], IIC_MOV_MEM>;<br>

<br>

Added: llvm/trunk/test/CodeGen/X86/immediate_merging.ll<br>

URL: <a href="http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/immediate_merging.ll?rev=244601&view=auto" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/immediate_merging.ll?rev=244601&view=auto</a><br>

==============================================================================<br>

--- llvm/trunk/test/CodeGen/X86/immediate_merging.ll (added)<br>

+++ llvm/trunk/test/CodeGen/X86/immediate_merging.ll Tue Aug 11 09:10:58 2015<br>

@@ -0,0 +1,82 @@<br>

+; RUN: llc -o - -mtriple=i386-unknown-linux-gnu < %s | FileCheck %s<br>

+; RUN: llc -o - -mtriple=x86_64-unknown-linux-gnu < %s | FileCheck %s<br>

+<br>

+@a = common global i32 0, align 4<br>

+@b = common global i32 0, align 4<br>

+@c = common global i32 0, align 4<br>

+@e = common global i32 0, align 4<br>

+@x = common global i32 0, align 4<br>

+@f = common global i32 0, align 4<br>

+@h = common global i32 0, align 4<br>

+@i = common global i32 0, align 4<br>

+<br>

+; Test -Os to make sure immediates with multiple users don't get pulled in to<br>

+; instructions.<br>

+define i32 @foo() optsize {<br>

+; CHECK-LABEL: foo:<br>

+; CHECK: movl $1234, [[R1:%[a-z]+]]<br>

+; CHECK-NOT: movl $1234, a<br>

+; CHECK-NOT: movl $1234, b<br>

+; CHECK-NOT: movl $12, c<br>

+; CHECK-NOT: cmpl $12, e<br>

+; CHECK: movl [[R1]], a<br>

+; CHECK: movl [[R1]], b<br>

+<br>

+entry:<br>

+  store i32 1234, i32* @a<br>

+  store i32 1234, i32* @b<br>

+  store i32 12, i32* @c<br>

+  %0 = load i32, i32* @e<br>

+  %cmp = icmp eq i32 %0, 12<br>

+  br i1 %cmp, label %if.then, label %if.end<br>

+<br>

+if.then:                                          ; preds = %entry<br>

+  store i32 1, i32* @x<br>

+  br label %if.end<br>

+<br>

+; New block.. Make sure 1234 isn't live across basic blocks from before.<br>

+; CHECK: movl $1234, f<br>

+; CHECK: movl $555, [[R3:%[a-z]+]]<br>

+; CHECK-NOT: movl $555, h<br>

+; CHECK-NOT: addl $555, i<br>

+; CHECK: movl [[R3]], h<br>

+; CHECK: addl [[R3]], i<br>

+<br>

+if.end:                                           ; preds = %if.then, %entry<br>

+  store i32 1234, i32* @f<br>

+  store i32 555, i32* @h<br>

+  %1 = load i32, i32* @i<br>

+  %add1 = add nsw i32 %1, 555<br>

+  store i32 %add1, i32* @i<br>

+  ret i32 0<br>

+}<br>

+<br>

+; Test -O2 to make sure that all immediates get pulled in to their users.<br>

+define i32 @foo2() {<br>

+; CHECK-LABEL: foo2:<br>

+; CHECK: movl $1234, a<br>

+; CHECK: movl $1234, b<br>

+<br>

+entry:<br>

+  store i32 1234, i32* @a<br>

+  store i32 1234, i32* @b<br>

+<br>

+  ret i32 0<br>

+}<br>

+<br>

+declare void @llvm.memset.p0i8.i32(i8* nocapture, i8, i32, i32, i1) #1<br>

+<br>

+@AA = common global [100 x i8] zeroinitializer, align 1<br>

+<br>

+; memset gets lowered in DAG. Constant merging should hoist all the<br>

+; immediates used to store to the individual memory locations. Make<br>

+; sure we don't directly store the immediates.<br>

+define void @foomemset() optsize {<br>

+; CHECK-LABEL: foomemset:<br>

+; CHECK-NOT: movl ${{.*}}, AA<br>

+; CHECK: mov{{l|q}} %{{e|r}}ax, AA<br>

+<br>

+entry:<br>

+  call void @llvm.memset.p0i8.i32(i8* getelementptr inbounds ([100 x i8], [100 x i8]* @AA, i32 0, i32 0), i8 33, i32 24, i32 1, i1 false)<br>

+  ret void<br>

+}<br>

<br>

Removed: llvm/trunk/test/CodeGen/X86/remat-invalid-liveness.ll<br>

URL: <a href="http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/remat-invalid-liveness.ll?rev=244600&view=auto" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/remat-invalid-liveness.ll?rev=244600&view=auto</a><br>

==============================================================================<br>

--- llvm/trunk/test/CodeGen/X86/remat-invalid-liveness.ll (original)<br>

+++ llvm/trunk/test/CodeGen/X86/remat-invalid-liveness.ll (removed)<br>

@@ -1,85 +0,0 @@<br>

-; RUN: llc %s -mcpu=core2 -o - | FileCheck %s<br>

-; This test was failing while tracking the liveness in the register scavenger<br>

-; during the branching folding pass. The allocation of the subregisters was<br>

-; incorrect.<br>

-; I.e., the faulty pattern looked like:<br>

-; CH = movb 64<br>

-; ECX = movl 3 <- CH was killed here.<br>

-; CH = subb CH, ...<br>

-;<br>

-; This reduced test case triggers the crash before the fix, but does not<br>

-; strictly speaking check that the resulting code is correct.<br>

-; To check that the code is actually correct we would need to check the<br>

-; liveness of the produced code.<br>

-;<br>

-; Currently, we check that after ECX = movl 3, we do not have subb CH,<br>

-; whereas CH could have been redefine in between and that would have been<br>

-; totally fine.<br>

-; <rdar://problem/16582185><br>

-target datalayout = "e-m:o-p:32:32-f64:32:64-f80:128-n8:16:32-S128"<br>

-target triple = "i386-apple-macosx10.9"<br>

-<br>

-%struct.A = type { %struct.B, %struct.C, %struct.D*, [1 x i8*] }<br>

-%struct.B = type { i32, [4 x i8] }<br>

-%struct.C = type { i128 }<br>

-%struct.D = type { {}*, [0 x i32] }<br>

-%union.E = type { i32 }<br>

-<br>

-; CHECK-LABEL: __XXX1:<br>

-; CHECK: movl $3, %ecx<br>

-; CHECK-NOT: subb %{{[a-z]+}}, %ch<br>

-; Function Attrs: nounwind optsize ssp<br>

-define fastcc void @__XXX1(%struct.A* %ht) #0 {<br>

-entry:<br>

-  %const72 = bitcast i128 72 to i128<br>

-  %const3 = bitcast i128 3 to i128<br>

-  switch i32 undef, label %if.end196 [<br>

-    i32 1, label %sw.bb.i<br>

-    i32 3, label %sw.bb2.i<br>

-  ]<br>

-<br>

-sw.bb.i:                                          ; preds = %entry<br>

-  %call.i.i.i = tail call i32 undef(%struct.A* %ht, i8 zeroext 22, i32 undef, i32 0, %struct.D* undef)<br>

-  %bf.load.i.i = load i128, i128* undef, align 4<br>

-  %bf.lshr.i.i = lshr i128 %bf.load.i.i, %const72<br>

-  %shl1.i.i = shl nuw nsw i128 %bf.lshr.i.i, 8<br>

-  %shl.i.i = trunc i128 %shl1.i.i to i32<br>

-  br i1 undef, label %cond.false10.i.i, label %__XXX2.exit.i.i<br>

-<br>

-__XXX2.exit.i.i:                    ; preds = %sw.bb.i<br>

-  %extract11.i.i.i = lshr i128 %bf.load.i.i, %const3<br>

-  %extract.t12.i.i.i = trunc i128 %extract11.i.i.i to i32<br>

-  %bf.cast7.i.i.i = and i32 %extract.t12.i.i.i, 3<br>

-  %arrayidx.i.i.i = getelementptr inbounds %struct.A, %struct.A* %ht, i32 0, i32 3, i32 %bf.cast7.i.i.i<br>

-  br label %cond.end12.i.i<br>

-<br>

-cond.false10.i.i:                                 ; preds = %sw.bb.i<br>

-  %arrayidx.i6.i.i = getelementptr inbounds %struct.A, %struct.A* %ht, i32 0, i32 3, i32 0<br>

-  br label %cond.end12.i.i<br>

-<br>

-cond.end12.i.i:                                   ; preds = %cond.false10.i.i, %__XXX2.exit.i.i<br>

-  %.sink.in.i.i = phi i8** [ %arrayidx.i.i.i, %__XXX2.exit.i.i ], [ %arrayidx.i6.i.i, %cond.false10.i.i ]<br>

-  %.sink.i.i = load i8*, i8** %.sink.in.i.i, align 4<br>

-  %tmp = bitcast i8* %.sink.i.i to %union.E*<br>

-  br i1 undef, label %for.body.i.i, label %if.end196<br>

-<br>

-for.body.i.i:                                     ; preds = %for.body.i.i, %cond.end12.i.i<br>

-  %weak.i.i = getelementptr inbounds %union.E, %union.E* %tmp, i32 undef, i32 0<br>

-  %tmp1 = load i32, i32* %weak.i.i, align 4<br>

-  %cmp36.i.i = icmp ne i32 %tmp1, %shl.i.i<br>

-  %or.cond = and i1 %cmp36.i.i, false<br>

-  br i1 %or.cond, label %for.body.i.i, label %if.end196<br>

-<br>

-sw.bb2.i:                                         ; preds = %entry<br>

-  %bf.lshr.i85.i = lshr i128 undef, %const72<br>

-  br i1 undef, label %if.end196, label %__XXX2.exit.i95.i<br>

-<br>

-__XXX2.exit.i95.i:                  ; preds = %sw.bb2.i<br>

-  %extract11.i.i91.i = lshr i128 undef, %const3<br>

-  br label %if.end196<br>

-<br>

-if.end196:                                        ; preds = %__XXX2.exit.i95.i, %sw.bb2.i, %for.body.i.i, %cond.end12.i.i, %entry<br>

-  ret void<br>

-}<br>

-<br>

-attributes #0 = { nounwind optsize ssp "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" }<br>

<br>

<br>


</blockquote></div><br></div>

</div></div></blockquote></div><br></div>

</div></div></blockquote></div><br></div></div>
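<div><br></div><div>P.S. For anyone who wants to re-check the byte totals in the size comparison at the top of this mail, here is a minimal Python sketch that simply counts the bytes in the listed encodings. The variant labels and the helper name total_bytes are made up for illustration; the hex strings are copied from the "## encoding" comments in the four listings.</div>
<pre>
```python
# Hex encodings copied from the "## encoding: [...]" comments in the listings.
# The dictionary keys and total_bytes() are hypothetical names for this sketch.
VARIANTS = {
    "imm store, disp32": [
        "48 c7 80 78 01 00 00 00 00 00 00",  # movq $0, 376(%rax)
        "48 c7 80 80 01 00 00 00 00 00 00",  # movq $0, 384(%rax)
        "48 c7 80 88 01 00 00 00 00 00 00",  # movq $0, 392(%rax)
    ],
    "imm store, SIB + disp8": [
        "bb 78 01 00 00",                    # movl $376, %ebx
        "48 c7 04 18 00 00 00 00",           # movq $0, (%rax,%rbx)
        "48 c7 44 18 08 00 00 00 00",        # movq $0, 8(%rax,%rbx)
        "48 c7 44 18 10 00 00 00 00",        # movq $0, 16(%rax,%rbx)
    ],
    "reg store, disp32": [
        "31 db",                             # xorl %ebx, %ebx
        "48 89 98 78 01 00 00",              # movq %rbx, 376(%rax)
        "48 89 98 80 01 00 00",              # movq %rbx, 384(%rax)
        "48 89 98 88 01 00 00",              # movq %rbx, 392(%rax)
    ],
    "reg store, SIB + disp8": [
        "31 db",                             # xorl %ebx, %ebx
        "b9 78 01 00 00",                    # movl $376, %ecx
        "48 89 1c 08",                       # movq %rbx, (%rax,%rcx)
        "48 89 5c 08 08",                    # movq %rbx, 8(%rax,%rcx)
        "48 89 5c 08 10",                    # movq %rbx, 16(%rax,%rcx)
    ],
}


def total_bytes(instructions):
    """Sum the encoded length, in bytes, of a list of hex-encoded instructions."""
    return sum(len(bytes.fromhex(encoding)) for encoding in instructions)


SIZES = {name: total_bytes(insts) for name, insts in VARIANTS.items()}

for name, size in SIZES.items():
    print(f"{size:2d} bytes  {name}")
```
</pre>
<div>This should print 33, 31, 23 and 21 bytes, matching the totals quoted with each listing.</div>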