[PATCH] D18048: [AArch64] Enable load clustering of unscaled loads in the MI Scheduler.

Chad Rosier via llvm-commits llvm-commits at lists.llvm.org
Thu Mar 10 08:47:44 PST 2016


mcrosier created this revision.
mcrosier added reviewers: jmolloy, t.p.northover.
mcrosier added subscribers: llvm-commits, gberry, mssimpso, junbuml, bmakam, haicheng.
Herald added subscribers: rengolin, aemerson.

This patch adds unscaled loads to the TII getMemOpBaseRegImmOfs API, which is used to control clustering in the MI scheduler.  This is done to create more opportunities for load pairing.  I've also added the scaled LDRSWui instruction, which was missing from the list of scaled instructions.  Overall, this patch increases the number of pairs formed from unscaled loads by about 3% for Spec2006.  I saw similar results for Spec2000.  Also, I didn't see any serious changes in the register allocator statistics (see below).
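
To illustrate the idea, here is a minimal standalone sketch of the adjacency test applied to unscaled loads: convert each byte offset into an element offset, then require the two element offsets to be consecutive.  The helper names (toElementOffset, looksPairable) are hypothetical and are not LLVM APIs.  The sketch also makes two assumptions of its own: it checks both byte offsets for divisibility by the access size, and it takes the range of a 7-bit signed field to be [-64, 63].

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an LLVM API): convert the byte offset of an
// unscaled load into the element offset a paired load would encode.
// Returns false when the byte offset is not a multiple of the access size.
bool toElementOffset(int64_t ByteOffset, int64_t Stride,
                     int64_t &ElementOffset) {
  assert(Stride > 0 && "access size must be positive");
  if (ByteOffset % Stride != 0)
    return false;
  ElementOffset = ByteOffset / Stride;
  return true;
}

// Hypothetical predicate: true when two unscaled loads of the same size sit
// at adjacent element offsets and the first offset fits the assumed range.
bool looksPairable(int64_t ByteOffset1, int64_t ByteOffset2, int64_t Stride) {
  int64_t Elt1 = 0, Elt2 = 0;
  if (!toElementOffset(ByteOffset1, Stride, Elt1) ||
      !toElementOffset(ByteOffset2, Stride, Elt2))
    return false;
  if (Elt1 < -64 || Elt1 > 63) // assumed range of a 7-bit signed immediate
    return false;
  return Elt1 + 1 == Elt2;
}
```

For instance, looksPairable(-8, 0, 8) holds for two 8-byte loads at byte offsets -8 and 0, while looksPairable(4, 12, 8) does not, because 4 is not a multiple of the 8-byte access size.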

I'm crafting a test case, but whatever I come up with, if anything at all, will likely be fragile.  Suggestions on how to do this are welcome!

Below is a summary of the llvm stats comparing builds without and with this patch.  For example, the first stat indicates that 148 (or ~3.18%) more ldps are generated from unscaled loads, and the second that the total number of paired instructions increases by 317 (or 0.64%).
 
Summary:
         148 (3.18) aarch64-ldst-opt - Number of load/store from unscaled generated     
         317 (0.64) aarch64-ldst-opt - Number of load/store pair instructions generated 
          10 (0.42) aarch64-ldst-opt - Number of post-index updates folded              
        -272 (-0.01) asm-printer - Number of machine instrs printed                      
        -936 (-0.00) assembler - Number of emitted object file bytes                     
           6 (0.00) assembler - Number of evaluated fixups                              
           6 (0.00) mccodeemitter - Number of MC fixups created.                        
        -272 (-0.01) mccodeemitter - Number of MC instructions emitted.                  
          11 (0.00) mcexpr - Number of MCExpr evaluations                               
          96 (0.00) pei - Number of bytes used for stack in all functions               
           4 (0.00) regalloc - Number of copies inserted for splitting                  
          -1 (-0.00) regalloc - Number of identity moves eliminated after rewriting      
           3 (0.01) regalloc - Number of interferences evicted                          
           2 (0.22) regalloc - Number of live ranges fractured by DCE                   
          15 (0.01) regalloc - Number of new live ranges queued                         
           8 (0.00) regalloc - Number of registers assigned                             
           4 (0.01) regalloc - Number of registers unassigned                           
           1 (0.00) regalloc - Number of rematerialized defs for spilling               
          -1 (-0.03) regalloc - Number of rematerialized defs for splitting              
           2 (0.01) regalloc - Number of spill slots allocated                          
           1 (0.00) regalloc - Number of spilled live ranges                            
           5 (0.04) regalloc - Number of split global live ranges                       
          -2 (-0.10) regalloc - Number of split local live ranges                        
           2 (0.01) regalloc - Number of splits finished                                
           2 (0.01) regalloc - Number of splits that were simple                        
          29 (0.07) slotindexes - Number of local renumberings                          
           1 (0.03) stackslotcoloring - Number of stack slots eliminated due to coloring
           1 (0.00) tailduplication - Additional instructions due to tail duplication   
           1 (0.04) tailduplication - Number of dead blocks removed 

Passed all correctness tests for EEMBC, Spec200X, and the llvm test-suite.  Performance results look to be mostly noise, with minor improvements here and there.

 Chad

http://reviews.llvm.org/D18048

Files:
  lib/Target/AArch64/AArch64InstrInfo.cpp

Index: lib/Target/AArch64/AArch64InstrInfo.cpp
===================================================================
--- lib/Target/AArch64/AArch64InstrInfo.cpp
+++ lib/Target/AArch64/AArch64InstrInfo.cpp
@@ -1359,6 +1359,14 @@
   case AArch64::LDRQui:
   case AArch64::LDRXui:
   case AArch64::LDRWui:
+  case AArch64::LDRSWui:
+  // Unscaled instructions.
+  case AArch64::LDURSi:
+  case AArch64::LDURDi:
+  case AArch64::LDURQi:
+  case AArch64::LDURWi:
+  case AArch64::LDURXi:
+  case AArch64::LDURSWi:
     unsigned Width;
     return getMemOpBaseRegImmOfsWidth(LdSt, BaseReg, Offset, Width, TRI);
   };
@@ -1428,6 +1436,7 @@
     Scale = Width = 8;
     break;
   case AArch64::LDRWui:
+  case AArch64::LDRSWui:
   case AArch64::LDRSui:
   case AArch64::STRWui:
   case AArch64::STRSui:
@@ -1463,14 +1472,47 @@
     return false;
   if (FirstLdSt->getOpcode() != SecondLdSt->getOpcode())
     return false;
-  // getMemOpBaseRegImmOfs guarantees that oper 2 isImm.
-  unsigned Ofs1 = FirstLdSt->getOperand(2).getImm();
-  // Allow 6 bits of positive range.
-  if (Ofs1 > 64)
+
+  // getMemOpBaseRegImmOfs guarantees that operand 2 isImm.
+  int64_t Offset1 = FirstLdSt->getOperand(2).getImm();
+  int64_t Offset2 = SecondLdSt->getOperand(2).getImm();
+
+  // Scale the unscaled offsets.
+  if (isUnscaledLdSt(FirstLdSt)) {
+    unsigned OffsetStride = 1;
+    switch (FirstLdSt->getOpcode()) {
+    default:
+      return false;
+    case AArch64::LDURQi:
+      OffsetStride = 16;
+      break;
+    case AArch64::LDURXi:
+    case AArch64::LDURDi:
+      OffsetStride = 8;
+      break;
+    case AArch64::LDURWi:
+    case AArch64::LDURSi:
+    case AArch64::LDURSWi:
+      OffsetStride = 4;
+      break;
+    }
+    // If the byte-offset isn't a multiple of the stride, we can't pair these
+    // loads/stores.
+    if (Offset1 % OffsetStride)
+      return false;
+
+    // Convert the byte-offset used by unscaled into an "element" offset used
+    // by the scaled pair load/store instructions.
+    Offset1 /= OffsetStride;
+    Offset2 /= OffsetStride;
+  }
+  // Pairwise instructions have a 7-bit signed offset field.
+  if (Offset1 > 64 || Offset1 < -64)
     return false;
+
   // The caller should already have ordered First/SecondLdSt by offset.
-  unsigned Ofs2 = SecondLdSt->getOperand(2).getImm();
-  return Ofs1 + 1 == Ofs2;
+  assert(Offset1 <= Offset2 && "Caller should have ordered offsets.");
+  return Offset1 + 1 == Offset2;
 }
 
 bool AArch64InstrInfo::shouldScheduleAdjacent(MachineInstr *First,


