[LLVMdev] X86 - Help on fixing a poor code generation bug
Andrea Di Biagio
andrea.dibiagio at gmail.com
Thu Dec 5 07:35:19 PST 2013
Hi all,
I noticed that the x86 backend tends to emit unnecessary vector insert
instructions immediately after SSE scalar FP instructions like
addss/mulss.
For example:
/////////////////////////////////
__m128 foo(__m128 A, __m128 B) {
  return _mm_add_ss(A, B);
}
/////////////////////////////////
produces the sequence:
addss %xmm0, %xmm1
movss %xmm1, %xmm0
which could be easily optimized into
addss %xmm1, %xmm0
The first step is to understand why the compiler produces the
redundant instructions.
For the example above, the DAG sequence looks like this:
a0 : f32 = extract_vector_elt ( A, 0)
b0 : f32 = extract_vector_elt ( B, 0)
r0 : f32 = fadd a0, b0
result : v4f32 = insert_vector_elt ( A, r0, 0 )
(with A and B of type v4f32).
The 'insert_vector_elt' is custom lowered into either X86ISD::MOVSS or
X86ISD::INSERTPS depending on the target's SSE feature level.
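Roughly speaking, the choice between the two nodes is driven by the
SSE4.1 feature bit. The fragment below is only an illustration of that
check, not the actual code in X86ISelLowering.cpp, and the variable
names (Vec, Elt, NewV, dl) are made up:
/////////////////////////////////
// Illustrative only: lower a v4f32 insert_vector_elt at index 0 into
// either X86ISD::INSERTPS (SSE4.1 and later) or X86ISD::MOVSS.
// 'Vec' is the vector being updated, 'Elt' is the scalar operand
// already widened to v4f32.
SDValue NewV;
if (Subtarget->hasSSE41())
  NewV = DAG.getNode(X86ISD::INSERTPS, dl, MVT::v4f32, Vec, Elt,
                     DAG.getConstant(0, MVT::i8));
else
  NewV = DAG.getNode(X86ISD::MOVSS, dl, MVT::v4f32, Vec, Elt);
/////////////////////////////////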
To start, I checked whether this bug was simply caused by the lack of
specific TableGen patterns to match the sequence described above into
a single ADDSS instruction.
However, X86InstrSSE.td already defines X86::ADDSSrr as a commutable
SSE scalar FP add instruction (that works on the f32 ValueType).
X86::ADDSSrr is used to select 'fadd' nodes and it is translated to
'addss' in assembly.
At this stage, the MOVSS/INSERTPS is still required: the ADDSS only
produces the scalar f32 result, so the insert_vector_elt that merges
that result back into vector A still has to be selected separately.
I then started investigating the possibility of adding a pass that
runs at the 'postRegAlloc' stage.
Before register allocation it may not be safe to remove the redundant
MOVSSrr, because the TwoAddressInstruction pass may later decide to
commute the operands of the ADDSS/MULSS.
It is possible to write a pass that scans through each basic block in
a function looking for opportunities to fold instructions based on the
following patterns:
//////
B<def, tied1> = #NAME#SSrr B<kill, tied0>, A
A<def, tied1> = MOVSSrr A<kill, tied0>, B<kill>
==>
A<def, tied1> = #NAME#SSrr A<kill, tied0>, B<kill>
/////
/////
B<def, tied1> = #NAME#PSrr B<kill, tied0>, A
A<def, tied1> = MOVSSrr A<kill, tied0>, B<kill>
==>
A<def, tied1> = #NAME#SSrr A<kill, tied0>, B<kill>
/////
With 'NAME' in {ADD, SUB, MUL, DIV}.
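For the example at the top, pattern #1 would instantiate as follows
(with A = %XMM0 and B = %XMM1, matching the assembly shown earlier;
the register assignment is only for illustration):
/////
%XMM1<def, tied1> = ADDSSrr %XMM1<kill, tied0>, %XMM0
%XMM0<def, tied1> = MOVSSrr %XMM0<kill, tied0>, %XMM1<kill>
==>
%XMM0<def, tied1> = ADDSSrr %XMM0<kill, tied0>, %XMM1<kill>
/////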
I managed to create a very simple prototype pass that simplifies the
code according to pattern #1 (see the attached patch).
However, identifying patterns at the postRegAlloc stage is definitely
more complicated than matching patterns on DAGs: instructions are no
longer in SSA form; the pass must keep track of defs and kills and
leave the liveness info in a consistent state; some instructions may
have side effects; etc.
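For example, when the redundant MOVSSrr is erased, the kill flag it
carried for its source register has to be transferred to the surviving
ADDSS/MULSS, otherwise the liveness info becomes stale. A minimal
sketch of that kind of bookkeeping (the helper and its name are
hypothetical, not part of the attached patch):
/////////////////////////////////
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/Target/TargetRegisterInfo.h"
using namespace llvm;

// Hypothetical helper: 'Insert' is the MOVSSrr/INSERTPSrr about to be
// erased and 'Arith' is the commuted ADDSS/MULSS that becomes the last
// user of 'Reg'. addRegisterKilled finds the use of 'Reg' in 'Arith'
// and marks it <kill>.
static void transferKillFlag(MachineInstr *Arith,
                             const MachineInstr *Insert, unsigned Reg,
                             const TargetRegisterInfo *TRI) {
  if (Insert->killsRegister(Reg, TRI))
    Arith->addRegisterKilled(Reg, TRI);
}
/////////////////////////////////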
Last but not least, the register mapping implemented by the register
allocator and how instructions are scheduled may strongly affect the
ability of a pass like this to pattern match specific sequences of
instructions.
In conclusion, my questions are:
Would a patch that introduces a new MachineFunctionPass running at the
'postRegAlloc' stage be acceptable?
If not, is there a better way to fix this bug?
Thanks,
Andrea Di Biagio
SN Systems - Sony Computer Entertainment Group
-------------- next part --------------
Index: lib/Target/X86/X86.h
===================================================================
--- lib/Target/X86/X86.h (revision 196508)
+++ lib/Target/X86/X86.h (working copy)
@@ -47,6 +47,8 @@
///
FunctionPass *createX86FloatingPointStackifierPass();
+FunctionPass *createX86FoldRedundantInsertsPass();
+
/// createX86IssueVZeroUpperPass - This pass inserts AVX vzeroupper instructions
/// before each call to avoid transition penalty between functions encoded with
/// AVX and SSE.
Index: lib/Target/X86/X86TargetMachine.cpp
===================================================================
--- lib/Target/X86/X86TargetMachine.cpp (revision 196508)
+++ lib/Target/X86/X86TargetMachine.cpp (working copy)
@@ -197,6 +197,10 @@
bool X86PassConfig::addPostRegAlloc() {
addPass(createX86FloatingPointStackifierPass());
+
+ if (getOptLevel() != CodeGenOpt::None)
+ addPass(createX86FoldRedundantInsertsPass());
+
return true; // -print-machineinstr should print after this.
}
@@ -222,7 +226,6 @@
addPass(createX86FixupLEAs());
ShouldPrint = true;
}
-
return ShouldPrint;
}
Index: lib/Target/X86/CMakeLists.txt
===================================================================
--- lib/Target/X86/CMakeLists.txt (revision 196508)
+++ lib/Target/X86/CMakeLists.txt (working copy)
@@ -18,6 +18,7 @@
X86CodeEmitter.cpp
X86FastISel.cpp
X86FloatingPoint.cpp
+ X86FoldRedundantInserts.cpp
X86FrameLowering.cpp
X86ISelDAGToDAG.cpp
X86ISelLowering.cpp
Index: lib/Target/X86/X86FoldRedundantInserts.cpp
===================================================================
--- lib/Target/X86/X86FoldRedundantInserts.cpp (revision 0)
+++ lib/Target/X86/X86FoldRedundantInserts.cpp (revision 0)
@@ -0,0 +1,185 @@
+//===-- X86FoldRedundantInserts.cpp - Remove redundant vector inserts ------===//
+//
+// The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===----------------------------------------------------------------------===//
+//
+// This file defines a pass that searches for opportunities to fold
+// the following patterns:
+//
+// 1) B = #OPNAME#SSrr B, A
+//    A = #OPINSERT# A, B<kill>
+//    ==>
+//    A = #OPNAME#SSrr A, B<kill>
+//
+// 2) B = #OPNAME#PSrr B, A
+//    A = #OPINSERT# A, B<kill>
+//    ==>
+//    A = #OPNAME#SSrr A, B<kill>
+//
+// With #OPNAME# in { ADD, MUL, SUB, DIV }
+//      #OPINSERT# in { MOVSSrr, INSERTPSrr }
+//
+// FIXME: this is just a simple prototype. It only implements pattern #1.
+//
+//===----------------------------------------------------------------------===//
+
+#define DEBUG_TYPE "x86-codegen"
+#include "X86.h"
+#include "X86InstrInfo.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/Statistic.h"
+#include "llvm/CodeGen/MachineFunctionPass.h"
+#include "llvm/CodeGen/MachineRegisterInfo.h"
+#include "llvm/CodeGen/Passes.h"
+#include "llvm/Support/Debug.h"
+#include "llvm/Support/raw_ostream.h"
+#include "llvm/Target/TargetInstrInfo.h"
+#include "llvm/Target/TargetMachine.h"
+
+using namespace llvm;
+
+STATISTIC(NumDeletedInserts, "Number of inserts deleted");
+
+namespace {
+ class X86FoldRedundantInserts : public MachineFunctionPass {
+ const TargetRegisterInfo *TRI;
+ const TargetInstrInfo *TII;
+ MachineRegisterInfo *MRI;
+
+ public:
+ static char ID; // Pass identification, replacement for typeid
+ X86FoldRedundantInserts() : MachineFunctionPass(ID) { }
+
+ virtual void getAnalysisUsage(AnalysisUsage &AU) const {
+ AU.setPreservesCFG();
+ AU.addPreservedID(MachineLoopInfoID);
+ AU.addPreservedID(MachineDominatorsID);
+ MachineFunctionPass::getAnalysisUsage(AU);
+ }
+
+ virtual bool runOnMachineFunction(MachineFunction &MF);
+
+ virtual const char *getPassName() const {
+ return "X86 Remove Redundant Inserts";
+ }
+
+ private:
+ // Candidates - Keep track of potential candidates for removal.
+ DenseMap<unsigned, MachineInstr*> Candidates;
+
+ bool isValidINSERTPS(const MachineInstr &MI) const;
+ bool isValidMOVSS(const MachineInstr &MI) const;
+ bool isValidCandidate(const MachineInstr &MI) const;
+ bool SimplifyBlock(MachineBasicBlock &MBB);
+ };
+}
+
+char X86FoldRedundantInserts::ID = 0;
+
+FunctionPass *llvm::createX86FoldRedundantInsertsPass() {
+ return new X86FoldRedundantInserts();
+}
+
+bool X86FoldRedundantInserts::isValidINSERTPS(const MachineInstr &MI) const {
+ const MachineOperand &Op2 = MI.getOperand(2);
+ const MachineOperand &Op3 = MI.getOperand(3);
+
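+  // An INSERTPS immediate of zero copies element 0 of the second source
+  // into element 0 of the destination and zeroes nothing else, i.e. it
+  // has the same effect as a MOVSSrr.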
+ return ((Op3.getType() == MachineOperand::MO_Immediate) &&
+ (Op3.getImm() == 0) &&
+ (Op2.isKill() && !MRI->isReserved(Op2.getReg())));
+}
+
+bool X86FoldRedundantInserts::isValidMOVSS(const MachineInstr &MI) const {
+ const MachineOperand &MO = MI.getOperand(2);
+ unsigned Reg = MO.getReg();
+
+ return !MRI->isReserved(Reg) && MO.isKill();
+}
+
+bool X86FoldRedundantInserts::isValidCandidate(const MachineInstr &MI) const {
+ switch (MI.getOpcode()) {
+ default : return false;
+ case X86::MOVSSrr :
+ return isValidMOVSS(MI);
+ case X86::INSERTPSrr :
+ return isValidINSERTPS(MI);
+ }
+
+ llvm_unreachable(0);
+}
+
+bool X86FoldRedundantInserts::SimplifyBlock(MachineBasicBlock &MBB) {
+ bool Changed = false;
+ MachineBasicBlock::iterator I = MBB.end(), E = MBB.begin();
+
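+  // Scan the block bottom-up so that a candidate MOVSSrr/INSERTPSrr is
+  // recorded in 'Candidates' before the ADDSS/MULSS that feeds it is
+  // visited.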
+ while (I != E) {
+ --I;
+ if (I->hasUnmodeledSideEffects()) {
+ // Invalidate the Map.
+ Candidates.clear();
+ continue;
+ }
+
+ if (isValidCandidate(*I)) {
+ // Add this entry to the Candidates map.
+ unsigned RegDef = I->getOperand(0).getReg();
+ Candidates.insert(std::make_pair(RegDef, I));
+ continue;
+ }
+
+ // Simplify according to the pattern
+ // X<def,tied1> = #OPCODE#SSrr X<kill,tied0>, Y
+ // Y<def,tied1> = #INSERT#SSrr Y<kill,tied0>, X<kill>
+ // ==>
+ // Y<def,tied1> = #OPCODE#SSrr Y<kill,tied0>, X<kill>
+ if (I->getOpcode() == X86::ADDSSrr ||
+ I->getOpcode() == X86::MULSSrr) {
+ unsigned Def = I->getOperand(2).getReg();
+
+ DenseMap<unsigned, MachineInstr*>::iterator CI = Candidates.find(Def);
+ if (CI != Candidates.end()) {
+ MachineOperand &MO = I->getOperand(0);
+ MachineInstr *CMI = CI->second;
+
+ if (MO.getReg() == CMI->getOperand(2).getReg()) {
+ DEBUG(dbgs() << "Commuting: " << *I);
+ MachineInstr *NewMI = TII->commuteInstruction(I,false);
+ assert(NewMI && "failed to commute the operands!");
+
+ DEBUG(dbgs() << "NewMI is: " << *NewMI);
+ DEBUG(dbgs() << "Removing redundant insert: " << *CMI);
+ CMI->eraseFromParent();
+          NumDeletedInserts++;
+          Changed = true;
+ }
+ }
+
+ Candidates.erase(Def);
+ continue;
+ }
+
+ // FIXME: simplify more patterns and update the Candidates
+ // set based on what registers are clobbered by the current
+    // instruction instead of just clearing the map of candidates.
+ Candidates.clear();
+ }
+
+ return Changed;
+}
+
+bool X86FoldRedundantInserts::runOnMachineFunction(MachineFunction &MF) {
+ bool Changed = false;
+
+ TRI = MF.getTarget().getRegisterInfo();
+ TII = MF.getTarget().getInstrInfo();
+ MRI = &MF.getRegInfo();
+
+ for (MachineFunction::iterator I = MF.begin(), E = MF.end(); I != E; ++I)
+ Changed |= SimplifyBlock(*I);
+
+ return Changed;
+}
+