[LLVMdev] X86 - Help on fixing a poor code generation bug
Andrea Di Biagio
andrea.dibiagio at gmail.com
Thu Dec 5 07:35:19 PST 2013
Hi all,
I noticed that the x86 backend tends to emit unnecessary vector insert
instructions immediately after SSE scalar FP instructions like
addss/mulss.
For example:
/////////////////////////////////
__m128 foo(__m128 A, __m128 B) {
  return _mm_add_ss(A, B);
}
/////////////////////////////////
produces the sequence:
addss %xmm0, %xmm1
movss %xmm1, %xmm0
which could be easily optimized into
addss %xmm1, %xmm0
The first step is to understand why the compiler produces the
redundant instructions.
For the example above, the DAG sequence looks like this:
a0 : f32 = extract_vector_elt ( A, 0)
b0 : f32 = extract_vector_elt ( B, 0)
r0 : f32 = fadd a0, b0
result : v4f32 = insert_vector_elt ( A, r0, 0 )
(with A and B of type v4f32).
The 'insert_vector_elt' is custom lowered into either X86ISD::MOVSS or
X86ISD::INSERTPS depending on the target's SSE feature level.
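Roughly speaking, the choice between the two nodes is driven by the
SSE4.1 feature bit. The fragment below is only an illustration of that
check, not the actual code in X86ISelLowering.cpp, and the variable
names (Vec, Elt, NewV, dl) are made up:
/////////////////////////////////
// Illustrative only: lower a v4f32 insert_vector_elt at index 0 into
// either X86ISD::INSERTPS (SSE4.1 and later) or X86ISD::MOVSS.
// 'Vec' is the vector being updated, 'Elt' is the scalar operand
// already widened to v4f32.
SDValue NewV;
if (Subtarget->hasSSE41())
  NewV = DAG.getNode(X86ISD::INSERTPS, dl, MVT::v4f32, Vec, Elt,
                     DAG.getConstant(0, MVT::i8));
else
  NewV = DAG.getNode(X86ISD::MOVSS, dl, MVT::v4f32, Vec, Elt);
/////////////////////////////////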
To start, I checked whether this bug was simply caused by the lack of
specific TableGen patterns to match the sequence described above into
a single ADDSS instruction.
However, X86InstrSSE.td already defines X86::ADDSSrr as a commutable
SSE scalar FP add instruction (that works on the f32 ValueType).
X86::ADDSSrr is used to select 'fadd' nodes and it is translated to
'addss' in assembly.
At this stage, the MOVSS/INSERTPS is still required: the ADDSS only
produces the scalar f32 result, so the insert_vector_elt that merges
that result back into vector A still has to be selected separately.
I then started investigating the possibility of adding a pass that
runs at the 'postRegAlloc' stage.
Before register allocation it may not be safe to remove the redundant
MOVSSrr, because the TwoAddressInstruction pass may later decide to
commute the operands of the ADDSS/MULSS.
It is possible to write a pass that scans through each basic block in
a function looking for opportunities to fold instructions based on the
following patterns:
//////
B<def, tied1> = #NAME#SSrr B<kill, tied0>, A
A<def, tied1> = MOVSSrr A<kill, tied0>, B<kill>
==>
A<def, tied1> = #NAME#SSrr A<kill, tied0>, B<kill>
/////
/////
B<def, tied1> = #NAME#PSrr B<kill, tied0>, A
A<def, tied1> = MOVSSrr A<kill, tied0>, B<kill>
==>
A<def, tied1> = #NAME#SSrr A<kill, tied0>, B<kill>
/////
With 'NAME' in {ADD, SUB, MUL, DIV}.
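For the example at the top, pattern #1 would instantiate as follows
(with A = %XMM0 and B = %XMM1, matching the assembly shown earlier;
the register assignment is only for illustration):
/////
%XMM1<def, tied1> = ADDSSrr %XMM1<kill, tied0>, %XMM0
%XMM0<def, tied1> = MOVSSrr %XMM0<kill, tied0>, %XMM1<kill>
==>
%XMM0<def, tied1> = ADDSSrr %XMM0<kill, tied0>, %XMM1<kill>
/////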
I managed to create a very simple prototype pass that simplifies the
code according to pattern #1 (see the attached patch).
However, identifying patterns at the postRegAlloc stage is definitely
more complicated than matching patterns on DAGs: instructions are no
longer in SSA form; the pass must keep track of defs and kills and
leave the liveness info in a consistent state; some instructions may
have side effects; etc.
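For example, when the redundant MOVSSrr is erased, the kill flag it
carried for its source register has to be transferred to the surviving
ADDSS/MULSS, otherwise the liveness info becomes stale. A minimal
sketch of that kind of bookkeeping (the helper and its name are
hypothetical, not part of the attached patch):
/////////////////////////////////
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/Target/TargetRegisterInfo.h"
using namespace llvm;

// Hypothetical helper: 'Insert' is the MOVSSrr/INSERTPSrr about to be
// erased and 'Arith' is the commuted ADDSS/MULSS that becomes the last
// user of 'Reg'. addRegisterKilled finds the use of 'Reg' in 'Arith'
// and marks it <kill>.
static void transferKillFlag(MachineInstr *Arith,
                             const MachineInstr *Insert, unsigned Reg,
                             const TargetRegisterInfo *TRI) {
  if (Insert->killsRegister(Reg, TRI))
    Arith->addRegisterKilled(Reg, TRI);
}
/////////////////////////////////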
Last but not least, the register mapping implemented by the register
allocator and how instructions are scheduled may strongly affect the
ability of a pass like this to pattern match specific sequences of
instructions.
In conclusion, my questions are:
Would a patch that introduces a new MachineFunctionPass running at the
'postRegAlloc' stage be acceptable?
If not, is there a better way to fix this bug?
Thanks,
Andrea Di Biagio
SN Systems - Sony Computer Entertainment Group
-------------- next part --------------
Index: lib/Target/X86/X86.h
===================================================================
--- lib/Target/X86/X86.h (revision 196508)
+++ lib/Target/X86/X86.h (working copy)
@@ -47,6 +47,8 @@
///
FunctionPass *createX86FloatingPointStackifierPass();
+FunctionPass *createX86FoldRedundantInsertsPass();
+
/// createX86IssueVZeroUpperPass - This pass inserts AVX vzeroupper instructions
/// before each call to avoid transition penalty between functions encoded with
/// AVX and SSE.
Index: lib/Target/X86/X86TargetMachine.cpp
===================================================================
--- lib/Target/X86/X86TargetMachine.cpp (revision 196508)
+++ lib/Target/X86/X86TargetMachine.cpp (working copy)
@@ -197,6 +197,10 @@
bool X86PassConfig::addPostRegAlloc() {
addPass(createX86FloatingPointStackifierPass());
+
+ if (getOptLevel() != CodeGenOpt::None)
+ addPass(createX86FoldRedundantInsertsPass());
+
return true; // -print-machineinstr should print after this.
}
@@ -222,7 +226,6 @@
addPass(createX86FixupLEAs());
ShouldPrint = true;
}
-
return ShouldPrint;
}
Index: lib/Target/X86/CMakeLists.txt
===================================================================
--- lib/Target/X86/CMakeLists.txt (revision 196508)
+++ lib/Target/X86/CMakeLists.txt (working copy)
@@ -18,6 +18,7 @@
X86CodeEmitter.cpp
X86FastISel.cpp
X86FloatingPoint.cpp
+ X86FoldRedundantInserts.cpp
X86FrameLowering.cpp
X86ISelDAGToDAG.cpp
X86ISelLowering.cpp
Index: lib/Target/X86/X86FoldRedundantInserts.cpp
===================================================================
--- lib/Target/X86/X86FoldRedundantInserts.cpp (revision 0)
+++ lib/Target/X86/X86FoldRedundantInserts.cpp (revision 0)
@@ -0,0 +1,185 @@
+//===-- X86FoldRedundantInserts.cpp - Remove redundant vector inserts ------===//
+//
+// The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===----------------------------------------------------------------------===//
+//
+// This file defines a pass that searches for opportunities to fold
+// the following patterns:
+//
+// 1) B = #OPNAME#SSrr B, A
+//    A = #OPINSERT# A, B<kill>
+//    ==>
+//    A = #OPNAME#SSrr A, B<kill>
+//
+// 2) B = #OPNAME#PSrr B, A
+//    A = #OPINSERT# A, B<kill>
+//    ==>
+//    A = #OPNAME#SSrr A, B<kill>
+//
+// With #OPNAME# in { ADD, MUL, SUB, DIV }
+//      #OPINSERT# in { MOVSSrr, INSERTPSrr }
+//
+// FIXME: this is just a simple prototype. It only implements pattern #1.
+//
+//===----------------------------------------------------------------------===//
+
+#define DEBUG_TYPE "x86-codegen"
+#include "X86.h"
+#include "X86InstrInfo.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/Statistic.h"
+#include "llvm/CodeGen/MachineFunctionPass.h"
+#include "llvm/CodeGen/MachineRegisterInfo.h"
+#include "llvm/CodeGen/Passes.h"
+#include "llvm/Support/Debug.h"
+#include "llvm/Support/raw_ostream.h"
+#include "llvm/Target/TargetInstrInfo.h"
+#include "llvm/Target/TargetMachine.h"
+
+using namespace llvm;
+
+STATISTIC(NumDeletedInserts, "Number of inserts deleted");
+
+namespace {
+ class X86FoldRedundantInserts : public MachineFunctionPass {
+ const TargetRegisterInfo *TRI;
+ const TargetInstrInfo *TII;
+ MachineRegisterInfo *MRI;
+
+ public:
+ static char ID; // Pass identification, replacement for typeid
+ X86FoldRedundantInserts() : MachineFunctionPass(ID) { }
+
+ virtual void getAnalysisUsage(AnalysisUsage &AU) const {
+ AU.setPreservesCFG();
+ AU.addPreservedID(MachineLoopInfoID);
+ AU.addPreservedID(MachineDominatorsID);
+ MachineFunctionPass::getAnalysisUsage(AU);
+ }
+
+ virtual bool runOnMachineFunction(MachineFunction &MF);
+
+ virtual const char *getPassName() const {
+ return "X86 Remove Redundant Inserts";
+ }
+
+ private:
+ // Candidates - Keep track of potential candidates for removal.
+ DenseMap<unsigned, MachineInstr*> Candidates;
+
+ bool isValidINSERTPS(const MachineInstr &MI) const;
+ bool isValidMOVSS(const MachineInstr &MI) const;
+ bool isValidCandidate(const MachineInstr &MI) const;
+ bool SimplifyBlock(MachineBasicBlock &MBB);
+ };
+}
+
+char X86FoldRedundantInserts::ID = 0;
+
+FunctionPass *llvm::createX86FoldRedundantInsertsPass() {
+ return new X86FoldRedundantInserts();
+}
+
+bool X86FoldRedundantInserts::isValidINSERTPS(const MachineInstr &MI) const {
+ const MachineOperand &Op2 = MI.getOperand(2);
+ const MachineOperand &Op3 = MI.getOperand(3);
+
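+  // An INSERTPS immediate of zero copies element 0 of the second source
+  // into element 0 of the destination and zeroes nothing else, i.e. it
+  // has the same effect as a MOVSSrr.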
+ return ((Op3.getType() == MachineOperand::MO_Immediate) &&
+ (Op3.getImm() == 0) &&
+ (Op2.isKill() && !MRI->isReserved(Op2.getReg())));
+}
+
+bool X86FoldRedundantInserts::isValidMOVSS(const MachineInstr &MI) const {
+ const MachineOperand &MO = MI.getOperand(2);
+ unsigned Reg = MO.getReg();
+
+ return !MRI->isReserved(Reg) && MO.isKill();
+}
+
+bool X86FoldRedundantInserts::isValidCandidate(const MachineInstr &MI) const {
+ switch (MI.getOpcode()) {
+ default : return false;
+ case X86::MOVSSrr :
+ return isValidMOVSS(MI);
+ case X86::INSERTPSrr :
+ return isValidINSERTPS(MI);
+ }
+
+ llvm_unreachable(0);
+}
+
+bool X86FoldRedundantInserts::SimplifyBlock(MachineBasicBlock &MBB) {
+ bool Changed = false;
+ MachineBasicBlock::iterator I = MBB.end(), E = MBB.begin();
+
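+  // Scan the block bottom-up so that a candidate MOVSSrr/INSERTPSrr is
+  // recorded in 'Candidates' before the ADDSS/MULSS that feeds it is
+  // visited.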
+ while (I != E) {
+ --I;
+ if (I->hasUnmodeledSideEffects()) {
+ // Invalidate the Map.
+ Candidates.clear();
+ continue;
+ }
+
+ if (isValidCandidate(*I)) {
+ // Add this entry to the Candidates map.
+ unsigned RegDef = I->getOperand(0).getReg();
+ Candidates.insert(std::make_pair(RegDef, I));
+ continue;
+ }
+
+ // Simplify according to the pattern
+ // X<def,tied1> = #OPCODE#SSrr X<kill,tied0>, Y
+ // Y<def,tied1> = #INSERT#SSrr Y<kill,tied0>, X<kill>
+ // ==>
+ // Y<def,tied1> = #OPCODE#SSrr Y<kill,tied0>, X<kill>
+ if (I->getOpcode() == X86::ADDSSrr ||
+ I->getOpcode() == X86::MULSSrr) {
+ unsigned Def = I->getOperand(2).getReg();
+
+ DenseMap<unsigned, MachineInstr*>::iterator CI = Candidates.find(Def);
+ if (CI != Candidates.end()) {
+ MachineOperand &MO = I->getOperand(0);
+ MachineInstr *CMI = CI->second;
+
+ if (MO.getReg() == CMI->getOperand(2).getReg()) {
+ DEBUG(dbgs() << "Commuting: " << *I);
+ MachineInstr *NewMI = TII->commuteInstruction(I,false);
+ assert(NewMI && "failed to commute the operands!");
+
+ DEBUG(dbgs() << "NewMI is: " << *NewMI);
+ DEBUG(dbgs() << "Removing redundant insert: " << *CMI);
+ CMI->eraseFromParent();
+          NumDeletedInserts++;
+          Changed = true;
+ }
+ }
+
+ Candidates.erase(Def);
+ continue;
+ }
+
+ // FIXME: simplify more patterns and update the Candidates
+ // set based on what registers are clobbered by the current
+    // instruction instead of just clearing the map of candidates.
+ Candidates.clear();
+ }
+
+ return Changed;
+}
+
+bool X86FoldRedundantInserts::runOnMachineFunction(MachineFunction &MF) {
+ bool Changed = false;
+
+ TRI = MF.getTarget().getRegisterInfo();
+ TII = MF.getTarget().getInstrInfo();
+ MRI = &MF.getRegInfo();
+
+ for (MachineFunction::iterator I = MF.begin(), E = MF.end(); I != E; ++I)
+ Changed |= SimplifyBlock(*I);
+
+ return Changed;
+}
+