[LLVMdev] [Proposal] Parallelize post-IPO stage.

Fri Jul 12 15:49:06 PDT 2013

Hi, There:

   This is a proposal for parallelizing the post-IPO stage. See below for
details.

   I also attach a toy-grade, rudimentary implementation, which can be used
to illustrate some of the concepts here. This patch is not going to be
committed.

   Unfortunately, this weekend I will be too busy to read emails. Please do
not construe a delayed response as being rude :-).

Thanks a lot in advance for your time and insightful comments!

Shuxin


The proposal
------------
   It is organized as follows:
    1) background info; if you have heard of "/usr/bin/ls", please skip it
    2) the motivation for parallelizing the post-IPO stage
    3) how to parallelize the post-IPO stage
    4) the linker problems
    5) the toy-grade rudimentary implementation
    6) misc

1. Some background
==================

   The Interprocedural-optimization compilation, aka IPO or IPA, typically
consists of three stages:

   S1) pre-IPO
     Each function goes through some analysis and not-very-aggressive
     optimizations. Some information is collected during this stage and
     passed to the IPO stage. This info is usually called summary info.

     The result of this stage is "fake-objects": binary files that use
     some known object format to encapsulate the IR as well as the summary
     info, along with other stuff.

   S2) IPO
     The compiler works with the linker to resolve and merge symbols in the
     "fake-objects".

     Then interprocedural analyses (IPA) are invoked, working either on the
     summary info or directly on the IR.

     Interprocedural optimizations (IPO) are then performed based on the IPA
     results.

     In some compilers, IPA and IPO are separated. One reason is that many
     IPAs can be conducted directly on the concise summary info, while many
     IPOs need to load the IR and bulky annotations/metadata into memory.

   S3) post-IPO
     Typically consists of Loop-Nest-Opt, Scalar Opt, Code-Gen, etc. While
     these are intra-procedural analyses/optimizers, they may directly
     benefit from the info collected in the IPO stage and passed down the
     road.

   LLVM collectively calls S2 and S3 "LTO CodeGen", which is very confusing.

2. Why parallelize post-IPO stage
==============================

   R1) To improve scalability.
     It is quite obvious that we are not able to put everything about a
     monster program in memory at once.

     Even if you can afford an expensive computer, the address space of a
     single compiler process cannot accommodate a monster program.

   R2) To take advantage of ample HW resources to shorten compile time.
   R3) To make debugging a lot easier.
     One can triage problems in a much smaller partition rather than in
     the huge monster program.

   This proposal is not able to achieve goal R1 at this moment, because
during the IPO stage the compiler currently brings everything into memory
at once.

3. How to parallelize post-IPO stage
====================================

   From 5,000 feet, the concept is very simple:
    step 1) divide the merged IR into small pieces,
    step 2) compile each of these pieces independently, and
    step 3) feed the objects of each piece back to the linker, to be linked
            into an executable or a dynamic lib.

   Section 3.1 through 3.3 describe these three steps respectively.

3.1. Partitioning
-----------------
   Partitioning cuts reasonably-sized chunks out of the big merged IR.
It roughly consists of two steps: 1) determining the partition scheme, which
is the relatively easy step, and 2) physically scooping the partition out of
the merged IR, which is much more involved.

3.1.1 Figure out Partition scheme
----------------------------------
   We randomly pick some functions and put them in a partition.
It would be nice to perform some optimization at this moment. One opt
I have in mind is to reorder functions in order to reduce the working set
and improve locality.

   Unfortunately, this opt seems to be a bit blind at this time, because
     - the CallGraph is not annotated with estimated or profiled frequency.
     - some linkers don't respect the order. It seems they just
       remember the function order of the pristine input obj/fake-obj,
       and enforce this order at the final link (link-exec/shared-lib) stage.

   Anyway, I try to ignore all these problems and perform partitioning
via the following steps. Maybe we will have some luck on some platforms:

   o. DFS the call-graph, ignoring self-recursive edges. If freq is
      available, prioritize the edges (i.e. the corresponding call-sites)
      such that frequent edges are visited first.

   o. Cut the DFS spanning tree obtained from the previous step bottom-up.
      Each cut/partition contains a reasonable # of functions, and the
      aggregate size of the functions in the partition should not exceed a
      predefined threshold.

   o. Repeat the previous step until the call-graph's DFS spanning tree
      is empty.
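
   The three steps above can be sketched on a toy call graph as follows.
This is only an illustration under stated assumptions, not the code in the
attached patch: ToyCallGraph, ToyPartitioner and partitionCallGraph are
made-up names, functions are plain integers, and callee lists are assumed
to be sorted most-frequent-first already. It only covers functions that are
reachable from the given root.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for the call graph: Callees[F] lists the functions F calls,
// assumed to be ordered most-frequent-first when frequencies are known.
struct ToyCallGraph {
  std::vector<std::vector<int>> Callees;
  std::vector<std::size_t> Size; // per-function size estimate
};

class ToyPartitioner {
public:
  ToyPartitioner(const ToyCallGraph &G, std::size_t MaxFuncs,
                 std::size_t MaxSize)
      : G(G), MaxFuncs(MaxFuncs), MaxSize(MaxSize),
        Visited(G.Callees.size(), false) {
    assert(G.Callees.size() == G.Size.size() && "one size per function");
  }

  std::vector<std::vector<int>> run(int Root) {
    visit(Root);
    if (!Stack.empty()) // spanning tree exhausted: flush what is left
      Parts.push_back(Stack);
    return Parts;
  }

private:
  struct Pending {
    std::size_t Funcs = 0, Size = 0;
  };

  // Merge Sub into P. If the sum would break a threshold, first cut the
  // older pending functions, Stack[Mark, Mark + P.Funcs), into a partition.
  void absorb(Pending &P, Pending Sub, std::size_t Mark) {
    bool Fits =
        P.Funcs + Sub.Funcs <= MaxFuncs && P.Size + Sub.Size <= MaxSize;
    if (P.Funcs != 0 && !Fits) {
      auto First = Stack.begin() + static_cast<std::ptrdiff_t>(Mark);
      auto Last = First + static_cast<std::ptrdiff_t>(P.Funcs);
      Parts.emplace_back(First, Last);
      Stack.erase(First, Last);
      P = Pending();
    }
    P.Funcs += Sub.Funcs;
    P.Size += Sub.Size;
  }

  // Post-order DFS; returns what this subtree still has pending on Stack.
  Pending visit(int F) {
    Visited[F] = true;
    const std::size_t Mark = Stack.size();
    Pending P;
    for (int Callee : G.Callees[F]) {
      if (Visited[Callee]) // also skips self-recursive edges
        continue;
      absorb(P, visit(Callee), Mark);
    }
    Stack.push_back(F);
    Pending Self;
    Self.Funcs = 1;
    Self.Size = G.Size[F];
    absorb(P, Self, Mark);
    return P;
  }

  const ToyCallGraph &G;
  std::size_t MaxFuncs, MaxSize;
  std::vector<bool> Visited;
  std::vector<int> Stack;              // functions not yet in a partition
  std::vector<std::vector<int>> Parts; // emitted partitions
};

std::vector<std::vector<int>> partitionCallGraph(const ToyCallGraph &G,
                                                 int Root,
                                                 std::size_t MaxFuncs,
                                                 std::size_t MaxSize) {
  return ToyPartitioner(G, MaxFuncs, MaxSize).run(Root);
}
```

   A single function larger than MaxSize still ends up in a partition of
its own in this sketch; the threshold can only be honoured for groups.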

3.1.2 Partition transformation
------------------------------

   This is a bit involved. There are a bunch of problems we have to tackle.
   1) When the use/def of a symbol are separated into different modules,
      its attributes, like linkage and visibility, need to be changed
      as well.

       [Example 1] If a symbol is flagged as "internal" to the module where
      it is defined, the linkage needs to be changed to "internal"
      to the executable/lib being compiled.

       [Example 2] For compile-time constants, the initialized value
      needs to be cloned to the partitions where it is referenced.
      The rationale is to let the post-IPO passes take advantage
      of the initialized value to squeeze out some performance.

       In order not to bloat the code size, the cloned constant should
      be marked "don't emit". [end of eg2]

        Being able to precisely update symbols' attributes is not only
      vital to correctness; it has a significant impact on
      performance as well.

        I have not yet investigated this issue thoroughly. My rudimentary
      implementation simply flags a symbol "external" when its
      use/def are separated into different modules. I believe this is one
      of the most difficult parts of this work. I guess it is going to
      take a long time to become stable.

   2) In order to compile each partition in a separate thread (see
      Section 3.2), we have to put the partitions in distinct LLVMContexts.

      I could be wrong, but I cannot find any code that is able to
      perform function cloning across LLVMContexts.

      My workaround in the patch is to perform function cloning in
      one LLVMContext (but in a different Module, of course), then
      save the module to a disk file, and load it into memory using a
      new LLVMContext.

      It is a bit circuitous and expensive.

      One random observation:
        Currently, function-scoped static variables are considered
      "global variables". When cloning a function with static variables,
      the compiler has no idea whether the static variables are used only by
      the function being cloned, and hence separates the function
      and the variables.

        I guess it would be nice if we organized symbols by their scope
      instead of their lifetime. It would be convenient for this situation.
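
   To make problem 1) concrete, here is a minimal sketch of the crude
"flag it external when use and def are separated" rule. It uses a toy symbol
table rather than LLVM's GlobalValue; ToySymbol, ToyLinkage and
promoteSplitSymbols are hypothetical names, and the sketch says nothing
about visibility or about cloning constant initializers.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

enum class ToyLinkage { Internal, External };

struct ToySymbol {
  std::string Name;
  ToyLinkage Linkage;
  int DefPartition;            // partition holding the definition
  std::set<int> UsePartitions; // partitions containing a reference
};

// Promote every internal symbol that is referenced from a partition other
// than the one defining it. Returns the number of promoted symbols.
std::size_t promoteSplitSymbols(std::vector<ToySymbol> &Symbols) {
  std::size_t NumPromoted = 0;
  for (ToySymbol &S : Symbols) {
    assert(S.DefPartition >= 0 && "every symbol needs a defining partition");
    bool UsedElsewhere = false;
    for (int P : S.UsePartitions)
      if (P != S.DefPartition)
        UsedElsewhere = true;
    if (UsedElsewhere && S.Linkage == ToyLinkage::Internal) {
      S.Linkage = ToyLinkage::External; // now visible to other partitions
      ++NumPromoted;
    }
  }
  return NumPromoted;
}
```

   Symbols that stay within one partition keep their internal linkage, so
the post-IPO passes can still optimize them aggressively.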

3.2 Compile partitions independently
--------------------------------------

   There are two camps: one camp advocates compiling partitions via
multi-process, the other favors multi-thread.

   Inside the Apple compiler teams, I'm the only one who belongs to the 1st
camp. I think that while multi-proc sounds a bit red-neck, it has its
advantages for this purpose, and while multi-thread is certainly more
eye-popping, it has its advantages as well.

   The advantages of multi-proc are:
    1) It is easier to implement: each process runs in its own address
       space, so we don't need to worry that they interfere with each other.

    2) Huge, if not unlimited, address space.

   The disadvantage is that it's expensive. But I guess the cost is
   almost negligible compared to the overall IPO compilation.

   The advantages of multi-thread I can imagine are:
    1) it sounds fancy
    2) it is light-weight
    3) inter-thread communication is easier than IPC.

   Its disadvantages are:
    1) Oftentimes we will come across race conditions, and it takes an
       awfully long time to figure them out. While the code is supposed
       to be multi-thread safe, we might miss some tricky case.
       Trouble-shooting race conditions is a nightmare.

    2) Small address space. This is a big problem if the compiler
       is built 32-bit. In that case, the compiler is not able to bring
       lots of stuff into memory even if the HW does provide ample mem.

    3) The thread-safe run-time lib is more expensive.
       I once linked a compiler using -lpthread (I did not have to) on a
       UNIX platform, and saw the compiler slow down by about 1/3.

   I'm not able to convince the folks in the other camp, and neither are
they able to convince me. I have decided to implement both. Fortunately,
this part is not difficult; it seems rather easy to crank one out within a
short period of time. It would be interesting to compare them side-by-side
and see which camp loses :-). On the other hand, if we run into
race-condition problems, we can choose the multi-proc version as a fall-back.
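
   For the multi-proc flavor, a driver could look roughly like the sketch
below. It assumes a POSIX system with /bin/sh; runInParallel is a
hypothetical helper, and the command strings (for example one llc invocation
per partition) are supplied by the caller. It starts all children at once
and does not throttle them.

```cpp
#include <cassert>
#include <string>
#include <vector>

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Runs every command in its own child process (via /bin/sh -c) and returns
// the number of commands that could not be started or exited non-zero.
int runInParallel(const std::vector<std::string> &Commands) {
  std::vector<pid_t> Children;
  int Failures = 0;

  for (const std::string &Cmd : Commands) {
    assert(!Cmd.empty() && "empty command");
    pid_t Pid = fork();
    if (Pid < 0) { // fork failed; count it and keep going
      ++Failures;
      continue;
    }
    if (Pid == 0) { // child: replace ourselves with the shell
      execl("/bin/sh", "sh", "-c", Cmd.c_str(), static_cast<char *>(nullptr));
      _exit(127); // only reached if execl itself failed
    }
    Children.push_back(Pid);
  }

  // Parent: wait for every child and inspect its exit status.
  for (pid_t Pid : Children) {
    int Status = 0;
    if (waitpid(Pid, &Status, 0) < 0 || !WIFEXITED(Status) ||
        WEXITSTATUS(Status) != 0)
      ++Failures;
  }
  return Failures;
}
```

   Each child has its own address space, so a crash or a leak in one
partition's compile cannot corrupt the others.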

   Regardless of which tech is used to compile partitions independently, in
order to judiciously and adaptively choose an appropriate parallel-factor,
the compiler certainly needs a lib which is able to figure out the load the
entire system is under. I don't know whether such a magic lib exists or not.
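
   Short of such a lib, one crude heuristic is to subtract the 1-minute load
average from the number of hardware threads. The sketch below assumes a
platform that provides getloadavg() in <stdlib.h> (a BSD/glibc extension,
not ISO C++); chooseParallelFactor and chooseParallelFactorForThisHost are
hypothetical names, and the policy itself is only a guess at what
"adaptive" could mean here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib> // getloadavg(): BSD/glibc extension
#include <thread>

// Pure policy: how many partitions to compile at once, given the number of
// hardware threads, the 1-minute load average and the partition count.
unsigned chooseParallelFactor(unsigned HwThreads, double Load1Min,
                              unsigned NumPartitions) {
  assert(NumPartitions > 0 && "nothing to compile");
  if (HwThreads == 0) // hardware_concurrency() may return 0 when unknown
    HwThreads = 1;
  double Busy = std::max(0.0, Load1Min);
  unsigned Idle =
      Busy >= HwThreads ? 1u : static_cast<unsigned>(HwThreads - Busy);
  Idle = std::max(1u, Idle); // always make some progress
  return std::min(Idle, NumPartitions);
}

// Same policy, fed with measurements taken from the current host.
unsigned chooseParallelFactorForThisHost(unsigned NumPartitions) {
  double Load[1] = {0.0};
  if (getloadavg(Load, 1) != 1)
    Load[0] = 0.0; // load unknown: assume an idle machine
  return chooseParallelFactor(std::thread::hardware_concurrency(), Load[0],
                              NumPartitions);
}
```

   Keeping the policy separate from the measurement makes it easy to try
other inputs (free memory, for instance) later.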

4. The tale of two kinds of linker
==================================

   As far as I can tell, llvm supports two kinds of linker for its IPO
compilation, and the support is embodied by two sets of APIs/interfaces.

  o. Interface 1, the stuff named lto_xxxx().
  o. GNU gold interface.
     The compiler interacts with GNU gold via the adapter implemented
     in tools/gold/gold-plugin.cpp.

     This adapter calls interface 1 to control the IPO process.
     It does not have to call the interface APIs; I think it is definitely
     ok for it to call internal functions.

   The compiler used to generate a single object file from the merged
IR; now it will generate multiple of them, one for each partition.

   So, interface 1 is *NOT* sufficient any more.
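
   In the attached patch, compile_to_files() hands the object paths back as
one buffer: every path is terminated by a NUL, and an extra NUL ends the
list. A linker-side consumer could split such a buffer roughly as follows;
splitPathList is a hypothetical helper and not part of any existing
interface.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Splits a buffer of the form "path1\0path2\0\0" into individual paths.
std::vector<std::string> splitPathList(const char *List) {
  assert(List && "null path list");
  std::vector<std::string> Paths;
  while (*List != '\0') {            // an empty string marks the end
    Paths.emplace_back(List);        // copies up to the next NUL
    List += Paths.back().size() + 1; // skip the path and its terminator
  }
  return Paths;
}
```

   The encoding cannot represent an empty path, which is fine for object
file names.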

   For gold linker users, it is easy to make them happy just by
hacking the adapter to inform the linker of the new input object
files. This is done transparently; the users don't need to install a new ld.

   For those systems which invoke an ld interacting with libLTO.{so,dylib},
the ld has to accept the new APIs I added to interface 1 in order to
enable the new functionality. Or maybe we can invoke
'/the/path/to/ld -r *.o -o merged.o' and feed merged.o to the linker (this
would keep the interface intact)? Unfortunately, it does not work at all:
how can I know the path to the ld? libLTO.{so,dylib} is invoked as a plugin;
it cannot see the argv. How about hacking around this by adding a nasty flag
pointing to the right ld? Well, it works. However, I don't believe many
people would like to do it this way, which means I lose a huge number of
"QA" who are working hard for this compiler.

   What's wrong with interface 1? The ld side is more active than
the compiler side; however, in concept the IPO is driven by the compiler
side. This means this interface keeps changing over time.

   In contrast, the gold interface (as I reverse-engineered it from the
adapter code) is more symbol-centric, taking little IPO-thing into account.
That interface is simple and stable.

5. The rudimentary implementation
=================================

   I got it working for bzip2 in cpu2kint yesterday. bzip2 is a "tiny"
program, so I intentionally lowered the partition size to get 3 partitions.
There are no comments in the code, and it definitely needs a rewrite. I
only checked correctness (with the ref input), and I did not measure how
much it degrades performance (due to the problem I have not had a chance
to tackle; see section 3.1.2, the symbol attribute stuff).

   The control flow basically is:
    1) Add a module pass to the IPO pass-manager, and figure
       out the partition scheme.

    2) Physically partition the merged module.
       The IR and the obj of each partition are placed in a new dir,
       "llvmipo" by default.

       --
       ls llvmipo/
       Makefile  merged  part1.bc  part1.o  part2.bc  part2.o  part3.bc  part3.o
       --

    3) For demo purposes, I drive the post-IPO stage via a makefile, which
       encapsulates the hacks and other nasty stuff.

       NOTE that the post-IPO stage in my hack contains only the CodeGen
       pass. We need to reorganize
       PassManagerBuilder::populateLTOPassManager(), which intermingles
       IPO passes with intra-proc scalar passes: we need to separate them
       and move the intra-proc scalar passes to the post-IPO stage.


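      .PHONY: all

      BC = part1.bc part2.bc part3.bc
      OBJ = ${BC:.bc=.o}

      all : merged

      # Compile each partition's bitcode to a native object.
      # (Recipe lines must start with a TAB in a real Makefile.)
      %.o : %.bc
      	$(HOME)/tmp/lto.llc -filetype=obj $+ -o $@

      # Merge the per-partition objects into one relocatable object.
      merged : $(OBJ)
      	/usr/bin/ld $+ -r -o $@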

    4) As the Makefile suggests, the *.o of the partitions are linked into
       a single obj, "merged", which is fed back to the linker.
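
   A makefile like the one above is easy to generate from the list of
partition names. The sketch below is not the generateMakefile() from the
patch; makePostIPOMakefile and its parameters are hypothetical, and the llc
and ld paths are whatever the caller passes in.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Builds the text of a makefile that compiles <name>.bc to <name>.o for
// every partition and then merges the objects with "ld -r".
std::string makePostIPOMakefile(const std::vector<std::string> &PartNames,
                                const std::string &LlcPath,
                                const std::string &LdPath) {
  assert(!PartNames.empty() && "need at least one partition");

  std::string BC; // space-separated list of bitcode files
  for (const std::string &Name : PartNames) {
    if (!BC.empty())
      BC += ' ';
    BC += Name + ".bc";
  }

  std::string M;
  M += ".PHONY: all\n";
  M += "BC = " + BC + "\n";
  M += "OBJ = ${BC:.bc=.o}\n";
  M += "all: merged\n";
  M += "%.o: %.bc\n";
  M += "\t" + LlcPath + " -filetype=obj $< -o $@\n"; // recipe needs a TAB
  M += "merged: $(OBJ)\n";
  M += "\t" + LdPath + " -r $+ -o $@\n";
  return M;
}
```

   Since the per-partition rules are independent, running make with -j
should let it schedule the llc invocations in parallel.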


6. Miscellaneous
================
   Will partitioning degrade performance in theory? I think it depends on
the definition of performance. If performance means execution time, I guess
it does not. However, if performance includes code size, I think it may have
some negative impact. Following are a few scenarios:

    - constants generated by the post-IPO passes are not shared across
      partitions
    - dead funcs may be detected during the post-IPO stage, and they may
      not be deleted.




-------------- next part --------------
Index: tools/lto/LTOCodeGenerator.cpp
===================================================================

--- tools/lto/LTOCodeGenerator.cpp	(revision 186109)
+++ tools/lto/LTOCodeGenerator.cpp	(working copy)
@@ -17,8 +17,10 @@
 #include "llvm/ADT/StringExtras.h"
 #include "llvm/Analysis/Passes.h"
 #include "llvm/Analysis/Verifier.h"
+#include "llvm/Analysis/CallGraph.h"
 #include "llvm/Bitcode/ReaderWriter.h"
 #include "llvm/Config/config.h"
+#include "llvm/ADT/SetVector.h"
 #include "llvm/IR/Constants.h"
 #include "llvm/IR/DataLayout.h"
 #include "llvm/IR/DerivedTypes.h"
@@ -29,16 +31,19 @@
 #include "llvm/MC/MCContext.h"
 #include "llvm/MC/SubtargetFeature.h"
 #include "llvm/PassManager.h"
+#include "llvm/IRReader/IRReader.h"
 #include "llvm/Support/CommandLine.h"
 #include "llvm/Support/FileSystem.h"
 #include "llvm/Support/FormattedStream.h"
 #include "llvm/Support/Host.h"
 #include "llvm/Support/MemoryBuffer.h"
+#include "llvm/Support/Program.h"
 #include "llvm/Support/Signals.h"
 #include "llvm/Support/TargetRegistry.h"
 #include "llvm/Support/TargetSelect.h"
 #include "llvm/Support/ToolOutputFile.h"
 #include "llvm/Support/system_error.h"
+#include "llvm/Support/SourceMgr.h"
 #include "llvm/Target/Mangler.h"
 #include "llvm/Target/TargetMachine.h"
 #include "llvm/Target/TargetOptions.h"
@@ -46,8 +51,11 @@
 #include "llvm/Transforms/IPO.h"
 #include "llvm/Transforms/IPO/PassManagerBuilder.h"
 #include "llvm/Transforms/ObjCARC.h"
+#include "llvm/Transforms/Utils/ValueMapper.h"
+#include "llvm/Transforms/Utils/Cloning.h"
 using namespace llvm;
 
+
 static cl::opt<bool>
 DisableOpt("disable-opt", cl::init(false),
   cl::desc("Do not run any optimization passes"));
@@ -68,12 +76,154 @@
 #endif
 }
 
+class ModPartScheme {
+public:
+  typedef SetVector<Function*> PartitionTy;
+  typedef PartitionTy::iterator iterator;
+  typedef PartitionTy::const_iterator const_iterator;
+
+  ModPartScheme() {}
+  ModPartScheme(const ModPartScheme &That) : Partition(That.Partition) {}
+
+  void AddFunction(Function *F) { Partition.insert(F); }
+
+  iterator begin() { return Partition.begin(); }
+  iterator end() { return Partition.end(); }
+  const_iterator begin() const { return Partition.begin(); }
+  const_iterator end() const { return Partition.end(); }
+  int size() { return Partition.size(); }
+  bool count(Function *F) const { return Partition.count(F); }
+
+private:
+  PartitionTy Partition;
+};
+
+class ModPartSchemeMgr {
+public:
+  typedef std::vector<ModPartScheme *> MPSchemeTy;
+  typedef MPSchemeTy::iterator iterator;
+  typedef MPSchemeTy::const_iterator const_iterator;
+
+  ~ModPartSchemeMgr();
+
+  ModPartScheme *CreateEmptyPartition(void) {
+    ModPartScheme *P = new ModPartScheme;
+    PartSchemes.push_back(P);
+    return P;
+  }
+  iterator begin() { return PartSchemes.begin(); }
+  iterator end() { return PartSchemes.end(); }
+  const_iterator begin() const { return PartSchemes.begin(); }
+  const_iterator end() const { return PartSchemes.end(); }
+  int empty() const { return PartSchemes.empty(); }
+  
+private:
+  MPSchemeTy PartSchemes;
+};
+
+class ModPartAnalysis : public ModulePass {
+public:
+  static char ID;
+  ModPartAnalysis(ModPartSchemeMgr &MPSM):
+    ModulePass(ID), ModPartMgr(MPSM), CG(0) {}
+
+  virtual bool runOnModule(Module &M);
+  virtual void getAnalysisUsage(AnalysisUsage &AU) const;
+
+private:
+  // Partition threshold, currently the metric for "size" is the number
+  // of functions in a partition.
+  enum {
+    MaxFuncInPart = 3
+  };
+   
+  class SizeMetric {
+  public:
+    SizeMetric(int func_num=0) : FuncNum(func_num) {};
+    bool ExceedThreshold() const { return FuncNum > MaxFuncInPart; }
+    bool ExceedThresholdTooMuch() const
+      { return FuncNum >= MaxFuncInPart * 3 / 2; }
+    void IncFuncNum(int amt = 1) { FuncNum += amt; };
+    const SizeMetric& operator+=(const SizeMetric &That)
+      { FuncNum += That.FuncNum; return *this; }
+    void Reset() { FuncNum = 0; }
+
+  private:
+    int FuncNum;
+  };
+
+  void setVisited(CallGraphNode *N) { Visited[N] = true; }
+  bool isVisited(CallGraphNode *N) const {
+    return Visited.find(N) != Visited.end();
+  }
+
+  SizeMetric PerformPartitionHelper(CallGraphNode *Root);
+  void EmitPartition(CallGraphNode *DFSRoot, SizeMetric &SM);
+  SizeMetric EvaluateModuleSize(const Module *M) const;
+
+private:
+  ModPartSchemeMgr &ModPartMgr;
+  CallGraph *CG; 
+  std::vector<CallGraphNode *> DFSStack;
+  SizeMetric RemainingModSize; 
+  DenseMap<CallGraphNode *, bool> Visited;
+};
+
+char ModPartAnalysis::ID = 0;
+
+class ModPartXform {
+public:
+  ModPartXform(Module *Mod, ModPartSchemeMgr &MPSM, IPOPartMgr &PM) :
+    PartSchemeMgr(MPSM), IPOPartMgr(PM), MergedModule(Mod), NextPartId(0) {}
+
+  void getWorkDir();
+
+  void PerformTransform();
+
+private:
+  IPOPartition *PerformTransform(ModPartScheme &PartScheme);
+
+  void CollectGlobalSymbol(ModPartScheme &Part, Module *New,
+                           ValueToValueMapTy &VMap);
+  void CollectGlobalSymbol(Function *F, Module *New,
+                           ValueToValueMapTy &VMap);
+
+  Function *CreateFuncDecl(const Function *F, Module *NewMod);
+  GlobalVariable *CreateVarDecl(const GlobalVariable *GV, Module *NewMod);
+  
+private:
+  ModPartSchemeMgr &PartSchemeMgr;
+  IPOPartMgr &IPOPartMgr;
+  Module *MergedModule;
+  int NextPartId;
+};
+
+class PostIPOCompile {
+public:
+  PostIPOCompile(IPOPartMgr &IPM, IPOFileMgr &IFM, bool ToMergeObjs = false) :
+    PartMgr(IPM), FileMgr(IFM), MergedObjFile(0), MergeObjs(ToMergeObjs) {}
+
+  IPOFile *getMergedObjFile() const { return MergedObjFile; }
+
+  bool Compile(std::string &ErrMsg);
+
+private:
+  bool generateMakefile(std::string &ErrMsg);
+
+private:
+  IPOPartMgr &PartMgr;
+  IPOFileMgr &FileMgr;
+  IPOFile *MergedObjFile;
+  bool MergeObjs;
+};
+
 LTOCodeGenerator::LTOCodeGenerator()
   : _context(getGlobalContext()),
     _linker(new Module("ld-temp.o", _context)), _target(NULL),
     _emitDwarfDebugInfo(false), _scopeRestrictionsDone(false),
     _codeModel(LTO_CODEGEN_PIC_MODEL_DYNAMIC),
-    _nativeObjectFile(NULL) {
+    _nativeObjectFile(NULL),
+    _IPOPartMgr(_IPOFileMgr) {
   InitializeAllTargets();
   InitializeAllTargetMCs();
   InitializeAllAsmPrinters();
@@ -161,34 +311,42 @@
 }
 
 bool LTOCodeGenerator::compile_to_file(const char** name, std::string& errMsg) {
-  // make unique temp .o file to put generated object file
-  SmallString<128> Filename;
-  int FD;
-  error_code EC = sys::fs::createTemporaryFile("lto-llvm", "o", FD, Filename);
-  if (EC) {
-    errMsg = EC.message();
+  if (determineTarget(errMsg))
     return true;
-  }
 
-  // generate object file
-  tool_output_file objFile(Filename.c_str(), FD);
+  PostIPOCompile PostIPOStage(_IPOPartMgr, _IPOFileMgr, true/*merge objects*/);
+  if (!_IPOFileMgr.CreateWorkDir(errMsg))
+    return true;
 
-  bool genResult = generateObjectFile(objFile.os(), errMsg);
-  objFile.os().close();
-  if (objFile.os().has_error()) {
-    objFile.os().clear_error();
-    sys::fs::remove(Twine(Filename));
+  performIPO(errMsg, true);
+
+  if (!PostIPOStage.Compile(errMsg))
     return true;
-  }
 
-  objFile.keep();
-  if (genResult) {
-    sys::fs::remove(Twine(Filename));
+  *name = PostIPOStage.getMergedObjFile()->getPath().c_str();
+  return false;
+}
+
+bool LTOCodeGenerator::compile_to_files(const char** name, std::string& errMsg) {
+  if (determineTarget(errMsg))
     return true;
+
+  performIPO(errMsg);
+  
+  // Parallelize post-IPO
+  _nativeObjectPath.clear();
+  PostIPOCompile PostIPOStage(_IPOPartMgr, _IPOFileMgr);
+  if (!PostIPOStage.Compile(errMsg))
+    return true;
+
+  for (IPOPartMgr::iterator I = _IPOPartMgr.begin(), E = _IPOPartMgr.end();
+       I != E; I++) {
+    _nativeObjectPath.append((*I)->getObjFilePath().data());
+    _nativeObjectPath.push_back('\0');
   }
+  _nativeObjectPath.push_back('\0');
+  *name = _nativeObjectPath.c_str();
 
-  _nativeObjectPath = Filename.c_str();
-  *name = _nativeObjectPath.c_str();
   return false;
 }
 
@@ -357,16 +515,12 @@
   _scopeRestrictionsDone = true;
 }
 
-/// Optimize merged modules using various IPO passes
-bool LTOCodeGenerator::generateObjectFile(raw_ostream &out,
-                                          std::string &errMsg) {
-  if (this->determineTarget(errMsg))
-    return true;
 
+void LTOCodeGenerator::performIPO(std::string &errMsg, bool PerformPartition) {
   Module* mergedModule = _linker.getModule();
 
   // Mark which symbols can not be internalized
-  this->applyScopeRestrictions();
+  applyScopeRestrictions();
 
   // Instantiate the pass manager to organize the passes.
   PassManager passes;
@@ -390,13 +544,30 @@
   // Make sure everything is still good.
   passes.add(createVerifierPass());
 
+  ModPartSchemeMgr MPSM;
+  if (PerformPartition)
+    passes.add(new ModPartAnalysis(MPSM));
+
+  passes.run(*mergedModule);
+ 
+  if (!MPSM.empty()) {
+    ModPartXform MPT(mergedModule, MPSM, _IPOPartMgr);
+    MPT.PerformTransform();
+  } else {
+    IPOPartition *P = _IPOPartMgr.CreateIPOPart(mergedModule);
+    P->SaveBitCode();
+  }
+}
+
+bool LTOCodeGenerator::performPostLTO(Module *Mod, formatted_raw_ostream &Out,
+                                      std::string &errMsg) {
+  // placeholder for post-IPO scalar opt
+
   PassManager codeGenPasses;
 
   codeGenPasses.add(new DataLayout(*_target->getDataLayout()));
   _target->addAnalysisPasses(codeGenPasses);
 
-  formatted_raw_ostream Out(out);
-
   // If the bitcode files contain ARC code and were compiled with optimization,
   // the ObjCARCContractPass must be run, so do it unconditionally here.
   codeGenPasses.add(createObjCARCContractPass());
@@ -404,16 +575,31 @@
   if (_target->addPassesToEmitFile(codeGenPasses, Out,
                                    TargetMachine::CGFT_ObjectFile)) {
     errMsg = "target file type not supported";
+    return true;
+  }
+
+  // Run the code generator, and write assembly file
+  codeGenPasses.run(*Mod);
+  return false;
+}
+
+/// Optimize merged modules using various IPO passes
+bool LTOCodeGenerator::generateObjectFile(Module *Mod, const char *FN,
+                                          std::string &errMsg) {
+  std::string errFile;
+  tool_output_file OutFile(FN, errFile, raw_fd_ostream::F_Binary);
+ 
+  if (!errFile.empty()) {
+    errMsg += errFile;
     return true;
   }
+  OutFile.keep();
 
-  // Run our queue of passes all at once now, efficiently.
-  passes.run(*mergedModule);
+  formatted_raw_ostream OS(OutFile.os());
+  bool Fail = performPostLTO(Mod, OS, errMsg);
+  OutFile.os().close();
 
-  // Run the code generator, and write assembly file
-  codeGenPasses.run(*mergedModule);
-
-  return false; // success
+  return Fail;
 }
 
 /// setCodeGenDebugOptions - Set codegen debugging options to aid in debugging
@@ -428,3 +614,495 @@
     _codegenOptions.push_back(strdup(o.first.str().c_str()));
   }
 }
+
+///////////////////////////////////////////////////////////////////////////
+//
+//     Implementation of ModPartSchemeMgr, ModPartXform
+//
+///////////////////////////////////////////////////////////////////////////
+//
+ModPartSchemeMgr::~ModPartSchemeMgr() {
+  while (!PartSchemes.empty()) {
+    delete PartSchemes.back();
+    PartSchemes.pop_back();
+  }
+}
+
+Function *ModPartXform::CreateFuncDecl(const Function *F, Module *NewModule) {
+  Function *NF = Function::Create(F->getFunctionType(),
+                                  GlobalValue::ExternalLinkage,
+                                  F->getName(), NewModule);
+  NF->copyAttributesFrom(F);
+  return NF;  
+}
+
+static void PromoteGlobalVarLinkage(GlobalVariable *GV) {
+  GV->setLinkage(GlobalValue::ExternalLinkage);
+}
+
+static void PromoteGlobalFuncLinkage(Function *F) {
+  F->setLinkage(GlobalValue::ExternalLinkage);
+}
+
+GlobalVariable *ModPartXform::CreateVarDecl(const GlobalVariable *GV,
+                                            Module *NewMod) {
+  GlobalVariable *G;
+  G = new GlobalVariable(*NewMod, GV->getType()->getElementType(),
+                         GV->isConstant(),
+                         GlobalValue::ExternalLinkage,
+                         0 /* InitVal */,
+                         Twine(GV->getName()), 
+                         0 /* Before */,
+                         GV->getThreadLocalMode(),
+                         GV->getType()->getPointerAddressSpace(),
+                         GV->hasInitializer() ? true : false);
+  return G;
+}
+
+void ModPartXform::CollectGlobalSymbol(Function *F,
+                                       Module *New,
+                                       ValueToValueMapTy &VMap) {
+  DenseMap<Value *, bool> Visited;
+  SmallVector<Constant *, 16> WorkList;
+   
+  for (Function::iterator BI = F->begin(), BE = F->end(); BI != BE; BI++) {
+    for (BasicBlock::iterator II = BI->begin(), IE = BI->end(); 
+         II != IE; II++) {
+      Instruction &Inst = *II;
+      for (User::op_iterator op = Inst.op_begin(), E = Inst.op_end();
+            op != E; ++op) {
+        if (Constant *C = dyn_cast<Constant>(*op)) {
+          if (!isa<BasicBlock>(C) && Visited.find(C) == Visited.end()) {
+            Visited[C] = true;
+            WorkList.push_back(C);
+          }
+        }
+      }
+    }
+
+    while(!WorkList.empty()) {
+      Constant *C = WorkList.pop_back_val();
+      if (GlobalVariable *GV = dyn_cast<GlobalVariable>(C)) {
+        if (VMap.find(GV) == VMap.end()) {
+          VMap[GV] = CreateVarDecl(GV, New);
+          PromoteGlobalVarLinkage(GV);
+        }
+        continue;
+      } else if (Function *Func = dyn_cast<Function>(C)) {
+        if (VMap.find(Func) == VMap.end()) {
+          VMap[Func] = CreateFuncDecl(Func, New);
+          PromoteGlobalFuncLinkage(Func);
+        }
+        continue;
+      }
+
+      for (User::const_op_iterator I = C->op_begin(), E = C->op_end();
+             I != E; ++I) {
+        Constant *C2 = dyn_cast<Constant>(*I);
+        if (C2 && Visited.find(C2) == Visited.end()) {
+          Visited[C2] = true;
+          WorkList.push_back(C2);
+        }
+      }
+    }
+  }
+}
+
+void ModPartXform::CollectGlobalSymbol(ModPartScheme &Part, Module *New,
+                                       ValueToValueMapTy &VMap) {
+  for (ModPartScheme::iterator I = Part.begin(), E = Part.end();
+       I != E; I++) {
+    const Function *F = *I;
+    VMap[F] = CreateFuncDecl(F, New);
+  }
+
+  for (ModPartScheme::iterator I = Part.begin(), E = Part.end();
+       I != E; I++)
+    CollectGlobalSymbol(*I, New, VMap);
+}
+
+void __attribute__((used)) dump_module(Module *M) {
+  std::string EI;
+  tool_output_file f("module.ll", EI);
+  f.keep();
+
+  M->print(f.os(), 0);
+}
+
+void __attribute__((used)) dump_type(Type *T) {
+  T->dump();
+}
+
+// Splitting the merged module by moving the specified functions to the
+// new module
+IPOPartition *ModPartXform::PerformTransform(ModPartScheme &PartScheme) {
+  std::string FN;
+  raw_string_ostream OS(FN);
+  OS << "partition" << NextPartId++;
+
+  Module *NewMod = new Module(OS.str(), MergedModule->getContext());
+  NewMod->setDataLayout(MergedModule->getDataLayout());
+  NewMod->setTargetTriple(MergedModule->getTargetTriple());
+
+  ValueToValueMapTy VMap;
+  CollectGlobalSymbol(PartScheme, NewMod, VMap);
+
+  // Copy over functions in the partition
+  for (ModPartScheme::iterator I = PartScheme.begin(), E = PartScheme.end();
+       I != E; I++) {
+    Function *OF = *I;
+    Function *NF = cast<Function>(VMap[OF]);
+
+    // Steal some code from llvm::CloneFunction. 
+    {
+      // Loop over the arguments, copying the names of the mapped arguments over...
+      Function::arg_iterator DestI = NF->arg_begin();
+      for (Function::const_arg_iterator I = OF->arg_begin(), E = OF->arg_end();
+           I != E; ++I)
+        // Is this argument preserved? WTF, how come an argument is preserved?
+        if (VMap.count(I) == 0) {
+          DestI->setName(I->getName()); // Copy the name over...
+          VMap[I] = DestI++; // Add mapping to VMap
+        }
+    }
+    SmallVector<ReturnInst*, 8> Returns;
+    CloneFunctionInto(NF, OF, VMap, true, Returns);
+
+    OF->deleteBody();
+  }
+
+  IPOPartition *NewPart = IPOPartMgr.CreateIPOPart(NewMod);
+  
+  // We have to save the module to disk such that next time it's loaded
+  // it belongs to a different context.
+  NewPart->SaveBitCode();
+ 
+  return NewPart;
+}
+
+void ModPartXform::PerformTransform() {
+  for (ModPartSchemeMgr::iterator I = PartSchemeMgr.begin(), 
+         E = PartSchemeMgr.end(); I != E; I++)
+    (void)PerformTransform(**I);
+
+  IPOPartition *MP = IPOPartMgr.CreateIPOPart(MergedModule);
+  MP->SaveBitCode();
+}
+
+///////////////////////////////////////////////////////////////////////////
+//
+//     Implementation of ModPartAnalysis
+//
+///////////////////////////////////////////////////////////////////////////
+//
+void ModPartAnalysis::getAnalysisUsage(AnalysisUsage &AU) const {
+  AU.setPreservesAll();
+  AU.addRequired<CallGraph>();
+}
+
+void ModPartAnalysis::EmitPartition(CallGraphNode *DFSRoot, SizeMetric &SM) {
+  ModPartScheme *P = ModPartMgr.CreateEmptyPartition();
+  while (!DFSStack.empty()) {
+    CallGraphNode *N = DFSStack.back();
+    P->AddFunction(N->getFunction());
+    DFSStack.pop_back();
+    if (N == DFSRoot)
+      break;
+  }
+}
+
+ModPartAnalysis::SizeMetric
+ModPartAnalysis::PerformPartitionHelper(CallGraphNode *R) {
+  SizeMetric SM;
+  setVisited(R);
+
+  // Skip dummy call-graph node or declaration
+  {
+    Function *F = R->getFunction();
+    if (!F || F->isDeclaration())
+      return SM;
+  }
+
+  DFSStack.push_back(R);
+  SM.IncFuncNum();
+
+  for (CallGraphNode::iterator I = R->begin(), E = R->end(); I != E; I++) {
+    CallGraphNode *Callee = (*I).second;
+    if (isVisited(Callee))
+      continue;
+
+    setVisited(Callee);
+
+    // Skip dummy call-graph node or declaration
+    Function *F = Callee->getFunction();
+    if (!F || F->isDeclaration())
+      continue;
+
+    SizeMetric T = PerformPartitionHelper(Callee);
+    bool Emit = false;
+
+    if (T.ExceedThreshold())
+      Emit = true;
+    else {
+      SM += T;
+      Emit = SM.ExceedThreshold();
+    }
+
+    if (Emit) {
+      EmitPartition(R, SM);
+      SM.Reset();
+      if (!RemainingModSize.ExceedThresholdTooMuch())
+        break;
+    }
+  }
+  return SM;
+}
+
+// Return the "size" of given module.
+ModPartAnalysis::SizeMetric ModPartAnalysis::EvaluateModuleSize
+  (const Module *M) const {
+  SizeMetric S;
+  for (Module::const_iterator I = M->begin(), E = M->end(); I != E; I++) {
+    const Function &F = *I;
+    if (!F.isDeclaration())
+      S.IncFuncNum();
+  }
+  return S;
+}
+
+bool ModPartAnalysis::runOnModule(Module &M) {
+  SizeMetric S = EvaluateModuleSize(&M);
+  if (!S.ExceedThresholdTooMuch()) {
+    // While it may be big, it is okay.
+    return false;
+  }
+
+  if (!(CG = getAnalysisIfAvailable<CallGraph>()))
+    return false;
+
+  CallGraphNode *R = CG->getRoot();  
+  (void)PerformPartitionHelper(R);
+
+  return false;
+}
+
+// /////////////////////////////////////////////////////////////////////////////
+//
+//   Implementation of IPOPartition and IPOPartMgr
+//
+// /////////////////////////////////////////////////////////////////////////////
+//
+IPOPartition::IPOPartition(Module *M, const char *NameWoExt, IPOFileMgr &FM) :
+  Mod(M), Ctx(0), IRFile(0), ObjFile(0), FileNameWoExt(NameWoExt), FileMgr(FM) {
+}
+
+IPOFile &IPOPartition::getIRFile() const {
+  if (IRFile)
+    return *IRFile;
+  else {
+    std::string FN(FileNameWoExt + ".bc");
+    return *(IRFile = FileMgr.CreateIRFile(FN.c_str()));
+  }
+}
+
+IPOFile &IPOPartition::getObjFile() const {
+  if (ObjFile)
+    return *ObjFile;
+  else {
+    std::string FN(FileNameWoExt + ".o");
+    return *(ObjFile = FileMgr.CreateObjFile(FN.c_str()));
+  }
+}
+
+
+bool IPOPartition::SaveBitCode() {
+  if (!Mod) {
+    // The bitcode has already been saved to disk.
+    return true;
+  }
+
+  IPOFile &F = getIRFile();
+  if (F.ErrOccur())
+    return false;
+
+  raw_fd_ostream OF(F.getPath().c_str(), F.getLastErrStr(),
+                    raw_fd_ostream::F_Binary);
+  WriteBitcodeToFile(Mod, OF);
+  OF.close();
+
+  Mod = 0;
+  delete Ctx;
+  Ctx = 0;
+ 
+  return !F.ErrOccur();
+}
+
+bool IPOPartition::LoadBitCode() {
+  if (Mod)
+    return true;
+
+  IPOFile &F = getIRFile();
+  if (F.ErrOccur())
+    return false;
+
+  Ctx = new LLVMContext;
+  SMDiagnostic Diag;
+  Mod = ParseIRFile(getIRFilePath(), Diag, *Ctx);
+  if (!Mod) {
+    F.getLastErrStr() = Diag.getMessage();
+    return false;
+  }
+
+  return true;
+}
+
+IPOPartition *IPOPartMgr::CreateIPOPart(Module *M) {
+  std::string PartName;
+  raw_string_ostream OS(PartName); 
+  OS << "part" << NextPartId++;
+
+  IPOPartition *P = new IPOPartition(M, OS.str().c_str(), FileMgr);
+  P->Mod = M;
+  IPOParts.push_back(P);
+  return P;
+}
+
+// /////////////////////////////////////////////////////////////////////////////
+//
+//      Implementation of IPOFile and IPOFileMgr 
+//  
+// /////////////////////////////////////////////////////////////////////////////
+IPOFile::IPOFile(const char *DirName, const char *BaseName, bool KeepFile)
+  : Fname(BaseName), Fpath(DirName), Keep(KeepFile) {
+  Fpath = Fpath + "/" + BaseName;
+  Keep = true; // FIXME: overrides KeepFile, so every file is kept for now.
+}
+
+IPOFile::~IPOFile() {
+  if (!Keep)
+    sys::fs::remove(Fpath);
+}
+
+IPOFileMgr::IPOFileMgr(): WorkDir("llvmipo") {
+  IRFiles.reserve(20);
+  ObjFiles.reserve(20);
+  OtherFiles.reserve(8);
+  KeepFiles = true;
+  WorkDirCreated = false;
+}
+
+IPOFileMgr::~IPOFileMgr() {
+  if (!KeepFiles) {
+    uint32_t NumRm;
+    sys::fs::remove_all(Twine(WorkDir), NumRm);
+  }
+}
+
+bool IPOFileMgr::CreateWorkDir(std::string &ErrorInfo) {
+  if (WorkDirCreated)
+    return true;
+
+  bool Exist;
+  error_code EC = sys::fs::create_directory(Twine(WorkDir), Exist);
+  if (EC == error_code::success()) {
+    WorkDirCreated = true;
+    return true;
+  }
+  ErrorInfo += EC.message();
+  return false;
+}
+
+IPOFile *IPOFileMgr::CreateIRFile(const char *Name) {
+  IPOFile *F = CreateFile(Name);
+  IRFiles.push_back(F);
+  return F;
+}
+
+IPOFile *IPOFileMgr::CreateObjFile(const char *Name) {
+  IPOFile *F = CreateFile(Name);
+  ObjFiles.push_back(F);
+  return F;
+}
+
+IPOFile *IPOFileMgr::CreateMakefile(const char *Name) {
+  IPOFile *F = CreateFile(Name);
+  OtherFiles.push_back(F);
+  return F;
+}
+
+// /////////////////////////////////////////////////////////////////////////////
+//
+//      Implementation of PostIPOCompile
+//
+// /////////////////////////////////////////////////////////////////////////////
+
+// The makefile looks something like this:
+//
+//  .PHONY: all
+//
+//  BC = part1.bc part2.bc part3.bc
+//  OBJ = ${BC:.bc=.o}
+//
+//  all : merged.o
+//  %.o : %.bc
+//      $(HOME)/tmp/lto.llc -filetype=obj $< -o $@
+//
+//  merged.o : $(OBJ)
+//      /usr/bin/ld $+ -r -o $@
+//
+bool PostIPOCompile::generateMakefile(std::string &ErrMsg) {
+
+  IPOFile *MkFile = FileMgr.CreateMakefile("Makefile");
+
+  std::string NewErrMsg;
+  raw_fd_ostream Mk(MkFile->getPath().c_str(), NewErrMsg, 0);
+
+  if (!NewErrMsg.empty()) {
+    ErrMsg += NewErrMsg;
+    return false;
+  }
+
+  std::string BCFiles;
+  for (IPOPartMgr::iterator I = PartMgr.begin(), E = PartMgr.end();
+       I != E; I++) {
+    BCFiles += (*I)->getIRFile().getName();
+    BCFiles += " ";
+  }
+
+  // ".PHONY: all" declares a phony target; ".PHONY = all" would only
+  // assign a variable named .PHONY.
+  Mk << ".PHONY: all\n\n";
+  Mk << "\nBC = " <<  BCFiles << "\n";
+  Mk << "OBJ = ${BC:.bc=.o}\n\n";
+
+  if (MergeObjs)
+    Mk << "all : " << MergedObjFile->getName() << "\n";
+  else
+    Mk << "all : $(OBJ)\n";
+
+  // Emit the pattern rule that compiles one partition to one object file.
+  Mk << "%.o : %.bc\n\t$(HOME)/tmp/lto.llc -filetype=obj $< -o $@\n\n";
+
+  if (MergeObjs) {
+    Mk << MergedObjFile->getName() << " : $(OBJ)\n";
+    Mk << "\t/usr/bin/ld $+ -r -o $@\n\n";
+  }
+
+  Mk.close();
+
+  return true;
+}
+
+bool PostIPOCompile::Compile(std::string &ErrMsg) {
+  if (MergeObjs)
+    MergedObjFile = FileMgr.CreateObjFile("merged.o");
+
+  if (!generateMakefile(ErrMsg))
+    return false;
+
+  const char *args[] = {"/usr/bin/make", "-C", 0, 0};
+  args[2] = FileMgr.getWorkDir().c_str();
+  
+  bool Fail = false;
+  int RC = sys::ExecuteAndWait("/usr/bin/make", args, 0/*envp*/, 0/*redirect*/, 0/*wait*/, 0/*memlimit*/, &ErrMsg, &Fail);
+  return !Fail && RC == 0; // A non-zero exit status from make is a failure too.
+}
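
Note on the generated Makefile: the following is a minimal standalone sketch, not part of the patch, of the text that generateMakefile() is meant to produce. The helper name buildMakefileText and its parameters are hypothetical, and it uses only the standard library so it can be tried outside the LLVM tree. It assumes the pattern rule is what allows the partitions to be compiled concurrently when make is given -jN; the patch as posted runs plain "make -C <workdir>".

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not in the patch): returns the Makefile text for the
// given partition bitcode files. An empty MergedObj means "do not merge".
std::string buildMakefileText(const std::vector<std::string> &BCFiles,
                              const std::string &LlcPath,
                              const std::string &MergedObj) {
  assert(!BCFiles.empty() && "need at least one partition");
  std::ostringstream Mk;

  // ".PHONY: all" is a special target; ".PHONY = all" would only define
  // a variable.
  Mk << ".PHONY: all\n\n";

  Mk << "BC =";
  for (const std::string &F : BCFiles)
    Mk << ' ' << F;
  Mk << "\nOBJ = ${BC:.bc=.o}\n\n";

  Mk << "all : " << (MergedObj.empty() ? "$(OBJ)" : MergedObj) << "\n\n";

  // One rule instance per partition; the instances do not depend on each
  // other, so a parallel make can run them side by side.
  Mk << "%.o : %.bc\n\t" << LlcPath << " -filetype=obj $< -o $@\n\n";

  // Optionally merge the partition objects with a relocatable link.
  if (!MergedObj.empty())
    Mk << MergedObj << " : $(OBJ)\n\t/usr/bin/ld $+ -r -o $@\n\n";

  return Mk.str();
}
```

Calling buildMakefileText({"part1.bc", "part2.bc"}, "llc", "merged.o") yields a Makefile whose default target is merged.o; passing an empty MergedObj makes "all" depend on $(OBJ) directly.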
Index: tools/lto/LTOCodeGenerator.h
===================================================================
--- tools/lto/LTOCodeGenerator.h	(revision 186109)
+++ tools/lto/LTOCodeGenerator.h	(working copy)
@@ -18,6 +18,8 @@
 #include "llvm/ADT/SmallPtrSet.h"
 #include "llvm/ADT/StringMap.h"
 #include "llvm/Linker.h"
+#include "llvm/Support/FormattedStream.h"
+#include "llvm/Support/system_error.h"
 #include <string>
 #include <vector>
 
@@ -28,6 +30,111 @@
   class MemoryBuffer;
   class TargetMachine;
   class raw_ostream;
+
+  class IPOFile;
+  class IPOFileMgr;
+  class IPOPartition {
+  public:
+    bool isInMemory() const { return Mod != 0; }
+    bool SaveBitCode();
+    bool LoadBitCode();
+    const std::string &getIRFilePath() const;
+    const std::string &getObjFilePath() const;
+    Module *getModule() const { return Mod; }
+  
+    IPOFile &getIRFile() const;
+    IPOFile &getObjFile() const;
+
+  private:
+    friend class IPOPartMgr;
+    IPOPartition(Module *M, const char *FileNameWoExt, IPOFileMgr &FM);
+
+    Module *Mod; 
+    LLVMContext *Ctx;
+    mutable IPOFile *IRFile;
+    mutable IPOFile *ObjFile;
+    std::string FileNameWoExt;
+    IPOFileMgr &FileMgr;
+  };
+  
+  class IPOPartMgr {
+  public:
+    typedef std::vector<IPOPartition *> IPOPartsTy;
+    typedef IPOPartsTy::iterator iterator;
+    typedef IPOPartsTy::const_iterator const_iterator;
+  
+    iterator begin() { return IPOParts.begin(); }
+    iterator end() { return IPOParts.end(); }
+    const_iterator begin() const { return IPOParts.begin(); }
+    const_iterator end() const { return IPOParts.end(); }
+
+    IPOPartition *CreateIPOPart(Module *);
+
+    IPOPartMgr(IPOFileMgr &IFM) : FileMgr(IFM), NextPartId(1) {}
+
+  private:
+    IPOPartsTy IPOParts;
+    IPOFileMgr &FileMgr;
+    int NextPartId;
+  };
+
+  class IPOFile {
+  public:
+    const std::string &getName() { return Fname; }
+    const std::string &getPath() { return Fpath; }
+
+    error_code &getLastErrCode() { return LastErr; }
+    std::string &getLastErrStr() { return LastErrStr; }
+
+    bool ErrOccur() const {
+      return LastErr != error_code::success() || !LastErrStr.empty();
+    }
+
+    void setKeep() { Keep = true; }
+    bool toKeep() const { return Keep; }
+
+  private:
+    friend class IPOFileMgr;
+    IPOFile(const char *DirName, const char *BaseName, bool Keep = false);
+    ~IPOFile();
+  
+  private:
+    std::string Fname;
+    std::string Fpath;
+    error_code LastErr;
+    std::string LastErrStr;
+    bool Keep;
+  };
+
+  class IPOFileMgr {
+  public:
+    IPOFileMgr();
+    ~IPOFileMgr();
+
+    bool CreateWorkDir(std::string &ErrorInfo);
+    const std::string &getWorkDir() { return WorkDir; }
+
+    IPOFile *CreateIRFile(const char *Name);
+    IPOFile *CreateObjFile(const char *Name);
+    IPOFile *CreateMakefile(const char *Name);
+
+    typedef std::vector<IPOFile *> FileVect;
+    FileVect &getIRFiles() { return IRFiles; }
+    FileVect &getObjFiles() { return ObjFiles; }
+
+  private:
+    IPOFile *CreateFile(const char *Name) {
+      return new IPOFile(WorkDir.c_str(), Name);
+    }
+
+  private:
+    FileVect IRFiles;
+    FileVect ObjFiles;
+    FileVect OtherFiles;
+    std::string WorkDir;
+    bool KeepFiles;
+    bool WorkDirCreated;
+  };
 }
 
 //===----------------------------------------------------------------------===//
@@ -52,11 +159,16 @@
 
   bool writeMergedModules(const char *path, std::string &errMsg);
   bool compile_to_file(const char **name, std::string &errMsg);
+  bool compile_to_files(const char **name, std::string &errMsg);
   const void *compile(size_t *length, std::string &errMsg);
   void setCodeGenDebugOptions(const char *opts);
 
 private:
-  bool generateObjectFile(llvm::raw_ostream &out, std::string &errMsg);
+  void performIPO(std::string &errMsg, bool PerformPartition=false);
+  bool performPostLTO(llvm::Module *Mod, llvm::formatted_raw_ostream &Out,
+                      std::string &errMsg);
+
+  bool generateObjectFile(llvm::Module *, const char *Out, std::string &errMsg);
   void applyScopeRestrictions();
   void applyRestriction(llvm::GlobalValue &GV,
                         std::vector<const char*> &mustPreserveList,
@@ -78,6 +190,8 @@
   std::vector<char*>          _codegenOptions;
   std::string                 _mCpu;
   std::string                 _nativeObjectPath;
+  llvm::IPOPartMgr            _IPOPartMgr;
+  llvm::IPOFileMgr            _IPOFileMgr;
 };
 
 #endif // LTO_CODE_GENERATOR_H
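
Note on the partitioning idea: ModPartAnalysis earlier in this patch walks the call graph depth-first, keeps the functions seen so far on DFSStack, and pops the stack down to a root node whenever the accumulated size exceeds a threshold. The sketch below is a simplified post-order variant of that idea, not the patch's exact logic. The type PartitionSketch, the adjacency-list call graph, and the "size = number of functions" metric are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the call graph: Callees[N] lists the callees of
// function N. Size is counted in functions.
struct PartitionSketch {
  const std::vector<std::vector<int> > &Callees;
  std::size_t Threshold;
  std::vector<bool> Visited;
  std::vector<int> DFSStack;
  std::vector<std::vector<int> > Parts;

  PartitionSketch(const std::vector<std::vector<int> > &CG, std::size_t T)
      : Callees(CG), Threshold(T), Visited(CG.size(), false) {
    assert(T > 0 && "threshold must be positive");
  }

  // Pop the DFS stack down to and including Root; the popped functions
  // form one partition.
  void emitPartition(int Root) {
    std::vector<int> P;
    while (!DFSStack.empty()) {
      int N = DFSStack.back();
      DFSStack.pop_back();
      P.push_back(N);
      if (N == Root)
        break;
    }
    Parts.push_back(P);
  }

  // Returns the number of functions at or below N that are still unassigned.
  std::size_t visit(int N) {
    Visited[N] = true;
    DFSStack.push_back(N);
    std::size_t Size = 1;
    for (std::size_t I = 0; I != Callees[N].size(); ++I) {
      int C = Callees[N][I];
      if (!Visited[C])
        Size += visit(C);
    }
    if (Size >= Threshold) {
      emitPartition(N);
      return 0;
    }
    return Size;
  }

  // Partition everything reachable from Root; leftovers form a last part.
  void run(int Root) {
    visit(Root);
    if (!DFSStack.empty())
      emitPartition(Root);
  }
};
```

With the call graph 0 -> {1, 2}, 1 -> {3, 4}, 2 -> {5} and a threshold of 3, run(0) yields two partitions: {4, 3, 1} and {5, 2, 0}. Keeping a caller together with its callees in this way is one plausible reason for walking the call graph instead of splitting the function list arbitrarily.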