[PATCH] D87966: [ThinLTO] Re-order modules for optimal multi-threaded processing
Alexandre Ganea via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Sat Sep 19 08:31:38 PDT 2020
aganea created this revision.
aganea added reviewers: tejohnson, mehdi_amini, pcc, evgeny777.
Herald added subscribers: llvm-commits, dexonsmith, steven_wu, mgrang, hiraditya, inglorion.
Herald added a project: LLVM.
aganea requested review of this revision.
This is a reimplementation of D60495 <https://reviews.llvm.org/D60495> but with Teresa's suggestion applied: https://reviews.llvm.org/D60495#1562871
I've tested a 3-stage compilation; the graphs below show linking of `clang.exe` with `-flto=thin` and `-DLLVM_INTEGRATED_CRT_ALLOC=d:\git\rpmalloc`, the latter to alleviate Windows Heap scaling issues on many-core machines. The test was run on a 36-core Xeon 6140.
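For reference, the LTO stage of that build was configured roughly as follows (a sketch only; the generator, build type and rpmalloc checkout path are assumptions based on the flags mentioned above):

  cmake -G Ninja ..\llvm ^
    -DCMAKE_BUILD_TYPE=Release ^
    -DLLVM_ENABLE_LTO=Thin ^
    -DLLVM_USE_CRT_RELEASE=MT ^
    -DLLVM_INTEGRATED_CRT_ALLOC=d:\git\rpmalloc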
Before (total run is 102 sec):
F13006262: image.png <https://reviews.llvm.org/F13006262>
After patch (total run is 85 sec):
F13006304: image.png <https://reviews.llvm.org/F13006304>
The remaining issue after the falloff in the graph is `PassBuilder.cpp`, which takes a long time to opt+codegen. If that file were split into several .cpp files, I suppose the linking could complete in about 70 sec.
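For context, the scheduling heuristic this patch factors into `lto::generateModulesOrdering()` is simply a largest-first index sort, so the biggest translation units are handed to the thread pool first and are less likely to become the long tail. A minimal standalone sketch follows (the function and variable names here are illustrative, not the LLVM API):

  // Sort indices by payload size, descending: the largest work items are
  // scheduled first, which shortens the tail on a fixed-size thread pool.
  #include <algorithm>
  #include <cstddef>
  #include <numeric>
  #include <vector>

  std::vector<int> orderBySizeDescending(const std::vector<size_t> &Sizes) {
    std::vector<int> Order(Sizes.size());
    std::iota(Order.begin(), Order.end(), 0);
    std::sort(Order.begin(), Order.end(),
              [&](int L, int R) { return Sizes[L] > Sizes[R]; });
    return Order;
  }

  // Example: sizes {10, 500, 120} give the ordering {1, 2, 0}, so the
  // 500-byte module starts first instead of last.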
However, there's an issue with this patch: the ThinLTO tests fail because things are now out of order. Please see: F13006332: errors.txt <https://reviews.llvm.org/F13006332>
What should I do about that? Some tests could be fixed (by updating the expected ordering), but I am unsure about others.
Could anybody with an expert eye take a look? TIA!
Repository:
rG LLVM Github Monorepo
https://reviews.llvm.org/D87966
Files:
llvm/include/llvm/LTO/LTO.h
llvm/lib/LTO/LTO.cpp
llvm/lib/LTO/ThinLTOCodeGenerator.cpp
Index: llvm/lib/LTO/ThinLTOCodeGenerator.cpp
===================================================================
--- llvm/lib/LTO/ThinLTOCodeGenerator.cpp
+++ llvm/lib/LTO/ThinLTOCodeGenerator.cpp
@@ -1054,19 +1054,11 @@
ModuleToDefinedGVSummaries[ModuleIdentifier];
}
- // Compute the ordering we will process the inputs: the rough heuristic here
- // is to sort them per size so that the largest module get schedule as soon as
- // possible. This is purely a compile-time optimization.
- std::vector<int> ModulesOrdering;
- ModulesOrdering.resize(Modules.size());
- std::iota(ModulesOrdering.begin(), ModulesOrdering.end(), 0);
- llvm::sort(ModulesOrdering, [&](int LeftIndex, int RightIndex) {
- auto LSize =
- Modules[LeftIndex]->getSingleBitcodeModule().getBuffer().size();
- auto RSize =
- Modules[RightIndex]->getSingleBitcodeModule().getBuffer().size();
- return LSize > RSize;
- });
+ std::vector<BitcodeModule *> ModulesVec;
+ ModulesVec.reserve(Modules.size());
+ for (auto &Mod : Modules)
+ ModulesVec.push_back(&Mod->getSingleBitcodeModule());
+ std::vector<int> ModulesOrdering = lto::generateModulesOrdering(ModulesVec);
// Parallel optimizer + codegen
{
Index: llvm/lib/LTO/LTO.cpp
===================================================================
--- llvm/lib/LTO/LTO.cpp
+++ llvm/lib/LTO/LTO.cpp
@@ -1443,10 +1443,17 @@
auto &ModuleMap =
ThinLTO.ModulesToCompile ? *ThinLTO.ModulesToCompile : ThinLTO.ModuleMap;
+ std::vector<BitcodeModule *> ModulesVec;
+ ModulesVec.reserve(ModuleMap.size());
+ for (auto &Mod : ModuleMap)
+ ModulesVec.push_back(&Mod.second);
+ std::vector<int> ModulesOrdering = generateModulesOrdering(ModulesVec);
+
// Tasks 0 through ParallelCodeGenParallelismLevel-1 are reserved for combined
// module and parallel code generation partitions.
unsigned Task = RegularLTO.ParallelCodeGenParallelismLevel;
- for (auto &Mod : ModuleMap) {
+ for (auto IndexCount : ModulesOrdering) {
+ auto &Mod = *(ModuleMap.begin() + IndexCount);
if (Error E = BackendProc->start(Task, Mod.second, ImportLists[Mod.first],
ExportLists[Mod.first],
ResolvedODR[Mod.first], ThinLTO.ModuleMap))
@@ -1495,3 +1502,18 @@
StatsFile->keep();
return std::move(StatsFile);
}
+
+// Compute the ordering in which to process the inputs: the rough heuristic
+// here is to sort them by size so that the largest modules get scheduled as
+// soon as possible. This is purely a compile-time optimization.
+std::vector<int> lto::generateModulesOrdering(ArrayRef<BitcodeModule *> R) {
+ std::vector<int> ModulesOrdering;
+ ModulesOrdering.resize(R.size());
+ std::iota(ModulesOrdering.begin(), ModulesOrdering.end(), 0);
+ llvm::sort(ModulesOrdering, [&](int LeftIndex, int RightIndex) {
+ auto LSize = R[LeftIndex]->getBuffer().size();
+ auto RSize = R[RightIndex]->getBuffer().size();
+ return LSize > RSize;
+ });
+ return ModulesOrdering;
+}
Index: llvm/include/llvm/LTO/LTO.h
===================================================================
--- llvm/include/llvm/LTO/LTO.h
+++ llvm/include/llvm/LTO/LTO.h
@@ -91,6 +91,10 @@
Expected<std::unique_ptr<ToolOutputFile>>
setupStatsFile(StringRef StatsFilename);
+/// Produce an ordering for optimal multi-threaded processing.
+/// Returns indices into the input array.
+std::vector<int> generateModulesOrdering(ArrayRef<BitcodeModule *> R);
+
class LTO;
struct SymbolResolution;
class ThinBackendProc;