[PATCH] D87966: [ThinLTO] Re-order modules for optimal multi-threaded processing
Alexandre Ganea via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Sat Sep 19 08:31:38 PDT 2020
aganea created this revision.
aganea added reviewers: tejohnson, mehdi_amini, pcc, evgeny777.
Herald added subscribers: llvm-commits, dexonsmith, steven_wu, mgrang, hiraditya, inglorion.
Herald added a project: LLVM.
aganea requested review of this revision.
This is a reimplementation of D60495 <https://reviews.llvm.org/D60495> but with Teresa's suggestion applied: https://reviews.llvm.org/D60495#1562871
I've tested a 3-stage compilation; the graphs below show linking of `clang.exe` with `-flto=thin` and `-DLLVM_INTEGRATED_CRT_ALLOC=d:\git\rpmalloc`, the latter to alleviate Windows Heap scaling issues on many-core machines. The test was run on a 36-core Xeon 6140.
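For reference, the LTO stage of that build was configured roughly as follows (a sketch only; the generator, build type and rpmalloc checkout path are assumptions based on the flags mentioned above):

  cmake -G Ninja ..\llvm ^
    -DCMAKE_BUILD_TYPE=Release ^
    -DLLVM_ENABLE_LTO=Thin ^
    -DLLVM_USE_CRT_RELEASE=MT ^
    -DLLVM_INTEGRATED_CRT_ALLOC=d:\git\rpmalloc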
Before (total run is 102 sec):
F13006262: image.png <https://reviews.llvm.org/F13006262>
After patch (total run is 85 sec):
F13006304: image.png <https://reviews.llvm.org/F13006304>
The remaining issue after the falloff in the graph is `PassBuilder.cpp`, which takes a long time to opt+codegen. If that file were split into several .cpp files, I suppose the linking could complete in about 70 sec.
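For context, the scheduling heuristic this patch factors into `lto::generateModulesOrdering()` is simply a largest-first index sort, so the biggest translation units are handed to the thread pool first and are less likely to become the long tail. A minimal standalone sketch follows (the function and variable names here are illustrative, not the LLVM API):

  // Sort indices by payload size, descending: the largest work items are
  // scheduled first, which shortens the tail on a fixed-size thread pool.
  #include <algorithm>
  #include <cstddef>
  #include <numeric>
  #include <vector>

  std::vector<int> orderBySizeDescending(const std::vector<size_t> &Sizes) {
    std::vector<int> Order(Sizes.size());
    std::iota(Order.begin(), Order.end(), 0);
    std::sort(Order.begin(), Order.end(),
              [&](int L, int R) { return Sizes[L] > Sizes[R]; });
    return Order;
  }

  // Example: sizes {10, 500, 120} give the ordering {1, 2, 0}, so the
  // 500-byte module starts first instead of last.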
However, there's an issue with this patch: the ThinLTO tests fail because things are now out of order. Please see: F13006332: errors.txt <https://reviews.llvm.org/F13006332>
What should I do about that? Some tests could be fixed (by updating the expected ordering), but I am unsure about others.
Could anybody with an expert eye take a look? TIA!
Repository:
rG LLVM Github Monorepo
https://reviews.llvm.org/D87966
Files:
llvm/include/llvm/LTO/LTO.h
llvm/lib/LTO/LTO.cpp
llvm/lib/LTO/ThinLTOCodeGenerator.cpp
Index: llvm/lib/LTO/ThinLTOCodeGenerator.cpp
===================================================================
--- llvm/lib/LTO/ThinLTOCodeGenerator.cpp
+++ llvm/lib/LTO/ThinLTOCodeGenerator.cpp
@@ -1054,19 +1054,11 @@
ModuleToDefinedGVSummaries[ModuleIdentifier];
}
- // Compute the ordering we will process the inputs: the rough heuristic here
- // is to sort them per size so that the largest module get schedule as soon as
- // possible. This is purely a compile-time optimization.
- std::vector<int> ModulesOrdering;
- ModulesOrdering.resize(Modules.size());
- std::iota(ModulesOrdering.begin(), ModulesOrdering.end(), 0);
- llvm::sort(ModulesOrdering, [&](int LeftIndex, int RightIndex) {
- auto LSize =
- Modules[LeftIndex]->getSingleBitcodeModule().getBuffer().size();
- auto RSize =
- Modules[RightIndex]->getSingleBitcodeModule().getBuffer().size();
- return LSize > RSize;
- });
+ std::vector<BitcodeModule *> ModulesVec;
+ ModulesVec.reserve(Modules.size());
+ for (auto &Mod : Modules)
+ ModulesVec.push_back(&Mod->getSingleBitcodeModule());
+ std::vector<int> ModulesOrdering = lto::generateModulesOrdering(ModulesVec);
// Parallel optimizer + codegen
{
Index: llvm/lib/LTO/LTO.cpp
===================================================================
--- llvm/lib/LTO/LTO.cpp
+++ llvm/lib/LTO/LTO.cpp
@@ -1443,10 +1443,17 @@
auto &ModuleMap =
ThinLTO.ModulesToCompile ? *ThinLTO.ModulesToCompile : ThinLTO.ModuleMap;
+ std::vector<BitcodeModule *> ModulesVec;
+ ModulesVec.reserve(ModuleMap.size());
+ for (auto &Mod : ModuleMap)
+ ModulesVec.push_back(&Mod.second);
+ std::vector<int> ModulesOrdering = generateModulesOrdering(ModulesVec);
+
// Tasks 0 through ParallelCodeGenParallelismLevel-1 are reserved for combined
// module and parallel code generation partitions.
unsigned Task = RegularLTO.ParallelCodeGenParallelismLevel;
- for (auto &Mod : ModuleMap) {
+ for (auto IndexCount : ModulesOrdering) {
+ auto &Mod = *(ModuleMap.begin() + IndexCount);
if (Error E = BackendProc->start(Task, Mod.second, ImportLists[Mod.first],
ExportLists[Mod.first],
ResolvedODR[Mod.first], ThinLTO.ModuleMap))
@@ -1495,3 +1502,18 @@
StatsFile->keep();
return std::move(StatsFile);
}
+
+// Compute the ordering in which to process the inputs: the rough heuristic
+// here is to sort them by size so that the largest modules get scheduled as
+// soon as possible. This is purely a compile-time optimization.
+std::vector<int> lto::generateModulesOrdering(ArrayRef<BitcodeModule *> R) {
+ std::vector<int> ModulesOrdering;
+ ModulesOrdering.resize(R.size());
+ std::iota(ModulesOrdering.begin(), ModulesOrdering.end(), 0);
+ llvm::sort(ModulesOrdering, [&](int LeftIndex, int RightIndex) {
+ auto LSize = R[LeftIndex]->getBuffer().size();
+ auto RSize = R[RightIndex]->getBuffer().size();
+ return LSize > RSize;
+ });
+ return ModulesOrdering;
+}
Index: llvm/include/llvm/LTO/LTO.h
===================================================================
--- llvm/include/llvm/LTO/LTO.h
+++ llvm/include/llvm/LTO/LTO.h
@@ -91,6 +91,10 @@
Expected<std::unique_ptr<ToolOutputFile>>
setupStatsFile(StringRef StatsFilename);
+/// Produce an ordering for optimal multi-threaded processing.
+/// Returns indices into the input array.
+std::vector<int> generateModulesOrdering(ArrayRef<BitcodeModule *> R);
+
class LTO;
struct SymbolResolution;
class ThinBackendProc;