[llvm-dev] Proposal/patch: simple parallel LTO code generation

Mehdi Amini via llvm-dev llvm-dev at lists.llvm.org
Wed Aug 12 08:31:46 PDT 2015


Hi Peter,


> On Aug 12, 2015, at 1:52 AM, Peter Collingbourne via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 
> Hi all,
> 
> The most time consuming part of LTO at opt level 1 is by far the backend code
> generator. (As a reminder, LTO opt level 1 runs a minimal set of passes;
> it is most useful where the motivation behind the use of LTO is to deploy
> a transformation that requires whole program visibility such as control
> flow integrity [1], rather than to optimise the program using whole program
> visibility). Code generation is in principle embarrassingly parallel, as it
> can in principle be partitioned at the function granularity level, however
> there are practical issues that need to be solved before we can parallelise
> code generation for LTO.

That is definitely something I wanted to explore, as I’m sure there is low-hanging fruit in this area; I’m glad you gave it a try :)

> The main issue is that the backend currently makes no effort to be thread safe.
> This can be overcome by observing that it is unnecessary for the backend to
> be thread safe if we arrange for each instance of the backend to operate in
> a different LLVMContext.

These two sentences don’t quite fit together for me: I believe one LLVMContext per thread is always needed conceptually.
But that alone won’t “overcome” the parts of LLVM that are not thread-safe: the backend still needs to make the effort of not modifying any global state.
I have the same use case for parallel CodeGen internally, and I recently had to fix a few instances of global mutable state here and there; I think I still have a patch on Phabricator about the nulls() stream.
Something that is not completely clear to me either is whether TargetMachine and friends are intended to be used from different threads. I ended up having one TargetMachine instance per thread (as you do in the gold plugin).
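
For what it’s worth, the shape I ended up with internally is roughly the following. This is only a sketch to illustrate the point: loadPartition() and createTargetMachine() are made-up helpers, and error handling is omitted.

    #include "llvm/ADT/ArrayRef.h"
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/LegacyPassManager.h"
    #include "llvm/IR/Module.h"
    #include "llvm/Support/MemoryBuffer.h"
    #include "llvm/Support/raw_ostream.h"
    #include "llvm/Target/TargetMachine.h"
    #include <functional>
    #include <memory>
    #include <thread>
    #include <vector>
    using namespace llvm;

    // Each thread owns its LLVMContext and its TargetMachine, runs the whole
    // backend on one sub-module, and emits a separate native object file.
    void codegenPartition(MemoryBufferRef Bitcode, raw_pwrite_stream &OS) {
      LLVMContext Ctx;                            // per-thread context, never shared
      std::unique_ptr<Module> M = loadPartition(Bitcode, Ctx);    // hypothetical helper
      std::unique_ptr<TargetMachine> TM = createTargetMachine();  // hypothetical helper
      legacy::PassManager PM;
      TM->addPassesToEmitFile(PM, OS, TargetMachine::CGFT_ObjectFile);
      PM.run(*M);                                 // code generation happens here
    }

    void codegenAllPartitions(ArrayRef<MemoryBufferRef> Partitions,
                              ArrayRef<raw_pwrite_stream *> Outputs) {
      std::vector<std::thread> Threads;
      for (unsigned I = 0, E = Partitions.size(); I != E; ++I)
        Threads.emplace_back(codegenPartition, Partitions[I], std::ref(*Outputs[I]));
      for (std::thread &T : Threads)
        T.join();                                 // the resulting objects go to the linker
    }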

> This is the approach that this patch proposes. The
> LTO code generator partitions the combined LTO module into sub-modules, each
> with its own LLVMContext, and runs the code generator on the sub-modules
> in parallel. (Entities in the combined module are partitioned by taking
> the modulus of the hash of the name of the entity, or its comdat if it has
> one.) The resulting native object files can be combined by the linker in
> the usual way.

You ended up with quite a small patch for what it achieves!
It is still a shame that we have to duplicate the whole module when we partition it. I wonder what the memory overhead of your approach is?
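
For reference, my reading of the partition choice you describe is roughly the following (a hypothetical sketch, not the actual code from your patch):

    #include "llvm/ADT/Hashing.h"
    #include "llvm/ADT/StringRef.h"
    #include "llvm/IR/Comdat.h"
    #include "llvm/IR/GlobalValue.h"

    // Pick a partition for a global by hashing its name, or its comdat's name
    // if it has one, so that all members of a comdat land in the same partition.
    unsigned choosePartition(const llvm::GlobalValue &GV, unsigned NumPartitions) {
      llvm::StringRef Key = GV.getName();
      if (const llvm::Comdat *C = GV.getComdat())
        Key = C->getName();
      return llvm::hash_value(Key) % NumPartitions;
    }

Hashing by the comdat key rather than the symbol name keeps the members of a comdat together, which seems necessary for the resulting objects to link correctly.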

I have no idea if the way you manipulate the linkage would work in all cases; I’m eager to hear what others have to say about it.

For instance I’m not sure why you’re doing this:

   for (Module::const_iterator I = M->begin(), E = M->end(); I != E; ++I) {
     Function *F = cast<Function>(VMap[I]);
+    if (!CloneDefinition(I)) {
+      F->setLinkage(GlobalValue::ExternalLinkage);
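
My guess (and it is only a guess) is that the intent is to keep a function whose definition goes to another partition as a plain declaration in this clone, so that the reference is resolved when the linker combines the native objects; something like this hypothetical illustration of that reading:

    // Keep only a declaration for functions whose body lives in another partition.
    if (!CloneDefinition(I)) {
      F->deleteBody();                              // declaration only in this clone
      F->setLinkage(GlobalValue::ExternalLinkage);  // resolved at final link time
    }

But if that is the idea, I wonder what happens to globals that were originally internal.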



> This approach is reasonably effective. In one experiment, an LTO link of
> Chromium at LTO opt level 1 on an HP Z620 machine took 15m20s without
> parallelism, 8m06s with 4 partitions and 7m27s with 8 partitions.

Is this a machine with 24 cores? The result is already nice; I wonder whether it doesn’t scale better because of Amdahl’s law?
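
A rough back-of-envelope with Amdahl’s law from your numbers (just my estimate):

    S(n) = 1 / ((1 - p) + p/n)                   (p = parallelisable fraction)
    920s serial vs. 447s at n = 8  =>  S(8) ~ 2.06  =>  p ~ 0.59
    which predicts S(4) ~ 1.79, i.e. about 515s (~8m35s) at n = 4, in the
    same ballpark as your measured 8m06s.

If that estimate is roughly right, the ~40% serial fraction caps the speedup around 2.4x no matter how many partitions or cores you use.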

Thanks,

— 
Mehdi


> 
> I've attached a patch with an initial implementation of this idea for the
> gold plugin. If this idea seems reasonable, I'll proceed to clean up the
> patch and send it for review on llvm-commits.
> 
> Thanks,
> -- 
> Peter
> 
> [1] http://clang.llvm.org/docs/ControlFlowIntegrity.html
> <module-splitter.diff>


