[lld] r288606 - Add comments about the use of threads in LLD.

Rui Ueyama via llvm-commits llvm-commits at lists.llvm.org
Mon Dec 5 13:34:41 PST 2016


There might be a way to parallelize it, though. Usually you have multiple
files to read, and you can resolve symbols within these files in
parallel. This step produces a set of files that need to be pulled from
archives, which in turn gives you another set of files that you can read
in parallel. That's the wavefront pattern:
https://software.intel.com/en-us/node/506116.

(I'm not claiming that this is an effective approach for this specific
problem, just pointing out that there is parallelism there whether or
not we can exploit it.)

On Sun, Dec 4, 2016 at 6:50 PM, Sean Silva <chisophugis at gmail.com> wrote:

>
>
> On Sun, Dec 4, 2016 at 2:50 PM, Rui Ueyama <ruiu at google.com> wrote:
>
>> Lazy symbols are created only for defined symbols in object files in
>> static archives. LLD doesn't know the complete picture of symbols in
>> that sense. When you pull a file out of an archive, new symbols
>> appear, and you need to resolve them again. Isn't this a sequential
>> process?
>>
>
> Oh yeah, you're right. The problem of determining which objects are
> pulled from which archives is still a transitive-closure-type
> algorithm, even with LLD's current semantics.
> I can think of a couple of things:
> 1. use a parallel/distributed transitive closure algorithm (probably
> fairly complicated)
> 2. the result of the transitive closure computation (which objects are
> pulled from the archives) is going to be highly cacheable across runs,
> so the simple MapReduce approach will still work well most of the time
> with a simple run-to-run cache. This caching problem is actually very
> similar to the ThinLTO problem of which files are needed for
> cross-module importing to happen.
>
>
> Anyway, this is all just brainstorming.
>
> -- Sean Silva
>
>
>>
>> On Sun, Dec 4, 2016 at 2:03 PM, Sean Silva <chisophugis at gmail.com> wrote:
>>
>>>
>>>
>>> On Sun, Dec 4, 2016 at 7:09 AM, Rui Ueyama <ruiu at google.com> wrote:
>>>
>>>> On Sun, Dec 4, 2016 at 1:55 AM, Sean Silva <chisophugis at gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Sat, Dec 3, 2016 at 3:35 PM, Rui Ueyama via llvm-commits <
>>>>> llvm-commits at lists.llvm.org> wrote:
>>>>>
>>>>>> Author: ruiu
>>>>>> Date: Sat Dec  3 17:35:22 2016
>>>>>> New Revision: 288606
>>>>>>
>>>>>> URL: http://llvm.org/viewvc/llvm-project?rev=288606&view=rev
>>>>>> Log:
>>>>>> Add comments about the use of threads in LLD.
>>>>>>
>>>>>> Modified:
>>>>>>     lld/trunk/ELF/Threads.h
>>>>>>
>>>>>> Modified: lld/trunk/ELF/Threads.h
>>>>>> URL: http://llvm.org/viewvc/llvm-project/lld/trunk/ELF/Threads.h?rev=288606&r1=288605&r2=288606&view=diff
>>>>>> ==============================================================================
>>>>>> --- lld/trunk/ELF/Threads.h (original)
>>>>>> +++ lld/trunk/ELF/Threads.h Sat Dec  3 17:35:22 2016
>>>>>> @@ -6,6 +6,54 @@
>>>>>>  // License. See LICENSE.TXT for details.
>>>>>>  //
>>>>>> //===----------------------------------------------------------------------===//
>>>>>> +//
>>>>>> +// LLD supports threads to distribute workloads to multiple cores. Using
>>>>>> +// multiple cores is most effective when more than one core is idle. At
>>>>>> +// the last step of a build, it is often the case that a linker is the
>>>>>> +// only active process on a computer. So, we are naturally interested in
>>>>>> +// using threads wisely to reduce the latency of delivering results to
>>>>>> +// users.
>>>>>> +//
>>>>>> +// That said, we don't want to do "too clever" things using threads.
>>>>>> +// The correctness of complex multi-threaded algorithms is sometimes
>>>>>> +// extremely hard to justify, and they can easily mess up the entire
>>>>>> +// design.
>>>>>> +//
>>>>>> +// Fortunately, when a linker links large programs (when the link time
>>>>>> +// is most critical), it spends most of its time working on a massive
>>>>>> +// number of small pieces of data of the same kind. Here are examples:
>>>>>> +//
>>>>>> +//  - We have hundreds of thousands of input sections that need to be
>>>>>> +//    copied to a result file at the last step of a link. Once we fix a
>>>>>> +//    file layout, each section can be copied to its destination and its
>>>>>> +//    relocations can be applied independently.
>>>>>> +//
>>>>>> +//  - We have tens of millions of small strings when constructing a
>>>>>> +//    mergeable string section.
>>>>>> +//
>>>>>> +// For cases such as the former, we can just use parallel_for_each
>>>>>> +// instead of std::for_each (or a plain for loop). Because the tasks are
>>>>>> +// completely independent from each other, we can run them in parallel
>>>>>> +// without any coordination between them. That's very easy to understand
>>>>>> +// and justify.
>>>>>> +//
>>>>>> +// For cases such as the latter, we can use parallel algorithms to deal
>>>>>> +// with massive data. We have to write a tailored algorithm for each
>>>>>> +// problem, but the complexity of multi-threading is isolated in a
>>>>>> +// single pass and doesn't affect the linker's overall design.
>>>>>> +//
>>>>>> +// The above approach seems to be working fairly well. As an example,
>>>>>> +// when linking Chromium (output size 1.6 GB), using 4 cores reduces
>>>>>> +// latency to 75% of that of a single core (from 12.66 seconds to 9.55
>>>>>> +// seconds) on my machine. Using 40 cores reduces it to 63% (from 12.66
>>>>>> +// seconds to 7.95 seconds). Because of Amdahl's law, the speedup is
>>>>>> +// not linear, but as you add more cores, it gets faster.
>>>>>> +//
>>>>>> +// On a final note, if you are trying to optimize, keep the axiom
>>>>>> +// "don't guess, measure!" in mind. Some important passes of the linker
>>>>>> +// are not that slow. For example, resolving all symbols is not a very
>>>>>> +// heavy pass, although it would be very hard to parallelize. You want
>>>>>> +// to first identify a slow pass and then optimize it.
>>>>>>
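The speedup numbers quoted in the comment above are roughly consistent
with Amdahl's law for a parallel fraction of about one third (a rough fit
only; a single fixed parallel fraction is an oversimplification of a real
link):

```latex
T(n) = T_1\left((1-p) + \frac{p}{n}\right),\qquad
9.55 = 12.66\left((1-p) + \frac{p}{4}\right)
\;\Rightarrow\; p \approx 0.33
```

Plugging that fit back in predicts $T(40) \approx 12.66\,(0.67 +
0.33/40) \approx 8.6$ s, slightly above the observed 7.95 s, so the
two-parameter model is only an approximation of the measured behavior.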
>>>>>
>>>>> Actually, LLD's symbol resolution (the approach with Lazy symbols for
>>>>> archives) is a perfect example of a MapReduce-type problem, so it is
>>>>> actually quite parallelizable. You basically have a huge number of
>>>>> (SymbolName, SymbolValue) pairs, and you want to coalesce all values
>>>>> with the same SymbolName into pairs
>>>>> (SymbolName, [SymbolValue1, SymbolValue2, ...]); you can then process
>>>>> all the SymbolValueN's to see which is the real definition. This is
>>>>> precisely the problem that MapReduce solves.
>>>>>
>>>>
>>>> How do you handle static archives?
>>>>
>>>
>>> LLD's archive semantics insert lazy symbols for all the archive members,
>>> so it isn't a problem.
>>>
>>> -- Sean Silva
>>>
>>>
>>>>
>>>>
>>>>>
>>>>> (note: I don't necessarily mean that it needs to be done in a
>>>>> distributed fashion, just that the core problem is really one of
>>>>> coalescing values with the same keys.)
>>>>>
>>>>> MapReduce's core abstraction is also a good tool for deduplicating
>>>>> strings.
>>>>>
>>>>>
>>>>> Richard Smith and I were actually brainstorming at the latest LLVM
>>>>> social about whether a distributed linker may be a good fit for the
>>>>> linking problem at Google (but it was just brainstorming; obviously
>>>>> that would be a huge effort, and we would need very serious
>>>>> justification before embarking on it).
>>>>>
>>>>> -- Sean Silva
>>>>>
>>>>>
>>>>>> +//
>>>>>> +//===----------------------------------------------------------------------===//
>>>>>>
>>>>>>  #ifndef LLD_ELF_THREADS_H
>>>>>>  #define LLD_ELF_THREADS_H
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> llvm-commits mailing list
>>>>>> llvm-commits at lists.llvm.org
>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>