[lld] r287946 - Parallelize uncompress() and splitIntoPieces().
Sean Silva via llvm-commits
llvm-commits at lists.llvm.org
Sun Nov 27 14:36:55 PST 2016
On Sun, Nov 27, 2016 at 9:45 AM, Rui Ueyama <ruiu at google.com> wrote:
> On Sat, Nov 26, 2016 at 5:09 PM, Sean Silva <chisophugis at gmail.com> wrote:
>
>> Looking at the perf output is interesting:
>>
>> Before: 2,528,130 page-faults # 0.082 M/sec ( +- 0.47% )
>> After: 2,372,483 page-faults # 0.067 M/sec ( +- 0.36% )
>>
>> Observation: Page faults decreased by over 5% with this change.
>> The only thing I can think of that could cause this is that less overall
>> memory is being allocated from the operating system somehow (maybe malloc
>> can reuse buffers better when this is done in parallel?).
>>
>>
>> Before:
>>
>> 67,352,463,373 stalled-cycles-frontend # 78.94% frontend cycles idle ( +- 3.06% )
>> 44,295,945,493 instructions # 0.52 insns per cycle # 1.52 stalled cycles per insn ( +- 0.44% )
>>
>> After:
>>
>> 79,294,670,125 stalled-cycles-frontend # 80.59% frontend cycles idle ( +- 3.03% )
>> 46,274,151,813 instructions # 0.47 insns per cycle # 1.71 stalled cycles per insn ( +- 0.47% )
>>
>> Observation: LLD is getting very poor processor utilization. The CPU
>> spends most of its time stalled.
>>
>
> We probably need a function that returns the number of idle cores, instead
> of the number of existing cores, to decide whether to enable threading. Even
> if our parallel algorithm achieves the same performance as a non-parallel
> algorithm in theory, in reality there are costs for thread creation,
> coordination, etc. If all CPUs are busy, we should run parallel for loops
> using a single thread instead of multiple threads.
>
>
This is what Apple's GCD does:
https://developer.apple.com/reference/dispatch
I'm not aware of a similar thing for Linux.
Michael, do you know if ConcRT does whole-system balancing for parallelism
on Windows?
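The idle-core heuristic Rui describes could be sketched roughly as below. This is only an illustration, not what lld does: `std::thread::hardware_concurrency` and `getloadavg` are real APIs (the latter is BSD/glibc, not standard C++), but the policy of subtracting the 1-minute load average from the core count is a hypothetical heuristic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <thread>

// Rough estimate of currently idle cores: hardware threads minus the
// 1-minute load average. A caller could fall back to a single-threaded
// loop when this returns 1.
inline unsigned estimateIdleCores() {
  unsigned NumCores = std::max(1u, std::thread::hardware_concurrency());
  double LoadAvg[1] = {0.0};
  if (getloadavg(LoadAvg, 1) != 1)
    return NumCores; // Load unknown; assume all cores are free.
  double Idle = NumCores - LoadAvg[0];
  return Idle < 1.0 ? 1u : static_cast<unsigned>(Idle);
}
```

A load-average-based estimate is coarse (it lags actual CPU usage by tens of seconds), which is part of why system-wide schedulers like GCD are attractive: the OS has fresher information than any one process.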
-- Sean Silva
>
>>
>> -- Sean Silva
>>
>>
>> On Fri, Nov 25, 2016 at 12:05 PM, Rui Ueyama via llvm-commits <
>> llvm-commits at lists.llvm.org> wrote:
>>
>>> Author: ruiu
>>> Date: Fri Nov 25 14:05:08 2016
>>> New Revision: 287946
>>>
>>> URL: http://llvm.org/viewvc/llvm-project?rev=287946&view=rev
>>> Log:
>>> Parallelize uncompress() and splitIntoPieces().
>>>
>>> Uncompressing section contents and splitting mergeable section contents
>>> into smaller chunks are heavy tasks. They scan entire section contents
>>> and do CPU-intensive tasks such as uncompressing zlib-compressed data
>>> or computing a hash value for each section piece.
>>>
>>> Luckily, these tasks are independent of each other, so we can do them
>>> in parallel_for_each. The number of input sections is large (as opposed
>>> to the number of output sections), so there's a lot of parallelism here.
>>>
>>> In fact, the current design of calling uncompress() and splitIntoPieces()
>>> in a batch was chosen with this in mind. Essentially, all we need to
>>> do here is replace `for` with `parallel_for_each`.
>>>
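The core idea of the `for` → `parallel_for_each` swap can be sketched in a few lines: split the range into one contiguous chunk per worker and run the functor on each chunk concurrently. This is a simplified stand-in, not lld's actual implementation (the real `lld/Core/Parallel.h` version uses a task-group/thread-pool design rather than spawning raw threads), and `parallel_for_each_sketch` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Minimal sketch of a parallel for_each over a random-access range:
// divide [Begin, End) into NumThreads contiguous chunks and process
// each chunk on its own thread. F must be safe to run concurrently
// on distinct elements.
template <class Iter, class Func>
void parallel_for_each_sketch(Iter Begin, Iter End, Func F) {
  size_t Len = End - Begin;
  unsigned NumThreads = std::max(1u, std::thread::hardware_concurrency());
  size_t Chunk = (Len + NumThreads - 1) / NumThreads;
  std::vector<std::thread> Workers;
  for (size_t I = 0; I < Len; I += Chunk) {
    Iter ChunkBegin = Begin + I;
    Iter ChunkEnd = Begin + std::min(I + Chunk, Len);
    // Each worker owns a disjoint sub-range, so no locking is needed
    // as long as F only touches its own element.
    Workers.emplace_back([=] { std::for_each(ChunkBegin, ChunkEnd, F); });
  }
  for (std::thread &T : Workers)
    T.join();
}
```

The "independent tasks" property the commit message relies on is exactly what makes the disjoint-chunk split safe: uncompressing or splitting one input section never touches another section's data.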
>>> It seems this patch improves latency significantly if linked programs
>>> contain debug info (which in turn contains lots of mergeable strings).
>>> For example, the latency to link Clang (debug build) improved by 20% on
>>> my machine as shown below. Note that ld.gold took 19.2 seconds to do
>>> the same thing.
>>>
>>> Before:
>>> 30801.782712 task-clock (msec) # 3.652 CPUs utilized ( +- 2.59% )
>>> 104,084 context-switches # 0.003 M/sec ( +- 1.02% )
>>> 5,063 cpu-migrations # 0.164 K/sec ( +- 13.66% )
>>> 2,528,130 page-faults # 0.082 M/sec ( +- 0.47% )
>>> 85,317,809,130 cycles # 2.770 GHz ( +- 2.62% )
>>> 67,352,463,373 stalled-cycles-frontend # 78.94% frontend cycles idle ( +- 3.06% )
>>> <not supported> stalled-cycles-backend
>>> 44,295,945,493 instructions # 0.52 insns per cycle # 1.52 stalled cycles per insn ( +- 0.44% )
>>> 8,572,384,877 branches # 278.308 M/sec ( +- 0.66% )
>>> 141,806,726 branch-misses # 1.65% of all branches ( +- 0.13% )
>>>
>>> 8.433424003 seconds time elapsed ( +- 1.20% )
>>>
>>> After:
>>> 35523.764575 task-clock (msec) # 5.265 CPUs utilized ( +- 2.67% )
>>> 159,107 context-switches # 0.004 M/sec ( +- 0.48% )
>>> 8,123 cpu-migrations # 0.229 K/sec ( +- 23.34% )
>>> 2,372,483 page-faults # 0.067 M/sec ( +- 0.36% )
>>> 98,395,342,152 cycles # 2.770 GHz ( +- 2.62% )
>>> 79,294,670,125 stalled-cycles-frontend # 80.59% frontend cycles idle ( +- 3.03% )
>>> <not supported> stalled-cycles-backend
>>> 46,274,151,813 instructions # 0.47 insns per cycle # 1.71 stalled cycles per insn ( +- 0.47% )
>>> 8,987,621,670 branches # 253.003 M/sec ( +- 0.60% )
>>> 148,900,624 branch-misses # 1.66% of all branches ( +- 0.27% )
>>>
>>> 6.747548004 seconds time elapsed ( +- 0.40% )
>>>
>>> Modified:
>>> lld/trunk/ELF/Driver.cpp
>>> lld/trunk/ELF/InputSection.cpp
>>>
>>> Modified: lld/trunk/ELF/Driver.cpp
>>> URL: http://llvm.org/viewvc/llvm-project/lld/trunk/ELF/Driver.cpp?rev=287946&r1=287945&r2=287946&view=diff
>>> ==============================================================================
>>> --- lld/trunk/ELF/Driver.cpp (original)
>>> +++ lld/trunk/ELF/Driver.cpp Fri Nov 25 14:05:08 2016
>>> @@ -20,6 +20,7 @@
>>> #include "Target.h"
>>> #include "Writer.h"
>>> #include "lld/Config/Version.h"
>>> +#include "lld/Core/Parallel.h"
>>> #include "lld/Driver/Driver.h"
>>> #include "llvm/ADT/StringExtras.h"
>>> #include "llvm/ADT/StringSwitch.h"
>>> @@ -800,14 +801,15 @@ template <class ELFT> void LinkerDriver:
>>>
>>> // MergeInputSection::splitIntoPieces needs to be called before
>>> // any call of MergeInputSection::getOffset. Do that.
>>> - for (InputSectionBase<ELFT> *S : Symtab.Sections) {
>>> - if (!S->Live)
>>> - continue;
>>> - if (S->Compressed)
>>> - S->uncompress();
>>> - if (auto *MS = dyn_cast<MergeInputSection<ELFT>>(S))
>>> - MS->splitIntoPieces();
>>> - }
>>> + parallel_for_each(Symtab.Sections.begin(), Symtab.Sections.end(),
>>> + [](InputSectionBase<ELFT> *S) {
>>> + if (!S->Live)
>>> + return;
>>> + if (S->Compressed)
>>> + S->uncompress();
>>> + if (auto *MS = dyn_cast<MergeInputSection<ELFT>>(S))
>>> + MS->splitIntoPieces();
>>> + });
>>>
>>> // Write the result to the file.
>>> writeResult<ELFT>();
>>>
>>> Modified: lld/trunk/ELF/InputSection.cpp
>>> URL: http://llvm.org/viewvc/llvm-project/lld/trunk/ELF/InputSection.cpp?rev=287946&r1=287945&r2=287946&view=diff
>>> ==============================================================================
>>> --- lld/trunk/ELF/InputSection.cpp (original)
>>> +++ lld/trunk/ELF/InputSection.cpp Fri Nov 25 14:05:08 2016
>>> @@ -22,6 +22,7 @@
>>>
>>> #include "llvm/Support/Compression.h"
>>> #include "llvm/Support/Endian.h"
>>> +#include <mutex>
>>>
>>> using namespace llvm;
>>> using namespace llvm::ELF;
>>> @@ -160,6 +161,8 @@ InputSectionBase<ELFT>::getRawCompressed
>>> return {Data.slice(sizeof(*Hdr)), read64be(Hdr->Size)};
>>> }
>>>
>>> +// Uncompress section contents. Note that this function is called
>>> +// from parallel_for_each, so it must be thread-safe.
>>> template <class ELFT> void InputSectionBase<ELFT>::uncompress() {
>>> if (!zlib::isAvailable())
>>> fatal(toString(this) +
>>> @@ -179,7 +182,12 @@ template <class ELFT> void InputSectionB
>>> std::tie(Buf, Size) = getRawCompressedData(Data);
>>>
>>> // Uncompress Buf.
>>> - char *OutputBuf = BAlloc.Allocate<char>(Size);
>>> + char *OutputBuf;
>>> + {
>>> + static std::mutex Mu;
>>> + std::lock_guard<std::mutex> Lock(Mu);
>>> + OutputBuf = BAlloc.Allocate<char>(Size);
>>> + }
>>> if (zlib::uncompress(toStringRef(Buf), OutputBuf, Size) !=
>>> zlib::StatusOK)
>>> fatal(toString(this) + ": error while uncompressing section");
>>> Data = ArrayRef<uint8_t>((uint8_t *)OutputBuf, Size);
>>> @@ -746,6 +754,12 @@ MergeInputSection<ELFT>::MergeInputSecti
>>> StringRef Name)
>>> : InputSectionBase<ELFT>(F, Header, Name,
>>> InputSectionBase<ELFT>::Merge) {}
>>>
>>> +// This function is called after we obtain a complete list of input sections
>>> +// that need to be linked. It is responsible for splitting section contents
>>> +// into small chunks for further processing.
>>> +//
>>> +// Note that this function is called from parallel_for_each. This must be
>>> +// thread-safe (i.e. no memory allocation from the pools).
>>> template <class ELFT> void MergeInputSection<ELFT>::splitIntoPieces() {
>>> ArrayRef<uint8_t> Data = this->Data;
>>> uintX_t EntSize = this->Entsize;
>>>
>>>
>>> _______________________________________________
>>> llvm-commits mailing list
>>> llvm-commits at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>>>
>>
>>
>
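The thread-safety pattern the patch uses in uncompress() — a static mutex guarding allocation from a shared bump allocator, with the heavy work done outside the critical section — can be illustrated in isolation. The `BumpAllocator` below is a toy stand-in for llvm's BumpPtrAllocator, and `allocateLocked` is a hypothetical name; only the locking pattern mirrors the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Toy stand-in for a bump-pointer allocator: hands out buffers from
// heap blocks and is NOT thread-safe by itself.
class BumpAllocator {
  std::vector<std::vector<char>> Blocks;
public:
  char *allocate(size_t Size) {
    Blocks.emplace_back(Size);
    return Blocks.back().data();
  }
};

// Mirrors the pattern in the patch: hold a lock only for the brief,
// non-thread-safe allocation, then let callers run the expensive work
// (e.g. zlib uncompression into the buffer) outside the critical section.
inline char *allocateLocked(BumpAllocator &Alloc, size_t Size) {
  static std::mutex Mu;
  std::lock_guard<std::mutex> Lock(Mu);
  return Alloc.allocate(Size);
}
```

Keeping the critical section this small is what preserves the parallelism: threads serialize only on the pointer bump, not on the CPU-intensive uncompress or hash work.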