[lld] r287946 - Parallelize uncompress() and splitIntoPieces().

Sun Nov 27 09:45:34 PST 2016

On Sat, Nov 26, 2016 at 5:09 PM, Sean Silva <chisophugis at gmail.com> wrote:

> Looking at the perf output is interesting:
>
> Before:        2,528,130 page-faults               #    0.082 M/sec    ( +-  0.47% )
> After:         2,372,483 page-faults               #    0.067 M/sec    ( +-  0.36% )
>
> Observation: Page faults decreased by about 6% with this change.
> The only thing I can think of that could cause this is that less overall
> memory is being allocated from the operating system somehow (maybe malloc
> can reuse buffers better when this is done in parallel?).
>
>
> Before:
>
>   67,352,463,373 stalled-cycles-frontend   #   78.94% frontend cycles idle     ( +-  3.06% )
>   44,295,945,493 instructions              #    0.52  insns per cycle
>                                            #    1.52  stalled cycles per insn  ( +-  0.44% )
> After:
>
>   79,294,670,125 stalled-cycles-frontend   #   80.59% frontend cycles idle     ( +-  3.03% )
>   46,274,151,813 instructions              #    0.47  insns per cycle
>                                            #    1.71  stalled cycles per insn  ( +-  0.47% )
>
> Observation: LLD is getting very poor processor utilization. The CPU
> spends most of its time stalled.
>

We probably need a function that returns the number of idle cores instead
of the number of existing cores, to enable/disable threading. Even if our
parallel algorithm matches the performance of a non-parallel algorithm in
theory, in practice there are costs for thread creation, coordination,
etc. If all CPUs are busy, we should run parallel for loops on a single
thread instead of multiple threads.
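
As a rough illustration of what such a function might look like (a minimal
sketch only, assuming a POSIX-like system that provides getloadavg();
estimateIdleCores and getIdleCores are hypothetical names, not existing LLD
or LLVM APIs, and the 1-minute load average is only a coarse proxy for the
number of busy cores):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdlib> // getloadavg(): BSD/glibc extension, not ISO C++
#include <thread>

// Hypothetical helper: given the number of cores and a load average,
// estimate how many cores are idle. Always returns at least 1 so that
// the caller can fall back to a single thread.
unsigned estimateIdleCores(unsigned TotalCores, double LoadAvg) {
  if (TotalCores == 0)
    TotalCores = 1; // hardware_concurrency() may return 0 if unknown
  double Busy = std::ceil(std::max(LoadAvg, 0.0));
  if (Busy >= TotalCores)
    return 1;
  return TotalCores - static_cast<unsigned>(Busy);
}

// Hypothetical helper: query the system and return the estimate.
// Assumption: if the load average is unavailable, use all cores.
unsigned getIdleCores() {
  unsigned Total = std::thread::hardware_concurrency();
  double Load[1];
  if (getloadavg(Load, 1) != 1)
    return std::max(Total, 1u);
  return estimateIdleCores(Total, Load[0]);
}
```

A caller could then run a plain `for` loop when getIdleCores() returns 1 and
use the parallel version otherwise.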

>
> -- Sean Silva
>
>
> On Fri, Nov 25, 2016 at 12:05 PM, Rui Ueyama via llvm-commits <
> llvm-commits at lists.llvm.org> wrote:
>
>> Author: ruiu
>> Date: Fri Nov 25 14:05:08 2016
>> New Revision: 287946
>>
>> URL: http://llvm.org/viewvc/llvm-project?rev=287946&view=rev
>> Log:
>> Parallelize uncompress() and splitIntoPieces().
>>
>> Uncompressing section contents and splitting mergeable section contents
>> into smaller chunks are heavy tasks. They scan entire section contents
>> and do CPU-intensive work such as uncompressing zlib-compressed data
>> or computing a hash value for each section piece.
>>
>> Luckily, these tasks are independent of each other, so we can do them
>> with parallel_for_each. The number of input sections is large (as opposed
>> to the number of output sections), so there is a lot of parallelism here.
>>
>> In fact, the current design of calling uncompress() and splitIntoPieces()
>> in a batch was chosen with this in mind. Basically, all we need to do
>> here is replace `for` with `parallel_for_each`.
>>
>> It seems this patch improves latency significantly if linked programs
>> contain debug info (which in turn contains lots of mergeable strings).
>> For example, the latency to link Clang (debug build) improved by 20% on
>> my machine as shown below. Note that ld.gold took 19.2 seconds to do
>> the same thing.
>>
>> Before:
>>     30801.782712 task-clock (msec)         #    3.652 CPUs utilized            ( +-  2.59% )
>>          104,084 context-switches          #    0.003 M/sec                    ( +-  1.02% )
>>            5,063 cpu-migrations            #    0.164 K/sec                    ( +- 13.66% )
>>        2,528,130 page-faults               #    0.082 M/sec                    ( +-  0.47% )
>>   85,317,809,130 cycles                    #    2.770 GHz                      ( +-  2.62% )
>>   67,352,463,373 stalled-cycles-frontend   #   78.94% frontend cycles idle     ( +-  3.06% )
>>  <not supported> stalled-cycles-backend
>>   44,295,945,493 instructions              #    0.52  insns per cycle
>>                                            #    1.52  stalled cycles per insn  ( +-  0.44% )
>>    8,572,384,877 branches                  #  278.308 M/sec                    ( +-  0.66% )
>>      141,806,726 branch-misses             #    1.65% of all branches          ( +-  0.13% )
>>
>>      8.433424003 seconds time elapsed                                          ( +-  1.20% )
>>
>> After:
>>     35523.764575 task-clock (msec)         #    5.265 CPUs utilized            ( +-  2.67% )
>>          159,107 context-switches          #    0.004 M/sec                    ( +-  0.48% )
>>            8,123 cpu-migrations            #    0.229 K/sec                    ( +- 23.34% )
>>        2,372,483 page-faults               #    0.067 M/sec                    ( +-  0.36% )
>>   98,395,342,152 cycles                    #    2.770 GHz                      ( +-  2.62% )
>>   79,294,670,125 stalled-cycles-frontend   #   80.59% frontend cycles idle     ( +-  3.03% )
>>  <not supported> stalled-cycles-backend
>>   46,274,151,813 instructions              #    0.47  insns per cycle
>>                                            #    1.71  stalled cycles per insn  ( +-  0.47% )
>>    8,987,621,670 branches                  #  253.003 M/sec                    ( +-  0.60% )
>>      148,900,624 branch-misses             #    1.66% of all branches          ( +-  0.27% )
>>
>>      6.747548004 seconds time elapsed                                          ( +-  0.40% )
>>
>> Modified:
>>     lld/trunk/ELF/Driver.cpp
>>     lld/trunk/ELF/InputSection.cpp
>>
>> Modified: lld/trunk/ELF/Driver.cpp
>> URL: http://llvm.org/viewvc/llvm-project/lld/trunk/ELF/Driver.cpp?rev=287946&r1=287945&r2=287946&view=diff
>> ==============================================================================
>> --- lld/trunk/ELF/Driver.cpp (original)
>> +++ lld/trunk/ELF/Driver.cpp Fri Nov 25 14:05:08 2016
>> @@ -20,6 +20,7 @@
>>  #include "Target.h"
>>  #include "Writer.h"
>>  #include "lld/Config/Version.h"
>> +#include "lld/Core/Parallel.h"
>>  #include "lld/Driver/Driver.h"
>>  #include "llvm/ADT/StringExtras.h"
>>  #include "llvm/ADT/StringSwitch.h"
>> @@ -800,14 +801,15 @@ template <class ELFT> void LinkerDriver:
>>
>>    // MergeInputSection::splitIntoPieces needs to be called before
>>    // any call of MergeInputSection::getOffset. Do that.
>> -  for (InputSectionBase<ELFT> *S : Symtab.Sections) {
>> -    if (!S->Live)
>> -      continue;
>> -    if (S->Compressed)
>> -      S->uncompress();
>> -    if (auto *MS = dyn_cast<MergeInputSection<ELFT>>(S))
>> -      MS->splitIntoPieces();
>> -  }
>> +  parallel_for_each(Symtab.Sections.begin(), Symtab.Sections.end(),
>> +                    [](InputSectionBase<ELFT> *S) {
>> +                      if (!S->Live)
>> +                        return;
>> +                      if (S->Compressed)
>> +                        S->uncompress();
>> +                      if (auto *MS = dyn_cast<MergeInputSection<ELFT>>(S))
>> +                        MS->splitIntoPieces();
>> +                    });
>>
>>    // Write the result to the file.
>>    writeResult<ELFT>();
>>
>> Modified: lld/trunk/ELF/InputSection.cpp
>> URL: http://llvm.org/viewvc/llvm-project/lld/trunk/ELF/InputSection.cpp?rev=287946&r1=287945&r2=287946&view=diff
>> ==============================================================================
>> --- lld/trunk/ELF/InputSection.cpp (original)
>> +++ lld/trunk/ELF/InputSection.cpp Fri Nov 25 14:05:08 2016
>> @@ -22,6 +22,7 @@
>>
>>  #include "llvm/Support/Compression.h"
>>  #include "llvm/Support/Endian.h"
>> +#include <mutex>
>>
>>  using namespace llvm;
>>  using namespace llvm::ELF;
>> @@ -160,6 +161,8 @@ InputSectionBase<ELFT>::getRawCompressed
>>    return {Data.slice(sizeof(*Hdr)), read64be(Hdr->Size)};
>>  }
>>
>> +// Uncompress section contents. Note that this function is called
>> +// from parallel_for_each, so it must be thread-safe.
>>  template <class ELFT> void InputSectionBase<ELFT>::uncompress() {
>>    if (!zlib::isAvailable())
>>      fatal(toString(this) +
>> @@ -179,7 +182,12 @@ template <class ELFT> void InputSectionB
>>      std::tie(Buf, Size) = getRawCompressedData(Data);
>>
>>    // Uncompress Buf.
>> -  char *OutputBuf = BAlloc.Allocate<char>(Size);
>> +  char *OutputBuf;
>> +  {
>> +    static std::mutex Mu;
>> +    std::lock_guard<std::mutex> Lock(Mu);
>> +    OutputBuf = BAlloc.Allocate<char>(Size);
>> +  }
>>    if (zlib::uncompress(toStringRef(Buf), OutputBuf, Size) != zlib::StatusOK)
>>      fatal(toString(this) + ": error while uncompressing section");
>>    Data = ArrayRef<uint8_t>((uint8_t *)OutputBuf, Size);
>> @@ -746,6 +754,12 @@ MergeInputSection<ELFT>::MergeInputSecti
>>                                             StringRef Name)
>>      : InputSectionBase<ELFT>(F, Header, Name, InputSectionBase<ELFT>::Merge) {}
>>
>> +// This function is called after we obtain a complete list of input sections
>> +// that need to be linked. This is responsible to split section contents
>> +// into small chunks for further processing.
>> +//
>> +// Note that this function is called from parallel_for_each. This must be
>> +// thread-safe (i.e. no memory allocation from the pools).
>>  template <class ELFT> void MergeInputSection<ELFT>::splitIntoPieces() {
>>    ArrayRef<uint8_t> Data = this->Data;
>>    uintX_t EntSize = this->Entsize;
>>
>>
>> _______________________________________________
>> llvm-commits mailing list
>> llvm-commits at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>>
>
>