[lld] r287946 - Parallelize uncompress() and splitIntoPieces().

Sean Silva via llvm-commits llvm-commits at lists.llvm.org
Wed Nov 30 01:29:16 PST 2016


On Mon, Nov 28, 2016 at 11:52 AM, David Blaikie via llvm-commits <
llvm-commits at lists.llvm.org> wrote:

> You mean why are some debug sections compressed and not others? LLVM
> compresses any section where the compressed size is smaller than the
> uncompressed size (i.e. we don't compress really small sections where the
> overhead is greater than the benefit).
>
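(For reference, the heuristic described above boils down to roughly the
sketch below - not the actual MC implementation, just the decision rule,
written against LLVM's zlib wrapper of this vintage:)

// A minimal sketch (not the actual MC code): compress, then keep the
// compressed bytes only if they actually end up smaller.
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Compression.h"

bool shouldUseCompressed(llvm::StringRef Uncompressed,
                         llvm::SmallVectorImpl<char> &Compressed) {
  if (llvm::zlib::compress(Uncompressed, Compressed) != llvm::zlib::StatusOK)
    return false;
  // Tiny sections fail this check because the compression header and
  // dictionary overhead outweigh any savings.
  return Compressed.size() < Uncompressed.size();
}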
> Or if you mean: why do we compress debug sections but not non-debug
> sections? Probably because the people who cared about debug info
> implemented it & no one looked at the overall benefit. And also, probably,
> compressing the (very large) debug sections was worth the compute overhead
> of compressing/decompressing them.
>

I actually wonder about this. For LLD, the cost of decompressing is likely
to be quite high. LLD already spends a huge amount of its time on debug
binaries doing string merging. And SHF_COMPRESSED uses zlib, which can't
decompress super fast (about 120MB/s on the output side in a quick
measurement I just did; DRAM bandwidth is about 100x that). At that rate,
decompressing, say, a gigabyte of debug sections costs on the order of 8
seconds of CPU time, which is comparable to the whole link. So there may be
a net loss in linking performance when the input binaries are hot in the
disk cache.

One way to mitigate this would be to use something like lz4 (which is
designed for very fast decompression) instead of zlib.
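To make that concrete, here is a rough sketch of the two decompression
calls being compared - plain zlib and liblz4 C APIs, nothing lld-specific;
the buffer names and sizes are made up:

#include <lz4.h>   // liblz4
#include <zlib.h>  // zlib

// zlib/DEFLATE: what SHF_COMPRESSED (ELFCOMPRESS_ZLIB) sections use today.
bool inflateZlib(const unsigned char *In, size_t InSize,
                 unsigned char *Out, size_t OutSize) {
  uLongf DestLen = OutSize;
  return uncompress(Out, &DestLen, In, InSize) == Z_OK && DestLen == OutSize;
}

// lz4: typically decompresses several times faster, at the cost of a worse
// compression ratio, but it is not a format ELF currently defines for
// compressed sections.
bool inflateLz4(const char *In, size_t InSize, char *Out, size_t OutSize) {
  int N = LZ4_decompress_safe(In, Out, (int)InSize, (int)OutSize);
  return N >= 0 && (size_t)N == OutSize;
}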

-- Sean Silva


>
> So if we were to take a more holistic approach we might find that
> compressing particularly 'large' sections is what's important - regardless
> of whether they're debug or non-debug sections.
>
> On Mon, Nov 28, 2016 at 11:35 AM Rui Ueyama <ruiu at google.com> wrote:
>
>> This may be a silly question, but why do we compress some sections and
>> not others? What are the criteria?
>>
>> On Mon, Nov 28, 2016 at 9:45 AM, David Blaikie <dblaikie at gmail.com>
>> wrote:
>>
>>
>>
>> On Mon, Nov 28, 2016 at 9:26 AM Rui Ueyama <ruiu at google.com> wrote:
>>
>> On Mon, Nov 28, 2016 at 9:21 AM, David Blaikie <dblaikie at gmail.com>
>> wrote:
>>
>> tangentially related to compressed sections: Currently, I take it, lld
>> decompresses all compressed input sections into memory before producing
>> output, yes? Is there any chance in the future that lld might use a more
>> streaming approach to reduce memory overhead? (i.e.: defer decompressing
>> until output - and decompress/write out (possibly recompressing) in
>> chunks, rather than necessarily whole sections or all sections)
>>
>>
>> Interesting idea. LLD currently decompresses all live (non-gc'ed)
>> sections in memory because they may contain mergeable strings or
>> (theoretically) EH frames that need to be handled specially. But for
>> regular sections, we could indeed use a streaming approach. It would not
>> only save memory but also improve performance, because it eliminates one
>> extra memory copy (we could write uncompressed data directly to the
>> output buffer).
>>
>>
>> Yep - even for strings, you could read/process the strings into the
>> stringmap in chunks, rather than reading the whole buffer in and then
>> inserting them all. (It also improves memory locality - you don't read it
>> all in and then go back and start cache-missing on the beginning of the
>> buffer to process it.)
>>
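A very rough sketch of that chunked approach (not lld code; NextChunk and
Insert are hypothetical callbacks, and it assumes plain null-terminated
strings, i.e. SHF_STRINGS with entsize 1):

#include "llvm/ADT/STLExtras.h"   // llvm::function_ref
#include "llvm/ADT/StringRef.h"
#include <string>

// Walk decompressed data chunk by chunk, carrying an unterminated tail over
// to the next chunk instead of materializing the whole section first.
void forEachString(llvm::function_ref<llvm::StringRef()> NextChunk,
                   llvm::function_ref<void(llvm::StringRef)> Insert) {
  std::string Carry; // partial string spanning a chunk boundary
  for (llvm::StringRef Chunk = NextChunk(); !Chunk.empty();
       Chunk = NextChunk()) {
    while (!Chunk.empty()) {
      size_t Nul = Chunk.find('\0');
      if (Nul == llvm::StringRef::npos) {
        Carry.append(Chunk.data(), Chunk.size()); // string continues
        break;
      }
      Carry.append(Chunk.data(), Nul);
      Insert(Carry); // complete string; the merger must hash/copy it now
      Carry.clear();
      Chunk = Chunk.drop_front(Nul + 1);
    }
  }
}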
>> It comes to mind because of the memory optimizations I've been
>> doing/looking at in the DWP tool.
>>
>>
>>
>>
>> On Fri, Nov 25, 2016 at 12:15 PM Rui Ueyama via llvm-commits <
>> llvm-commits at lists.llvm.org> wrote:
>>
>> Author: ruiu
>> Date: Fri Nov 25 14:05:08 2016
>> New Revision: 287946
>>
>> URL: http://llvm.org/viewvc/llvm-project?rev=287946&view=rev
>> Log:
>> Parallelize uncompress() and splitIntoPieces().
>>
>> Uncompressing section contents and splitting mergeable section contents
>> into smaller chunks are heavy tasks. They scan entire section contents
>> and do CPU-intensive work such as uncompressing zlib-compressed data
>> or computing a hash value for each section piece.
>>
>> Luckily, these tasks are independent of each other, so we can do them
>> with parallel_for_each. The number of input sections is large (as opposed
>> to the number of output sections), so there's a lot of parallelism here.
>>
>> Actually, the current design of calling uncompress() and splitIntoPieces()
>> in batch was chosen with this in mind. Basically, all we need to do here
>> is replace `for` with `parallel_for_each`.
>>
>> It seems this patch improves latency significantly if linked programs
>> contain debug info (which in turn contains lots of mergeable strings).
>> For example, the latency to link Clang (debug build) improved by 20% on
>> my machine, as shown below. Note that ld.gold took 19.2 seconds to do
>> the same thing.
>>
>> Before:
>>     30801.782712 task-clock (msec)         #    3.652 CPUs utilized            ( +-  2.59% )
>>          104,084 context-switches          #    0.003 M/sec                    ( +-  1.02% )
>>            5,063 cpu-migrations            #    0.164 K/sec                    ( +- 13.66% )
>>        2,528,130 page-faults               #    0.082 M/sec                    ( +-  0.47% )
>>   85,317,809,130 cycles                    #    2.770 GHz                      ( +-  2.62% )
>>   67,352,463,373 stalled-cycles-frontend   #   78.94% frontend cycles idle     ( +-  3.06% )
>>  <not supported> stalled-cycles-backend
>>   44,295,945,493 instructions              #    0.52  insns per cycle
>>                                            #    1.52  stalled cycles per insn  ( +-  0.44% )
>>    8,572,384,877 branches                  #  278.308 M/sec                    ( +-  0.66% )
>>      141,806,726 branch-misses             #    1.65% of all branches          ( +-  0.13% )
>>
>>      8.433424003 seconds time elapsed                                          ( +-  1.20% )
>>
>> After:
>>     35523.764575 task-clock (msec)         #    5.265 CPUs utilized            ( +-  2.67% )
>>          159,107 context-switches          #    0.004 M/sec                    ( +-  0.48% )
>>            8,123 cpu-migrations            #    0.229 K/sec                    ( +- 23.34% )
>>        2,372,483 page-faults               #    0.067 M/sec                    ( +-  0.36% )
>>   98,395,342,152 cycles                    #    2.770 GHz                      ( +-  2.62% )
>>   79,294,670,125 stalled-cycles-frontend   #   80.59% frontend cycles idle     ( +-  3.03% )
>>  <not supported> stalled-cycles-backend
>>   46,274,151,813 instructions              #    0.47  insns per cycle
>>                                            #    1.71  stalled cycles per insn  ( +-  0.47% )
>>    8,987,621,670 branches                  #  253.003 M/sec                    ( +-  0.60% )
>>      148,900,624 branch-misses             #    1.66% of all branches          ( +-  0.27% )
>>
>>      6.747548004 seconds time elapsed                                          ( +-  0.40% )
>>
>> Modified:
>>     lld/trunk/ELF/Driver.cpp
>>     lld/trunk/ELF/InputSection.cpp
>>
>> Modified: lld/trunk/ELF/Driver.cpp
>> URL: http://llvm.org/viewvc/llvm-project/lld/trunk/ELF/Driver.cpp?rev=287946&r1=287945&r2=287946&view=diff
>> ==============================================================================
>> --- lld/trunk/ELF/Driver.cpp (original)
>> +++ lld/trunk/ELF/Driver.cpp Fri Nov 25 14:05:08 2016
>> @@ -20,6 +20,7 @@
>>  #include "Target.h"
>>  #include "Writer.h"
>>  #include "lld/Config/Version.h"
>> +#include "lld/Core/Parallel.h"
>>  #include "lld/Driver/Driver.h"
>>  #include "llvm/ADT/StringExtras.h"
>>  #include "llvm/ADT/StringSwitch.h"
>> @@ -800,14 +801,15 @@ template <class ELFT> void LinkerDriver:
>>
>>    // MergeInputSection::splitIntoPieces needs to be called before
>>    // any call of MergeInputSection::getOffset. Do that.
>> -  for (InputSectionBase<ELFT> *S : Symtab.Sections) {
>> -    if (!S->Live)
>> -      continue;
>> -    if (S->Compressed)
>> -      S->uncompress();
>> -    if (auto *MS = dyn_cast<MergeInputSection<ELFT>>(S))
>> -      MS->splitIntoPieces();
>> -  }
>> +  parallel_for_each(Symtab.Sections.begin(), Symtab.Sections.end(),
>> +                    [](InputSectionBase<ELFT> *S) {
>> +                      if (!S->Live)
>> +                        return;
>> +                      if (S->Compressed)
>> +                        S->uncompress();
>> +                      if (auto *MS = dyn_cast<MergeInputSection<ELFT>>(S))
>> +                        MS->splitIntoPieces();
>> +                    });
>>
>>    // Write the result to the file.
>>    writeResult<ELFT>();
>>
>> Modified: lld/trunk/ELF/InputSection.cpp
>> URL: http://llvm.org/viewvc/llvm-project/lld/trunk/ELF/InputSection.cpp?rev=287946&r1=287945&r2=287946&view=diff
>> ==============================================================================
>> --- lld/trunk/ELF/InputSection.cpp (original)
>> +++ lld/trunk/ELF/InputSection.cpp Fri Nov 25 14:05:08 2016
>> @@ -22,6 +22,7 @@
>>
>>  #include "llvm/Support/Compression.h"
>>  #include "llvm/Support/Endian.h"
>> +#include <mutex>
>>
>>  using namespace llvm;
>>  using namespace llvm::ELF;
>> @@ -160,6 +161,8 @@ InputSectionBase<ELFT>::getRawCompressed
>>    return {Data.slice(sizeof(*Hdr)), read64be(Hdr->Size)};
>>  }
>>
>> +// Uncompress section contents. Note that this function is called
>> +// from parallel_for_each, so it must be thread-safe.
>>  template <class ELFT> void InputSectionBase<ELFT>::uncompress() {
>>    if (!zlib::isAvailable())
>>      fatal(toString(this) +
>> @@ -179,7 +182,12 @@ template <class ELFT> void InputSectionB
>>      std::tie(Buf, Size) = getRawCompressedData(Data);
>>
>>    // Uncompress Buf.
>> -  char *OutputBuf = BAlloc.Allocate<char>(Size);
>> +  char *OutputBuf;
>> +  {
>> +    static std::mutex Mu;
>> +    std::lock_guard<std::mutex> Lock(Mu);
>> +    OutputBuf = BAlloc.Allocate<char>(Size);
>> +  }
>>   if (zlib::uncompress(toStringRef(Buf), OutputBuf, Size) != zlib::StatusOK)
>>      fatal(toString(this) + ": error while uncompressing section");
>>    Data = ArrayRef<uint8_t>((uint8_t *)OutputBuf, Size);
>> @@ -746,6 +754,12 @@ MergeInputSection<ELFT>::MergeInputSecti
>>                                             StringRef Name)
>>     : InputSectionBase<ELFT>(F, Header, Name, InputSectionBase<ELFT>::Merge) {}
>>
>> +// This function is called after we obtain a complete list of input sections
>> +// that need to be linked. It is responsible for splitting section contents
>> +// into small chunks for further processing.
>> +//
>> +// Note that this function is called from parallel_for_each. This must be
>> +// thread-safe (i.e. no memory allocation from the pools).
>>  template <class ELFT> void MergeInputSection<ELFT>::splitIntoPieces() {
>>    ArrayRef<uint8_t> Data = this->Data;
>>    uintX_t EntSize = this->Entsize;
>>
>>
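One note on the mutex added in uncompress() above: BumpPtrAllocator (behind
BAlloc) is not thread-safe, so concurrent uncompress() calls have to
serialize just the allocation while the actual decompression stays parallel.
A standalone sketch of the same pattern (names are illustrative, not lld's):

#include "llvm/Support/Allocator.h"
#include <mutex>

llvm::BumpPtrAllocator Alloc; // shared, not thread-safe by itself

char *allocateShared(size_t Size) {
  static std::mutex Mu;
  std::lock_guard<std::mutex> Lock(Mu); // serialize only the allocation
  return Alloc.Allocate<char>(Size);    // decompression into it runs unlocked
}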
>> _______________________________________________
>> llvm-commits mailing list
>> llvm-commits at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>>
>>
>>
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>
>