[lld] r287946 - Parallelize uncompress() and splitIntoPieces().

Rui Ueyama via llvm-commits llvm-commits at lists.llvm.org
Wed Nov 30 10:43:34 PST 2016


On Wed, Nov 30, 2016 at 1:29 AM, Sean Silva <chisophugis at gmail.com> wrote:

>
>
> On Mon, Nov 28, 2016 at 11:52 AM, David Blaikie via llvm-commits <
> llvm-commits at lists.llvm.org> wrote:
>
>> You mean why are some debug sections compressed and not others? LLVM
>> compresses any where the compressed size is smaller than the uncompressed
>> size (ie: we don't compress really small sections where the overhead is
>> greater than the benefit)
>>
>> Or if you mean: why do we compress debug sections but not non-debug
>> sections? Probably because people who cared about debug info implemented it
>> & no one looked at the overall benefit. It is also likely that the benefit
>> of compressing the (very large) debug sections was worth the compute
>> overhead of compressing/decompressing.
>>
>
> I actually wonder about this. For LLD, the cost of decompressing is likely
> to be quite high. LLD already spends a huge amount of its time on string
> merging for debug binaries. And SHF_COMPRESSED uses zlib (gzip's DEFLATE
> format), which can't decompress very fast (about 120MB/s on the output side
> in a quick measurement I just did; DRAM bandwidth is roughly 100x that). So
> there may be a net loss in linking performance when the input binaries are
> hot in the disk cache.
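>
> A rough way to reproduce that kind of measurement (a minimal sketch; the
> all-'a' input below is highly compressible, so real .debug_* data would
> likely decompress slower than this reports):
>
> #include <zlib.h>
> #include <chrono>
> #include <cstdio>
> #include <vector>
>
> int main() {
>   // 256 MiB of synthetic data. Compress it once up front.
>   std::vector<unsigned char> Raw(256u << 20, 'a');
>   uLongf CompLen = compressBound(Raw.size());
>   std::vector<unsigned char> Comp(CompLen);
>   compress(Comp.data(), &CompLen, Raw.data(), Raw.size());
>
>   // Time a single one-shot decompression and report output-side MB/s.
>   std::vector<unsigned char> Out(Raw.size());
>   uLongf OutLen = Out.size();
>   auto T0 = std::chrono::steady_clock::now();
>   uncompress(Out.data(), &OutLen, Comp.data(), CompLen);
>   auto T1 = std::chrono::steady_clock::now();
>   double Secs = std::chrono::duration<double>(T1 - T0).count();
>   std::printf("%.1f MB/s on the output side\n", Raw.size() / 1e6 / Secs);
> }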
>

That depends on where your object files live. If your build system is
distributed, the network can be a bottleneck, and compressing sections can
be a net win. I agree that a faster algorithm than gzip would be better,
though.


> One way to mitigate this would be to use something like lz4 instead of
> gzip (which is designed for very fast decompression).
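>
> For reference, the lz4 decompression path is a single call (a minimal,
> hypothetical sketch; lld has no such support today, and SHF_COMPRESSED
> currently standardizes only zlib):
>
> #include <lz4.h>
>
> // Decompress one LZ4 block of SrcSize bytes into Dst. The uncompressed
> // size (DstCap) is known up front from the section header, just as it is
> // for SHF_COMPRESSED sections.
> bool decompressLZ4(const char *Src, int SrcSize, char *Dst, int DstCap) {
>   int N = LZ4_decompress_safe(Src, Dst, SrcSize, DstCap);
>   return N == DstCap; // a negative N means malformed input
> }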
>
> -- Sean Silva
>
>
>>
>> So if we were to take a more holistic approach, we might find that
>> compressing particularly 'large' sections is what's important - regardless
>> of whether they're debug or non-debug sections.
>>
>> On Mon, Nov 28, 2016 at 11:35 AM Rui Ueyama <ruiu at google.com> wrote:
>>
>>> This may be a silly question, but why do we compress some sections and
>>> not others? What are the criteria?
>>>
>>> On Mon, Nov 28, 2016 at 9:45 AM, David Blaikie <dblaikie at gmail.com>
>>> wrote:
>>>
>>>
>>>
>>> On Mon, Nov 28, 2016 at 9:26 AM Rui Ueyama <ruiu at google.com> wrote:
>>>
>>> On Mon, Nov 28, 2016 at 9:21 AM, David Blaikie <dblaikie at gmail.com>
>>> wrote:
>>>
>>> tangentially related to compressed sections: Currently, I take it, lld
>>> decompresses all compressed input sections into memory before producing
>>> output, yes? Is there any chance in the future that lld might use a more
>>> streaming approach to reduce memory overhead? (i.e., defer decompressing
>>> until output - and decompress/write out (possibly recompressing) in
>>> chunks, rather than necessarily whole sections or all sections)
>>>
>>>
>>> Interesting idea. LLD currently decompresses all live (non-gc'ed)
>>> sections in memory because they may contain mergeable strings or
>>> (theoretically) EH frames that need to be handled specially. But for
>>> regular sections, we could indeed use a streaming approach. It would not
>>> only save memory but also improve performance, because it eliminates one
>>> extra memory copy (we could write uncompressed data directly to the
>>> output buffer).
>>>
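>>> A streaming version could look roughly like this (a minimal sketch using
>>> zlib's z_stream API; the chunk size and helper name are illustrative):
>>>
>>> #include <zlib.h>
>>> #include <algorithm>
>>> #include <cstddef>
>>> #include <cstdint>
>>>
>>> // Inflate Src directly into the already-sized output buffer Dst in
>>> // 64 KiB steps, so no whole-section temporary allocation is needed.
>>> bool streamUncompress(const uint8_t *Src, size_t SrcSize,
>>>                       uint8_t *Dst, size_t DstSize) {
>>>   z_stream S = {};
>>>   if (inflateInit(&S) != Z_OK)
>>>     return false;
>>>   S.next_in = const_cast<uint8_t *>(Src);
>>>   S.avail_in = SrcSize;
>>>   size_t Done = 0;
>>>   int Ret = Z_OK;
>>>   while (Ret == Z_OK && Done < DstSize) {
>>>     size_t Chunk = std::min<size_t>(64 * 1024, DstSize - Done);
>>>     S.next_out = Dst + Done;
>>>     S.avail_out = Chunk;
>>>     Ret = inflate(&S, Z_NO_FLUSH);
>>>     Done += Chunk - S.avail_out;
>>>   }
>>>   inflateEnd(&S);
>>>   return Ret == Z_STREAM_END && Done == DstSize;
>>> }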
>>>
>>> Yep - even for strings, you could read/process the strings into the
>>> string map in chunks, rather than reading the whole buffer in and then
>>> inserting them all. (It also improves memory locality - you don't read it
>>> all in, then go back and take cache misses at the beginning of the buffer
>>> to process it.)
>>>
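>>> As a minimal sketch of that idea (hypothetical names; real code would
>>> also record section-piece offsets), scanning and inserting can be
>>> interleaved instead of split into two full passes over the buffer:
>>>
>>> #include "llvm/ADT/StringMap.h"
>>> #include "llvm/ADT/StringRef.h"
>>>
>>> // Walk a buffer of NUL-terminated strings once, inserting each string
>>> // into the map while its bytes are still warm in cache.
>>> void addStrings(llvm::StringRef Buf, llvm::StringMap<bool> &Map) {
>>>   while (!Buf.empty()) {
>>>     size_t Nul = Buf.find('\0');
>>>     if (Nul == llvm::StringRef::npos)
>>>       break; // malformed trailing data; a real linker would error out
>>>     Map.insert({Buf.take_front(Nul), true});
>>>     Buf = Buf.drop_front(Nul + 1);
>>>   }
>>> }
>>>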
>>> It comes to mind because of the memory optimizations I've been
>>> doing/looking at in the DWP tool.
>>>
>>>
>>>
>>>
>>> On Fri, Nov 25, 2016 at 12:15 PM Rui Ueyama via llvm-commits <
>>> llvm-commits at lists.llvm.org> wrote:
>>>
>>> Author: ruiu
>>> Date: Fri Nov 25 14:05:08 2016
>>> New Revision: 287946
>>>
>>> URL: http://llvm.org/viewvc/llvm-project?rev=287946&view=rev
>>> Log:
>>> Parallelize uncompress() and splitIntoPieces().
>>>
>>> Uncompressing section contents and splitting mergeable section contents
>>> into smaller chunks are heavy tasks. They scan entire section contents
>>> and do CPU-intensive work such as uncompressing zlib-compressed data
>>> or computing a hash value for each section piece.
>>>
>>> Luckily, these tasks are independent of each other, so we can process
>>> them with parallel_for_each. The number of input sections is large (as
>>> opposed to the number of output sections), so there is plenty of
>>> parallelism here.
>>>
>>> In fact, the current design of calling uncompress() and splitIntoPieces()
>>> in batch was chosen with this in mind. Basically, all we need to do here
>>> is replace `for` with `parallel_for_each`.
>>>
>>> It seems this patch improves latency significantly if linked programs
>>> contain debug info (which in turn contains lots of mergeable strings).
>>> For example, the latency to link Clang (debug build) improved by 20% on
>>> my machine, as shown below. Note that ld.gold took 19.2 seconds to do
>>> the same thing.
>>>
>>> Before:
>>>     30801.782712 task-clock (msec)         #    3.652 CPUs utilized            ( +-  2.59% )
>>>          104,084 context-switches          #    0.003 M/sec                    ( +-  1.02% )
>>>            5,063 cpu-migrations            #    0.164 K/sec                    ( +- 13.66% )
>>>        2,528,130 page-faults               #    0.082 M/sec                    ( +-  0.47% )
>>>   85,317,809,130 cycles                    #    2.770 GHz                      ( +-  2.62% )
>>>   67,352,463,373 stalled-cycles-frontend   #   78.94% frontend cycles idle     ( +-  3.06% )
>>>  <not supported> stalled-cycles-backend
>>>   44,295,945,493 instructions              #    0.52  insns per cycle
>>>                                            #    1.52  stalled cycles per insn  ( +-  0.44% )
>>>    8,572,384,877 branches                  #  278.308 M/sec                    ( +-  0.66% )
>>>      141,806,726 branch-misses             #    1.65% of all branches          ( +-  0.13% )
>>>
>>>      8.433424003 seconds time elapsed                                          ( +-  1.20% )
>>>
>>> After:
>>>     35523.764575 task-clock (msec)         #    5.265 CPUs utilized            ( +-  2.67% )
>>>          159,107 context-switches          #    0.004 M/sec                    ( +-  0.48% )
>>>            8,123 cpu-migrations            #    0.229 K/sec                    ( +- 23.34% )
>>>        2,372,483 page-faults               #    0.067 M/sec                    ( +-  0.36% )
>>>   98,395,342,152 cycles                    #    2.770 GHz                      ( +-  2.62% )
>>>   79,294,670,125 stalled-cycles-frontend   #   80.59% frontend cycles idle     ( +-  3.03% )
>>>  <not supported> stalled-cycles-backend
>>>   46,274,151,813 instructions              #    0.47  insns per cycle
>>>                                            #    1.71  stalled cycles per insn  ( +-  0.47% )
>>>    8,987,621,670 branches                  #  253.003 M/sec                    ( +-  0.60% )
>>>      148,900,624 branch-misses             #    1.66% of all branches          ( +-  0.27% )
>>>
>>>      6.747548004 seconds time elapsed                                          ( +-  0.40% )
>>>
>>> Modified:
>>>     lld/trunk/ELF/Driver.cpp
>>>     lld/trunk/ELF/InputSection.cpp
>>>
>>> Modified: lld/trunk/ELF/Driver.cpp
>>> URL: http://llvm.org/viewvc/llvm-project/lld/trunk/ELF/Driver.cpp?rev=287946&r1=287945&r2=287946&view=diff
>>> ==============================================================================
>>> --- lld/trunk/ELF/Driver.cpp (original)
>>> +++ lld/trunk/ELF/Driver.cpp Fri Nov 25 14:05:08 2016
>>> @@ -20,6 +20,7 @@
>>>  #include "Target.h"
>>>  #include "Writer.h"
>>>  #include "lld/Config/Version.h"
>>> +#include "lld/Core/Parallel.h"
>>>  #include "lld/Driver/Driver.h"
>>>  #include "llvm/ADT/StringExtras.h"
>>>  #include "llvm/ADT/StringSwitch.h"
>>> @@ -800,14 +801,15 @@ template <class ELFT> void LinkerDriver:
>>>
>>>    // MergeInputSection::splitIntoPieces needs to be called before
>>>    // any call of MergeInputSection::getOffset. Do that.
>>> -  for (InputSectionBase<ELFT> *S : Symtab.Sections) {
>>> -    if (!S->Live)
>>> -      continue;
>>> -    if (S->Compressed)
>>> -      S->uncompress();
>>> -    if (auto *MS = dyn_cast<MergeInputSection<ELFT>>(S))
>>> -      MS->splitIntoPieces();
>>> -  }
>>> +  parallel_for_each(Symtab.Sections.begin(), Symtab.Sections.end(),
>>> +                    [](InputSectionBase<ELFT> *S) {
>>> +                      if (!S->Live)
>>> +                        return;
>>> +                      if (S->Compressed)
>>> +                        S->uncompress();
>>> +                      if (auto *MS = dyn_cast<MergeInputSection<ELFT>>(S))
>>> +                        MS->splitIntoPieces();
>>> +                    });
>>>
>>>    // Write the result to the file.
>>>    writeResult<ELFT>();
>>>
>>> Modified: lld/trunk/ELF/InputSection.cpp
>>> URL: http://llvm.org/viewvc/llvm-project/lld/trunk/ELF/InputSection.cpp?rev=287946&r1=287945&r2=287946&view=diff
>>> ==============================================================================
>>> --- lld/trunk/ELF/InputSection.cpp (original)
>>> +++ lld/trunk/ELF/InputSection.cpp Fri Nov 25 14:05:08 2016
>>> @@ -22,6 +22,7 @@
>>>
>>>  #include "llvm/Support/Compression.h"
>>>  #include "llvm/Support/Endian.h"
>>> +#include <mutex>
>>>
>>>  using namespace llvm;
>>>  using namespace llvm::ELF;
>>> @@ -160,6 +161,8 @@ InputSectionBase<ELFT>::getRawCompressed
>>>    return {Data.slice(sizeof(*Hdr)), read64be(Hdr->Size)};
>>>  }
>>>
>>> +// Uncompress section contents. Note that this function is called
>>> +// from parallel_for_each, so it must be thread-safe.
>>>  template <class ELFT> void InputSectionBase<ELFT>::uncompress() {
>>>    if (!zlib::isAvailable())
>>>      fatal(toString(this) +
>>> @@ -179,7 +182,12 @@ template <class ELFT> void InputSectionB
>>>      std::tie(Buf, Size) = getRawCompressedData(Data);
>>>
>>>    // Uncompress Buf.
>>> -  char *OutputBuf = BAlloc.Allocate<char>(Size);
>>> +  char *OutputBuf;
>>> +  {
>>> +    static std::mutex Mu;
>>> +    std::lock_guard<std::mutex> Lock(Mu);
>>> +    OutputBuf = BAlloc.Allocate<char>(Size);
>>> +  }
>>>   if (zlib::uncompress(toStringRef(Buf), OutputBuf, Size) != zlib::StatusOK)
>>>      fatal(toString(this) + ": error while uncompressing section");
>>>    Data = ArrayRef<uint8_t>((uint8_t *)OutputBuf, Size);
>>> @@ -746,6 +754,12 @@ MergeInputSection<ELFT>::MergeInputSecti
>>>                                             StringRef Name)
>>>      : InputSectionBase<ELFT>(F, Header, Name, InputSectionBase<ELFT>::Merge) {}
>>>
>>> +// This function is called after we obtain a complete list of input sections
>>> +// that need to be linked. This is responsible to split section contents
>>> +// into small chunks for further processing.
>>> +//
>>> +// Note that this function is called from parallel_for_each. This must be
>>> +// thread-safe (i.e. no memory allocation from the pools).
>>>  template <class ELFT> void MergeInputSection<ELFT>::splitIntoPieces() {
>>>    ArrayRef<uint8_t> Data = this->Data;
>>>    uintX_t EntSize = this->Entsize;
>>>
>>>