[lld] r287946 - Parallelize uncompress() and splitIntoPieces().
Sean Silva via llvm-commits
llvm-commits at lists.llvm.org
Wed Nov 30 18:54:29 PST 2016
On Wed, Nov 30, 2016 at 10:43 AM, Rui Ueyama <ruiu at google.com> wrote:
> On Wed, Nov 30, 2016 at 1:29 AM, Sean Silva <chisophugis at gmail.com> wrote:
>
>>
>>
>> On Mon, Nov 28, 2016 at 11:52 AM, David Blaikie via llvm-commits <
>> llvm-commits at lists.llvm.org> wrote:
>>
>>> You mean why are some debug sections compressed and not others? LLVM
>>> compresses any section where the compressed size is smaller than the
>>> uncompressed size (i.e. we don't compress really small sections where the
>>> overhead is greater than the benefit).
>>>
>>> Or if you mean: why do we compress debug sections but not non-debug
>>> sections? Probably because people who cared about debug info implemented it
>>> and no one looked at the overall benefit. And also, probably, the benefit of
>>> compressing the (very large) debug sections was worth the compute overhead
>>> of compressing/decompressing.
>>>
>>
>> I actually wonder about this. For LLD, the cost of decompressing is
>> likely to be quite high. LLD already spends a huge amount of its time on
>> string merging for debug binaries. And SHF_COMPRESSED uses gzip, which
>> can't decompress super fast (about 120MB/s on the output side in a quick
>> measurement I just did; DRAM bandwidth is about 100x that). So there may be
>> a net loss in linking performance when the input binaries are hot in the
>> disk cache.
>>
>
> That depends on how close your object files are. If your build system is
> distributed, the network can be a bottleneck, and compressing sections can
> be a net win. I agree that faster algorithms than gzip would be better,
> though.
>
The networking layer used to distribute the object files might already have
its own compression. Same for the underlying filesystem.
-- Sean Silva
>
>
>> One way to mitigate this would be to use something like LZ4 (which is
>> designed for very fast decompression) instead of gzip.
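>>
>> To sketch what I mean (hypothetical: this calls <lz4.h> directly rather
>> than anything in LLVM, and the helper name is made up):
>>
>>   #include <lz4.h>
>>   #include <stdexcept>
>>   #include <string>
>>
>>   // Decompress one LZ4 block. The uncompressed size is assumed to be
>>   // recorded in the section header, as it is for compressed debug sections.
>>   std::string decompressSection(const char *Buf, int CompressedSize,
>>                                 int UncompressedSize) {
>>     std::string Out(UncompressedSize, '\0');
>>     int N = LZ4_decompress_safe(Buf, &Out[0], CompressedSize, UncompressedSize);
>>     if (N != UncompressedSize)
>>       throw std::runtime_error("error while uncompressing section");
>>     return Out;
>>   }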
>>
>> -- Sean Silva
>>
>>
>>>
>>> So if we were to take a more holistic approach we might find that
>>> compressing particularly 'large' sections is what's important - regardless
>>> of whether they're debug or non-debug sections.
>>>
>>> On Mon, Nov 28, 2016 at 11:35 AM Rui Ueyama <ruiu at google.com> wrote:
>>>
>>>> This may be a silly question, but why do we compress some sections and
>>>> not others? What are the criteria?
>>>>
>>>> On Mon, Nov 28, 2016 at 9:45 AM, David Blaikie <dblaikie at gmail.com>
>>>> wrote:
>>>>
>>>>
>>>>
>>>> On Mon, Nov 28, 2016 at 9:26 AM Rui Ueyama <ruiu at google.com> wrote:
>>>>
>>>> On Mon, Nov 28, 2016 at 9:21 AM, David Blaikie <dblaikie at gmail.com>
>>>> wrote:
>>>>
>>>> Tangentially related to compressed sections: currently, I take it, lld
>>>> decompresses all compressed input sections into memory before producing
>>>> output, yes? Is there any chance in the future that lld might use a more
>>>> streaming approach to reduce memory overhead? (i.e. defer decompressing
>>>> until output, and decompress/write out (possibly recompressing) in chunks,
>>>> rather than necessarily whole sections or all sections)
>>>>
>>>>
>>>> Interesting idea. LLD currently decompresses all live (non-gc'ed)
>>>> sections in memory because they may contain mergeable strings or
>>>> (theoretically) EH frames that need to be handled specially. But for
>>>> regular sections, we could indeed use a streaming approach. It would not
>>>> only save memory but also improve performance, because it eliminates one
>>>> extra memory copy (we can write uncompressed data directly to the output
>>>> buffer).
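>>>>
>>>> A minimal sketch of what that direct-to-output path could look like,
>>>> using zlib's raw inflate API; `Out` stands in for a slice of the mmap'ed
>>>> output buffer, and the names are made up rather than LLD's actual code:
>>>>
>>>>   #include <zlib.h>
>>>>   #include <cstring>
>>>>
>>>>   // Inflate a compressed section body straight into the output buffer,
>>>>   // skipping the intermediate heap copy. Returns true on success.
>>>>   bool uncompressInto(const unsigned char *In, size_t InSize,
>>>>                       unsigned char *Out, size_t OutSize) {
>>>>     z_stream S;
>>>>     std::memset(&S, 0, sizeof(S));
>>>>     if (inflateInit(&S) != Z_OK)
>>>>       return false;
>>>>     S.next_in = const_cast<unsigned char *>(In);
>>>>     S.avail_in = static_cast<uInt>(InSize);
>>>>     S.next_out = Out;
>>>>     S.avail_out = static_cast<uInt>(OutSize);
>>>>     int R = inflate(&S, Z_FINISH);
>>>>     inflateEnd(&S);
>>>>     return R == Z_STREAM_END && S.total_out == OutSize;
>>>>   }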
>>>>
>>>>
>>>> Yep - even for strings, you could read/process the strings into the
>>>> stringmap in chunks, rather than reading the whole buffer in and then
>>>> inserting them all. (This also improves memory locality: you don't read it
>>>> all in, then go back and start cache-missing on the beginning of the buffer
>>>> to process it.)
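>>>>
>>>> As a sketch of that chunked approach (not lld code: a plain
>>>> std::unordered_set stands in for the merged-string table, and error
>>>> handling is glossed over):
>>>>
>>>>   #include <zlib.h>
>>>>   #include <string>
>>>>   #include <unordered_set>
>>>>
>>>>   // Inflate a compressed string section in 64K chunks and insert each
>>>>   // NUL-terminated string into the merge table while its bytes are still
>>>>   // cache-hot. `S` must already have next_in/avail_in pointing at the
>>>>   // compressed section contents (and inflateInit must have been called).
>>>>   void mergeStringsStreaming(z_stream &S,
>>>>                              std::unordered_set<std::string> &Table) {
>>>>     char Chunk[64 * 1024];
>>>>     std::string Partial; // carries a string split across chunk boundaries
>>>>     int R;
>>>>     do {
>>>>       S.next_out = reinterpret_cast<Bytef *>(Chunk);
>>>>       S.avail_out = sizeof(Chunk);
>>>>       R = inflate(&S, Z_NO_FLUSH);
>>>>       size_t N = sizeof(Chunk) - S.avail_out;
>>>>       for (size_t I = 0; I != N; ++I) {
>>>>         if (Chunk[I] == '\0') {
>>>>           Table.insert(Partial); // one complete merge candidate
>>>>           Partial.clear();
>>>>         } else {
>>>>           Partial += Chunk[I];
>>>>         }
>>>>       }
>>>>     } while (R == Z_OK);
>>>>   }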
>>>>
>>>> It comes to mind because of the memory optimizations I've been
>>>> doing/looking at in the DWP tool.
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Nov 25, 2016 at 12:15 PM Rui Ueyama via llvm-commits <
>>>> llvm-commits at lists.llvm.org> wrote:
>>>>
>>>> Author: ruiu
>>>> Date: Fri Nov 25 14:05:08 2016
>>>> New Revision: 287946
>>>>
>>>> URL: http://llvm.org/viewvc/llvm-project?rev=287946&view=rev
>>>> Log:
>>>> Parallelize uncompress() and splitIntoPieces().
>>>>
>>>> Uncompressing section contents and splitting mergeable section contents
>>>> into smaller chunks are heavy tasks. They scan entire section contents
>>>> and do CPU-intensive work such as uncompressing zlib-compressed data
>>>> or computing a hash value for each section piece.
>>>>
>>>> Luckily, these tasks are independent of each other, so we can do them
>>>> with parallel_for_each. The number of input sections is large (as opposed
>>>> to the number of output sections), so there's a lot of parallelism here.
>>>>
>>>> Actually, the current design of calling uncompress() and splitIntoPieces()
>>>> in batch was chosen with this in mind. Basically, all we need to do here
>>>> is replace `for` with `parallel_for_each`.
>>>>
>>>> It seems this patch improves latency significantly if linked programs
>>>> contain debug info (which in turn contains lots of mergeable strings).
>>>> For example, the latency to link Clang (debug build) improved by 20% on
>>>> my machine, as shown below. Note that ld.gold took 19.2 seconds to do
>>>> the same thing.
>>>>
>>>> Before:
>>>>       30801.782712 task-clock (msec)        #  3.652 CPUs utilized            ( +-  2.59% )
>>>>            104,084 context-switches         #  0.003 M/sec                    ( +-  1.02% )
>>>>              5,063 cpu-migrations           #  0.164 K/sec                    ( +- 13.66% )
>>>>          2,528,130 page-faults              #  0.082 M/sec                    ( +-  0.47% )
>>>>     85,317,809,130 cycles                   #  2.770 GHz                      ( +-  2.62% )
>>>>     67,352,463,373 stalled-cycles-frontend  # 78.94% frontend cycles idle     ( +-  3.06% )
>>>>    <not supported> stalled-cycles-backend
>>>>     44,295,945,493 instructions             #  0.52 insns per cycle
>>>>                                             #  1.52 stalled cycles per insn   ( +-  0.44% )
>>>>      8,572,384,877 branches                 # 278.308 M/sec                   ( +-  0.66% )
>>>>        141,806,726 branch-misses            #  1.65% of all branches          ( +-  0.13% )
>>>>
>>>>        8.433424003 seconds time elapsed                                       ( +-  1.20% )
>>>>
>>>> After:
>>>>       35523.764575 task-clock (msec)        #  5.265 CPUs utilized            ( +-  2.67% )
>>>>            159,107 context-switches         #  0.004 M/sec                    ( +-  0.48% )
>>>>              8,123 cpu-migrations           #  0.229 K/sec                    ( +- 23.34% )
>>>>          2,372,483 page-faults              #  0.067 M/sec                    ( +-  0.36% )
>>>>     98,395,342,152 cycles                   #  2.770 GHz                      ( +-  2.62% )
>>>>     79,294,670,125 stalled-cycles-frontend  # 80.59% frontend cycles idle     ( +-  3.03% )
>>>>    <not supported> stalled-cycles-backend
>>>>     46,274,151,813 instructions             #  0.47 insns per cycle
>>>>                                             #  1.71 stalled cycles per insn   ( +-  0.47% )
>>>>      8,987,621,670 branches                 # 253.003 M/sec                   ( +-  0.60% )
>>>>        148,900,624 branch-misses            #  1.66% of all branches          ( +-  0.27% )
>>>>
>>>>        6.747548004 seconds time elapsed                                       ( +-  0.40% )
>>>>
>>>> Modified:
>>>> lld/trunk/ELF/Driver.cpp
>>>> lld/trunk/ELF/InputSection.cpp
>>>>
>>>> Modified: lld/trunk/ELF/Driver.cpp
>>>> URL: http://llvm.org/viewvc/llvm-project/lld/trunk/ELF/Driver.cpp?rev=287946&r1=287945&r2=287946&view=diff
>>>> ==============================================================================
>>>> --- lld/trunk/ELF/Driver.cpp (original)
>>>> +++ lld/trunk/ELF/Driver.cpp Fri Nov 25 14:05:08 2016
>>>> @@ -20,6 +20,7 @@
>>>> #include "Target.h"
>>>> #include "Writer.h"
>>>> #include "lld/Config/Version.h"
>>>> +#include "lld/Core/Parallel.h"
>>>> #include "lld/Driver/Driver.h"
>>>> #include "llvm/ADT/StringExtras.h"
>>>> #include "llvm/ADT/StringSwitch.h"
>>>> @@ -800,14 +801,15 @@ template <class ELFT> void LinkerDriver:
>>>>
>>>> // MergeInputSection::splitIntoPieces needs to be called before
>>>> // any call of MergeInputSection::getOffset. Do that.
>>>> - for (InputSectionBase<ELFT> *S : Symtab.Sections) {
>>>> - if (!S->Live)
>>>> - continue;
>>>> - if (S->Compressed)
>>>> - S->uncompress();
>>>> - if (auto *MS = dyn_cast<MergeInputSection<ELFT>>(S))
>>>> - MS->splitIntoPieces();
>>>> - }
>>>> + parallel_for_each(Symtab.Sections.begin(), Symtab.Sections.end(),
>>>> + [](InputSectionBase<ELFT> *S) {
>>>> + if (!S->Live)
>>>> + return;
>>>> + if (S->Compressed)
>>>> + S->uncompress();
>>>> + if (auto *MS = dyn_cast<MergeInputSection<ELFT>>(S))
>>>> + MS->splitIntoPieces();
>>>> + });
>>>>
>>>> // Write the result to the file.
>>>> writeResult<ELFT>();
>>>>
>>>> Modified: lld/trunk/ELF/InputSection.cpp
>>>> URL: http://llvm.org/viewvc/llvm-project/lld/trunk/ELF/InputSection.cpp?rev=287946&r1=287945&r2=287946&view=diff
>>>> ==============================================================================
>>>> --- lld/trunk/ELF/InputSection.cpp (original)
>>>> +++ lld/trunk/ELF/InputSection.cpp Fri Nov 25 14:05:08 2016
>>>> @@ -22,6 +22,7 @@
>>>>
>>>> #include "llvm/Support/Compression.h"
>>>> #include "llvm/Support/Endian.h"
>>>> +#include <mutex>
>>>>
>>>> using namespace llvm;
>>>> using namespace llvm::ELF;
>>>> @@ -160,6 +161,8 @@ InputSectionBase<ELFT>::getRawCompressed
>>>> return {Data.slice(sizeof(*Hdr)), read64be(Hdr->Size)};
>>>> }
>>>>
>>>> +// Uncompress section contents. Note that this function is called
>>>> +// from parallel_for_each, so it must be thread-safe.
>>>> template <class ELFT> void InputSectionBase<ELFT>::uncompress() {
>>>> if (!zlib::isAvailable())
>>>> fatal(toString(this) +
>>>> @@ -179,7 +182,12 @@ template <class ELFT> void InputSectionB
>>>> std::tie(Buf, Size) = getRawCompressedData(Data);
>>>>
>>>> // Uncompress Buf.
>>>> - char *OutputBuf = BAlloc.Allocate<char>(Size);
>>>> + char *OutputBuf;
>>>> + {
>>>> + static std::mutex Mu;
>>>> + std::lock_guard<std::mutex> Lock(Mu);
>>>> + OutputBuf = BAlloc.Allocate<char>(Size);
>>>> + }
>>>> if (zlib::uncompress(toStringRef(Buf), OutputBuf, Size) !=
>>>> zlib::StatusOK)
>>>> fatal(toString(this) + ": error while uncompressing section");
>>>> Data = ArrayRef<uint8_t>((uint8_t *)OutputBuf, Size);
>>>> @@ -746,6 +754,12 @@ MergeInputSection<ELFT>::MergeInputSecti
>>>> StringRef Name)
>>>> : InputSectionBase<ELFT>(F, Header, Name,
>>>> InputSectionBase<ELFT>::Merge) {}
>>>>
>>>> +// This function is called after we obtain a complete list of input sections
>>>> +// that need to be linked. This is responsible to split section contents
>>>> +// into small chunks for further processing.
>>>> +//
>>>> +// Note that this function is called from parallel_for_each. This must be
>>>> +// thread-safe (i.e. no memory allocation from the pools).
>>>> template <class ELFT> void MergeInputSection<ELFT>::splitIntoPieces() {
>>>> ArrayRef<uint8_t> Data = this->Data;
>>>> uintX_t EntSize = this->Entsize;
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>