[PATCH] D27152: Merge strings using sharded hash tables.
Rui Ueyama via llvm-commits
llvm-commits at lists.llvm.org
Sat Dec 10 09:26:10 PST 2016
On Fri, Dec 9, 2016 at 8:30 PM, Rafael Avila de Espindola <
rafael.espindola at gmail.com> wrote:
>
> So, my only comment is that this seems to be a bit too much effort to
> optimize string merging in a multi threaded environment. We should
> really look into what .dwo gets us and then see if there are still so
> many strings left to merge.
>
Do you mean you want this patch not to be submitted? Split DWARF is one
good thing, but I think this is also useful for a common use case.
Overall, this patch adds 170 lines and deletes 57. If you subtract
comment lines, it adds fewer than 100 lines. I don't think that's
too complicated.
> Cheers,
> Rafael
>
>
> Rui Ueyama via Phabricator via llvm-commits
> <llvm-commits at lists.llvm.org> writes:
>
> > ruiu created this revision.
> > ruiu added a reviewer: silvas.
> > ruiu added a subscriber: llvm-commits.
> >
> > This is another attempt to speed up string merging. You may want to
> > read the description of https://reviews.llvm.org/D27146 first.
> >
> > In this patch, I took a different approach than the probabilistic
> > algorithm used in https://reviews.llvm.org/D27146. Here is the
> algorithm.
> >
> > The original code has a single hash table to merge strings. Now we
> > have N hash tables, where N is the parallelism level (currently N=16).
> >
> > We invoke N threads. Each thread knows its thread index I where
> > 0 <= I < N. For each string S in a given string set, thread I adds S
> > to its own hash table only if hash(S) % N == I.
> >
> > When all threads are done, there are N string tables with all
> > duplicated strings being merged.
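
The scheme above can be sketched in standalone C++. This is a minimal model, not the patch itself: the name `shardedMerge` is mine, and the real code threads the same logic through lld's `forEachPiece` and `parallel_for`, with output alignment on top.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Sketch of the sharded merge: every thread scans the whole input, but
// thread I only inserts strings whose hash falls into its shard, so no
// locking is needed and the result is deterministic.
static std::vector<std::unordered_map<std::string, size_t>>
shardedMerge(const std::vector<std::string> &Strings, size_t N) {
  std::vector<std::unordered_map<std::string, size_t>> Shards(N);
  std::vector<std::thread> Threads;
  for (size_t I = 0; I < N; ++I) {
    Threads.emplace_back([&Strings, &Shards, N, I] {
      size_t Offset = 0;
      for (const std::string &S : Strings) {
        if (std::hash<std::string>()(S) % N != I)
          continue; // Another shard owns S; skipping still costs a hash.
        // insert() is a no-op for duplicates, so each distinct string
        // gets exactly one offset within its shard.
        auto P = Shards[I].insert({S, Offset});
        if (P.second)
          Offset += S.size();
      }
    });
  }
  for (std::thread &T : Threads)
    T.join();
  return Shards;
}
```

Since each string hashes into exactly one shard, the union of the shard tables contains every distinct string exactly once, which is why the patch can later lay the shards out back to back.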
> >
> > There are pros and cons of this algorithm compared to the
> > probabilistic one.
> >
> > Pros:
> >
> > - It naturally produces deterministic output.
> > - Output is guaranteed to be the smallest.
> >
> > Cons:
> >
> > - Slower than the probabilistic algorithm due to the work it needs to
> > do. N threads independently visit all strings, and because the number
> > of mergeable strings is too large, even just skipping them is fairly
> > expensive.
> >
> > On the other hand, the probabilistic algorithm doesn't need to skip
> > any element.
> > - Unlike the probabilistic algorithm, it degrades performance if the
> > number of available CPU cores is smaller than N, because we now have
> > more work to do in total than the original code.
> >
> > We can fix this if we had some way to know how many cores are idle.
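
Standard C++ does not expose idle-core counts, but a rough mitigation (my sketch, not part of the patch) is to cap the shard count at the machine's total hardware thread count, which at least avoids the worst case of N far exceeding the available parallelism:

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Hypothetical helper: clamp the shard count to the hardware thread
// count. This only sees total cores, not idle ones, so it is an
// approximation rather than a real load-aware heuristic.
static int chooseNumShards(int MaxShards = 16) {
  unsigned HW = std::thread::hardware_concurrency(); // may report 0
  if (HW == 0)
    return 1; // Unknown topology; fall back to a single shard.
  return std::min(MaxShards, static_cast<int>(HW));
}
```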
> >
> > Here are perf results. The probabilistic algorithm completed the same
> > task in 5.227 seconds, so this algorithm is slower than that.
> >
> > Before:
> >
> >    36095.759481 task-clock (msec)        # 5.539 CPUs utilized            ( +- 0.83% )
> >         191,033 context-switches         # 0.005 M/sec                    ( +- 0.22% )
> >           8,194 cpu-migrations           # 0.227 K/sec                    ( +- 12.24% )
> >       2,342,017 page-faults              # 0.065 M/sec                    ( +- 0.06% )
> >  99,758,779,851 cycles                   # 2.764 GHz                      ( +- 0.79% )
> >  80,526,137,412 stalled-cycles-frontend  # 80.72% frontend cycles idle    ( +- 0.95% )
> > <not supported> stalled-cycles-backend
> >  46,308,518,501 instructions             # 0.46 insns per cycle
> >                                          # 1.74 stalled cycles per insn   ( +- 0.12% )
> >   8,962,860,074 branches                 # 248.308 M/sec                  ( +- 0.17% )
> >     149,264,611 branch-misses            # 1.67% of all branches          ( +- 0.06% )
> >
> >     6.517101649 seconds time elapsed                                      ( +- 0.42% )
> >
> > After:
> >
> >    45346.098328 task-clock (msec)        # 8.002 CPUs utilized            ( +- 0.77% )
> >         165,487 context-switches         # 0.004 M/sec                    ( +- 0.24% )
> >           7,455 cpu-migrations           # 0.164 K/sec                    ( +- 11.13% )
> >       2,347,870 page-faults              # 0.052 M/sec                    ( +- 0.84% )
> > 125,725,992,168 cycles                   # 2.773 GHz                      ( +- 0.76% )
> >  96,550,047,016 stalled-cycles-frontend  # 76.79% frontend cycles idle    ( +- 0.89% )
> > <not supported> stalled-cycles-backend
> >  79,847,589,597 instructions             # 0.64 insns per cycle
> >                                          # 1.21 stalled cycles per insn   ( +- 0.22% )
> >  13,569,202,477 branches                 # 299.236 M/sec                  ( +- 0.28% )
> >     200,343,507 branch-misses            # 1.48% of all branches          ( +- 0.16% )
> >
> >     5.666585908 seconds time elapsed                                      ( +- 0.67% )
> >
> > To conclude, I lean towards the probabilistic algorithm if we can
> > make its output deterministic, since it's faster in any situation
> > (except for pathological inputs in which our assumption that most
> > duplicated strings are spread across inputs doesn't hold).
> >
> >
> > https://reviews.llvm.org/D27152
> >
> > Files:
> > ELF/InputSection.h
> > ELF/OutputSections.cpp
> > ELF/OutputSections.h
> >
> > Index: ELF/OutputSections.h
> > ===================================================================
> > --- ELF/OutputSections.h
> > +++ ELF/OutputSections.h
> > @@ -16,12 +16,14 @@
> > #include "lld/Core/LLVM.h"
> > #include "llvm/MC/StringTableBuilder.h"
> > #include "llvm/Object/ELF.h"
> > +#include <functional>
> >
> > namespace lld {
> > namespace elf {
> >
> > class SymbolBody;
> > struct EhSectionPiece;
> > +struct SectionPiece;
> > template <class ELFT> class EhInputSection;
> > template <class ELFT> class InputSection;
> > template <class ELFT> class InputSectionBase;
> > @@ -142,9 +144,12 @@
> > private:
> > void finalizeTailMerge();
> > void finalizeNoTailMerge();
> > +  void forEachPiece(
> > +      std::function<void(SectionPiece &Piece, llvm::CachedHashStringRef S)> Fn);
> >
> > llvm::StringTableBuilder Builder;
> > std::vector<MergeInputSection<ELFT> *> Sections;
> > + size_t StringAlignment;
> > };
> >
> > struct CieRecord {
> > Index: ELF/OutputSections.cpp
> > ===================================================================
> > --- ELF/OutputSections.cpp
> > +++ ELF/OutputSections.cpp
> > @@ -21,6 +21,7 @@
> > #include "llvm/Support/MD5.h"
> > #include "llvm/Support/MathExtras.h"
> > #include "llvm/Support/SHA1.h"
> > +#include <atomic>
> >
> > using namespace llvm;
> > using namespace llvm::dwarf;
> > @@ -470,10 +471,21 @@
> > MergeOutputSection<ELFT>::MergeOutputSection(StringRef Name, uint32_t Type,
> >                                              uintX_t Flags, uintX_t Alignment)
> >     : OutputSectionBase(Name, Type, Flags),
> > -      Builder(StringTableBuilder::RAW, Alignment) {}
> > +      Builder(StringTableBuilder::RAW, Alignment), StringAlignment(Alignment) {
> > +  assert(Alignment != 0 && isPowerOf2_64(Alignment));
> > +}
> >
> > template <class ELFT> void MergeOutputSection<ELFT>::writeTo(uint8_t *Buf) {
> > - Builder.write(Buf);
> > + if (shouldTailMerge()) {
> > + Builder.write(Buf);
> > + return;
> > + }
> > +
> > + // Builder is not used for sharded string table construction.
> > + forEachPiece([&](SectionPiece &Piece, CachedHashStringRef S) {
> > + if (Piece.First)
> > + memcpy(Buf + Piece.OutputOff, S.val().data(), S.size());
> > + });
> > }
> >
> > template <class ELFT>
> > @@ -524,11 +536,78 @@
> > this->Size = Builder.getSize();
> > }
> >
> > +static size_t align2(size_t Val, size_t Alignment) {
> > + return (Val + Alignment - 1) & ~(Alignment - 1);
> > +}
> > +
> > +// Call Fn for each section piece.
> > +template <class ELFT>
> > +void MergeOutputSection<ELFT>::forEachPiece(
> > +    std::function<void(SectionPiece &Piece, CachedHashStringRef S)> Fn) {
> > + for (MergeInputSection<ELFT> *Sec : Sections)
> > + for (size_t I = 0, E = Sec->Pieces.size(); I != E; ++I)
> > + if (Sec->Pieces[I].Live)
> > + Fn(Sec->Pieces[I], Sec->getData(I));
> > +}
> > +
> > +// Split a vector Vec into smaller vectors.
> > +template <class T>
> > +static std::vector<std::vector<T>> split(std::vector<T> Vec, size_t NumShards) {
> > + std::vector<std::vector<T>> Ret(NumShards);
> > + size_t I = 0;
> > + for (T &Elem : Vec)
> > + Ret[I++ % NumShards].push_back(Elem);
> > + return Ret;
> > +}
> > +
> > template <class ELFT> void MergeOutputSection<ELFT>::finalize() {
> > - if (shouldTailMerge())
> > + if (shouldTailMerge()) {
> > finalizeTailMerge();
> > - else
> > - finalizeNoTailMerge();
> > + return;
> > + }
> > +
> > + const int NumShards = 16;
> > + DenseMap<CachedHashStringRef, size_t> OffsetMap[NumShards];
> > + size_t ShardSize[NumShards];
> > +
> > + // Construct NumShards number of string tables in parallel.
> > + parallel_for(0, NumShards, [&](int Idx) {
> > + size_t Offset = 0;
> > + forEachPiece([&](SectionPiece &Piece, CachedHashStringRef S) {
> > + if (S.hash() % NumShards != Idx)
> > + return;
> > +
> > + size_t Off = align2(Offset, StringAlignment);
> > + auto P = OffsetMap[Idx].insert({S, Off});
> > + if (P.second) {
> > + Piece.First = true;
> > + Piece.OutputOff = Off;
> > + Offset = Off + S.size();
> > + } else {
> > + Piece.OutputOff = P.first->second;
> > + }
> > + });
> > + ShardSize[Idx] = Offset;
> > + });
> > +
> > + // Piece.OutputOff was set independently, so we need to fix it.
> > +  // First, we compute the starting offset in the string table for each shard.
> > + size_t ShardOffset[NumShards];
> > + ShardOffset[0] = 0;
> > + for (int I = 1; I != NumShards; ++I)
> > + ShardOffset[I] = ShardOffset[I - 1] + ShardSize[I - 1];
> > +
> > + // Add a shard starting offset to each section piece.
> > +  parallel_for_each(Sections.begin(), Sections.end(),
> > +                    [&](MergeInputSection<ELFT> *Sec) {
> > +                      for (size_t I = 0, E = Sec->Pieces.size(); I != E; ++I)
> > +                        if (Sec->Pieces[I].Live)
> > +                          Sec->Pieces[I].OutputOff +=
> > +                              ShardOffset[Sec->getData(I).hash() % NumShards];
> > +                    });
> > +
> > + // Set the size of this output section.
> > + this->Size = ShardOffset[NumShards - 1] + ShardSize[NumShards - 1];
> > }
> >
> > template <class ELFT>
> > Index: ELF/InputSection.h
> > ===================================================================
> > --- ELF/InputSection.h
> > +++ ELF/InputSection.h
> > @@ -160,11 +160,13 @@
> > // be found by looking at the next one) and put the hash in a side table.
> > struct SectionPiece {
> > SectionPiece(size_t Off, bool Live = false)
> > -    : InputOff(Off), OutputOff(-1), Live(Live || !Config->GcSections) {}
> > + : InputOff(Off), Live(Live || !Config->GcSections), OutputOff(-1),
> > + First(false) {}
> >
> > - size_t InputOff;
> > - ssize_t OutputOff : 8 * sizeof(ssize_t) - 1;
> > + size_t InputOff : 8 * sizeof(size_t) - 1;
> > size_t Live : 1;
> > + ssize_t OutputOff : 8 * sizeof(ssize_t) - 1;
> > + size_t First : 1;
> > };
> > static_assert(sizeof(SectionPiece) == 2 * sizeof(size_t),
> > "SectionPiece is too big");
>