[PATCH] D27152: Merge strings using sharded hash tables.
Rui Ueyama via llvm-commits
llvm-commits at lists.llvm.org
Sat Dec 10 09:26:10 PST 2016
On Fri, Dec 9, 2016 at 8:30 PM, Rafael Avila de Espindola <
rafael.espindola at gmail.com> wrote:
>
> So, my only comment is that this seems to be a bit too much effort to
> optimize string merging in a multi threaded environment. We should
> really look into what .dwo gets us and then see if there are still so
> many strings left to merge.
>
Do you mean you want this patch not to be submitted? Split DWARF is one
good thing, but I think this is also useful for a common use case.
Overall, this patch adds 170 lines and deletes 57. If you subtract
comment lines, it adds fewer than 100 lines. I don't think that's
too complicated.
> Cheers,
> Rafael
>
>
> Rui Ueyama via Phabricator via llvm-commits
> <llvm-commits at lists.llvm.org> writes:
>
> > ruiu created this revision.
> > ruiu added a reviewer: silvas.
> > ruiu added a subscriber: llvm-commits.
> >
> > This is another attempt to speed up string merging. You may want to
> > read the description of https://reviews.llvm.org/D27146 first.
> >
> > In this patch, I took a different approach than the probabilistic
> > algorithm used in https://reviews.llvm.org/D27146. Here is the
> algorithm.
> >
> > The original code has a single hash table to merge strings. Now we
> > have N hash tables, where N is the parallelism level (currently N=16).
> >
> > We invoke N threads. Each thread knows its thread index I where
> > 0 <= I < N. For each string S in a given string set, thread I adds S
> > to its own hash table only if hash(S) % N == I.
> >
> > When all threads are done, there are N string tables with all
> > duplicated strings being merged.
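
The scheme above can be sketched in standalone C++. This is a minimal model, not the patch itself: the name `shardedMerge` is mine, and the real code threads the same logic through lld's `forEachPiece` and `parallel_for`, with output alignment on top.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Sketch of the sharded merge: every thread scans the whole input, but
// thread I only inserts strings whose hash falls into its shard, so no
// locking is needed and the result is deterministic.
static std::vector<std::unordered_map<std::string, size_t>>
shardedMerge(const std::vector<std::string> &Strings, size_t N) {
  std::vector<std::unordered_map<std::string, size_t>> Shards(N);
  std::vector<std::thread> Threads;
  for (size_t I = 0; I < N; ++I) {
    Threads.emplace_back([&Strings, &Shards, N, I] {
      size_t Offset = 0;
      for (const std::string &S : Strings) {
        if (std::hash<std::string>()(S) % N != I)
          continue; // Another shard owns S; skipping still costs a hash.
        // insert() is a no-op for duplicates, so each distinct string
        // gets exactly one offset within its shard.
        auto P = Shards[I].insert({S, Offset});
        if (P.second)
          Offset += S.size();
      }
    });
  }
  for (std::thread &T : Threads)
    T.join();
  return Shards;
}
```

Since each string hashes into exactly one shard, the union of the shard tables contains every distinct string exactly once, which is why the patch can later lay the shards out back to back.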
> >
> > There are pros and cons of this algorithm compared to the
> > probabilistic one.
> >
> > Pros:
> >
> > - It naturally produces deterministic output.
> > - Output is guaranteed to be the smallest.
> >
> > Cons:
> >
> > - Slower than the probabilistic algorithm due to the work it needs to
> > do. N threads independently visit all strings, and because the number
> > of mergeable strings is too large, even just skipping them is fairly
> > expensive.
> >
> > On the other hand, the probabilistic algorithm doesn't need to skip
> > any element.
> > - Unlike the probabilistic algorithm, it degrades performance if the
> > number of available CPU cores is smaller than N, because we now have
> > more work to do in total than the original code.
> >
> > We can fix this if we had some way to know how many cores are idle.
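
Standard C++ does not expose idle-core counts, but a rough mitigation (my sketch, not part of the patch) is to cap the shard count at the machine's total hardware thread count, which at least avoids the worst case of N far exceeding the available parallelism:

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Hypothetical helper: clamp the shard count to the hardware thread
// count. This only sees total cores, not idle ones, so it is an
// approximation rather than a real load-aware heuristic.
static int chooseNumShards(int MaxShards = 16) {
  unsigned HW = std::thread::hardware_concurrency(); // may report 0
  if (HW == 0)
    return 1; // Unknown topology; fall back to a single shard.
  return std::min(MaxShards, static_cast<int>(HW));
}
```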
> >
> > Here are perf results. The probabilistic algorithm completed the same
> > task in 5.227 seconds, so this algorithm is slower than that.
> >
> > Before:
> >
> >    36095.759481 task-clock (msec)        # 5.539 CPUs utilized            ( +- 0.83% )
> >         191,033 context-switches         # 0.005 M/sec                    ( +- 0.22% )
> >           8,194 cpu-migrations           # 0.227 K/sec                    ( +- 12.24% )
> >       2,342,017 page-faults              # 0.065 M/sec                    ( +- 0.06% )
> >  99,758,779,851 cycles                   # 2.764 GHz                      ( +- 0.79% )
> >  80,526,137,412 stalled-cycles-frontend  # 80.72% frontend cycles idle    ( +- 0.95% )
> > <not supported> stalled-cycles-backend
> >  46,308,518,501 instructions             # 0.46 insns per cycle
> >                                          # 1.74 stalled cycles per insn   ( +- 0.12% )
> >   8,962,860,074 branches                 # 248.308 M/sec                  ( +- 0.17% )
> >     149,264,611 branch-misses            # 1.67% of all branches          ( +- 0.06% )
> >
> >     6.517101649 seconds time elapsed                                      ( +- 0.42% )
> >
> > After:
> >
> >    45346.098328 task-clock (msec)        # 8.002 CPUs utilized            ( +- 0.77% )
> >         165,487 context-switches         # 0.004 M/sec                    ( +- 0.24% )
> >           7,455 cpu-migrations           # 0.164 K/sec                    ( +- 11.13% )
> >       2,347,870 page-faults              # 0.052 M/sec                    ( +- 0.84% )
> > 125,725,992,168 cycles                   # 2.773 GHz                      ( +- 0.76% )
> >  96,550,047,016 stalled-cycles-frontend  # 76.79% frontend cycles idle    ( +- 0.89% )
> > <not supported> stalled-cycles-backend
> >  79,847,589,597 instructions             # 0.64 insns per cycle
> >                                          # 1.21 stalled cycles per insn   ( +- 0.22% )
> >  13,569,202,477 branches                 # 299.236 M/sec                  ( +- 0.28% )
> >     200,343,507 branch-misses            # 1.48% of all branches          ( +- 0.16% )
> >
> >     5.666585908 seconds time elapsed                                      ( +- 0.67% )
> >
> > To conclude, I lean towards the probabilistic algorithm if we can
> > make its output deterministic, since it's faster in any situation
> > (except for pathological inputs in which our assumption that most
> > duplicated strings are spread across inputs doesn't hold).
> >
> >
> > https://reviews.llvm.org/D27152
> >
> > Files:
> > ELF/InputSection.h
> > ELF/OutputSections.cpp
> > ELF/OutputSections.h
> >
> > Index: ELF/OutputSections.h
> > ===================================================================
> > --- ELF/OutputSections.h
> > +++ ELF/OutputSections.h
> > @@ -16,12 +16,14 @@
> > #include "lld/Core/LLVM.h"
> > #include "llvm/MC/StringTableBuilder.h"
> > #include "llvm/Object/ELF.h"
> > +#include <functional>
> >
> > namespace lld {
> > namespace elf {
> >
> > class SymbolBody;
> > struct EhSectionPiece;
> > +struct SectionPiece;
> > template <class ELFT> class EhInputSection;
> > template <class ELFT> class InputSection;
> > template <class ELFT> class InputSectionBase;
> > @@ -142,9 +144,12 @@
> > private:
> > void finalizeTailMerge();
> > void finalizeNoTailMerge();
> > +  void forEachPiece(
> > +      std::function<void(SectionPiece &Piece, llvm::CachedHashStringRef S)> Fn);
> >
> > llvm::StringTableBuilder Builder;
> > std::vector<MergeInputSection<ELFT> *> Sections;
> > + size_t StringAlignment;
> > };
> >
> > struct CieRecord {
> > Index: ELF/OutputSections.cpp
> > ===================================================================
> > --- ELF/OutputSections.cpp
> > +++ ELF/OutputSections.cpp
> > @@ -21,6 +21,7 @@
> > #include "llvm/Support/MD5.h"
> > #include "llvm/Support/MathExtras.h"
> > #include "llvm/Support/SHA1.h"
> > +#include <atomic>
> >
> > using namespace llvm;
> > using namespace llvm::dwarf;
> > @@ -470,10 +471,21 @@
> > MergeOutputSection<ELFT>::MergeOutputSection(StringRef Name, uint32_t Type,
> >                                              uintX_t Flags, uintX_t Alignment)
> >     : OutputSectionBase(Name, Type, Flags),
> > -      Builder(StringTableBuilder::RAW, Alignment) {}
> > +      Builder(StringTableBuilder::RAW, Alignment), StringAlignment(Alignment) {
> > +  assert(Alignment != 0 && isPowerOf2_64(Alignment));
> > +}
> >
> > template <class ELFT> void MergeOutputSection<ELFT>::writeTo(uint8_t *Buf) {
> > - Builder.write(Buf);
> > + if (shouldTailMerge()) {
> > + Builder.write(Buf);
> > + return;
> > + }
> > +
> > + // Builder is not used for sharded string table construction.
> > + forEachPiece([&](SectionPiece &Piece, CachedHashStringRef S) {
> > + if (Piece.First)
> > + memcpy(Buf + Piece.OutputOff, S.val().data(), S.size());
> > + });
> > }
> >
> > template <class ELFT>
> > @@ -524,11 +536,78 @@
> > this->Size = Builder.getSize();
> > }
> >
> > +static size_t align2(size_t Val, size_t Alignment) {
> > + return (Val + Alignment - 1) & ~(Alignment - 1);
> > +}
> > +
> > +// Call Fn for each section piece.
> > +template <class ELFT>
> > +void MergeOutputSection<ELFT>::forEachPiece(
> > +    std::function<void(SectionPiece &Piece, CachedHashStringRef S)> Fn) {
> > + for (MergeInputSection<ELFT> *Sec : Sections)
> > + for (size_t I = 0, E = Sec->Pieces.size(); I != E; ++I)
> > + if (Sec->Pieces[I].Live)
> > + Fn(Sec->Pieces[I], Sec->getData(I));
> > +}
> > +
> > +// Split a vector Vec into smaller vectors.
> > +template <class T>
> > +static std::vector<std::vector<T>> split(std::vector<T> Vec, size_t NumShards) {
> > + std::vector<std::vector<T>> Ret(NumShards);
> > + size_t I = 0;
> > + for (T &Elem : Vec)
> > + Ret[I++ % NumShards].push_back(Elem);
> > + return Ret;
> > +}
> > +
> > template <class ELFT> void MergeOutputSection<ELFT>::finalize() {
> > - if (shouldTailMerge())
> > + if (shouldTailMerge()) {
> > finalizeTailMerge();
> > - else
> > - finalizeNoTailMerge();
> > + return;
> > + }
> > +
> > + const int NumShards = 16;
> > + DenseMap<CachedHashStringRef, size_t> OffsetMap[NumShards];
> > + size_t ShardSize[NumShards];
> > +
> > + // Construct NumShards number of string tables in parallel.
> > + parallel_for(0, NumShards, [&](int Idx) {
> > + size_t Offset = 0;
> > + forEachPiece([&](SectionPiece &Piece, CachedHashStringRef S) {
> > + if (S.hash() % NumShards != Idx)
> > + return;
> > +
> > + size_t Off = align2(Offset, StringAlignment);
> > + auto P = OffsetMap[Idx].insert({S, Off});
> > + if (P.second) {
> > + Piece.First = true;
> > + Piece.OutputOff = Off;
> > + Offset = Off + S.size();
> > + } else {
> > + Piece.OutputOff = P.first->second;
> > + }
> > + });
> > + ShardSize[Idx] = Offset;
> > + });
> > +
> > + // Piece.OutputOff was set independently, so we need to fix it.
> > +  // First, we compute the starting offset in the string table for each shard.
> > + size_t ShardOffset[NumShards];
> > + ShardOffset[0] = 0;
> > + for (int I = 1; I != NumShards; ++I)
> > + ShardOffset[I] = ShardOffset[I - 1] + ShardSize[I - 1];
> > +
> > + // Add a shard starting offset to each section piece.
> > +  parallel_for_each(Sections.begin(), Sections.end(),
> > +                    [&](MergeInputSection<ELFT> *Sec) {
> > +                      for (size_t I = 0, E = Sec->Pieces.size(); I != E; ++I)
> > +                        if (Sec->Pieces[I].Live)
> > +                          Sec->Pieces[I].OutputOff +=
> > +                              ShardOffset[Sec->getData(I).hash() % NumShards];
> > +                    });
> > +
> > + // Set the size of this output section.
> > + this->Size = ShardOffset[NumShards - 1] + ShardSize[NumShards - 1];
> > }
> >
> > template <class ELFT>
> > Index: ELF/InputSection.h
> > ===================================================================
> > --- ELF/InputSection.h
> > +++ ELF/InputSection.h
> > @@ -160,11 +160,13 @@
> > // be found by looking at the next one) and put the hash in a side table.
> > struct SectionPiece {
> > SectionPiece(size_t Off, bool Live = false)
> > -    : InputOff(Off), OutputOff(-1), Live(Live || !Config->GcSections) {}
> > + : InputOff(Off), Live(Live || !Config->GcSections), OutputOff(-1),
> > + First(false) {}
> >
> > - size_t InputOff;
> > - ssize_t OutputOff : 8 * sizeof(ssize_t) - 1;
> > + size_t InputOff : 8 * sizeof(size_t) - 1;
> > size_t Live : 1;
> > + ssize_t OutputOff : 8 * sizeof(ssize_t) - 1;
> > + size_t First : 1;
> > };
> > static_assert(sizeof(SectionPiece) == 2 * sizeof(size_t),
> > "SectionPiece is too big");
>