[PATCH] D27152: Merge strings using sharded hash tables.

Rafael Avila de Espindola via llvm-commits llvm-commits at lists.llvm.org
Fri Dec 9 20:30:33 PST 2016


So, my only comment is that this seems to be a bit too much effort to
optimize string merging in a multithreaded environment. We should
really look into what .dwo gets us and then see whether there are still
that many strings left to merge.

Cheers,
Rafael


Rui Ueyama via Phabricator via llvm-commits
<llvm-commits at lists.llvm.org> writes:

> ruiu created this revision.
> ruiu added a reviewer: silvas.
> ruiu added a subscriber: llvm-commits.
>
> This is another attempt to speed up string merging. You want to read
> the description of https://reviews.llvm.org/D27146 first.
>
> In this patch, I took a different approach than the probabilistic
> algorithm used in https://reviews.llvm.org/D27146. Here is the algorithm.
>
> The original code has a single hash table to merge strings. Now we
> have N hash tables, where N is the parallelism level (currently N=16).
>
> We invoke N threads. Each thread knows its thread index I where
> 0 <= I < N. For each string S in a given string set, thread I adds S
> to its own hash table only if hash(S) % N == I.
>
> When all threads are done, there are N string tables with all
> duplicated strings being merged.
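>
> Here is a minimal stand-alone sketch of the scheme, using only standard
> library types. (The actual patch below uses lld's parallel_for and
> CachedHashStringRef, and also handles alignment and per-piece output
> offsets, which this sketch omits.)
>
>   #include <cstddef>
>   #include <functional>
>   #include <string>
>   #include <thread>
>   #include <unordered_map>
>   #include <vector>
>
>   constexpr int N = 16; // shard count
>
>   // Each shard maps a string to its offset within that shard's table.
>   void shardedMerge(const std::vector<std::string> &Strings,
>                     std::unordered_map<std::string, size_t> (&Shards)[N],
>                     size_t (&ShardSize)[N]) {
>     std::vector<std::thread> Threads;
>     for (int Idx = 0; Idx < N; ++Idx) {
>       Threads.emplace_back([&, Idx] {
>         size_t Offset = 0;
>         for (const std::string &S : Strings) {
>           if (std::hash<std::string>()(S) % N != Idx)
>             continue; // another shard owns this string
>           // The insert succeeds only for the first occurrence; duplicates
>           // reuse the offset recorded by that first insertion.
>           if (Shards[Idx].insert({S, Offset}).second)
>             Offset += S.size();
>         }
>         ShardSize[Idx] = Offset;
>       });
>     }
>     for (std::thread &T : Threads)
>       T.join();
>   }
>
> A prefix sum over ShardSize then gives each shard's starting offset in the
> combined output, which is added to every per-shard offset (the patch does
> this in a second parallel pass).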
>
> There are pros and cons of this algorithm compared to the
> probabilistic one.
>
> Pros:
>
> - It naturally produces deterministic output.
> - Output is guaranteed to be as small as possible.
>
> Cons:
>
> - Slower than the probabilistic algorithm due to the work it needs to do. N threads independently visit all strings, and because the number of mergeable strings is so large, even just skipping them is fairly expensive.
>
>   On the other hand, the probabilistic algorithm doesn't need to skip any element.
> - Unlike the probabilistic algorithm, it degrades performance if the number of available CPU cores is smaller than N, because we now have more work to do in total than the original code.
>
>   We could fix this if we had a way to know how many cores are idle; a rough sketch of one possible mitigation follows below.
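>
> As a rough mitigation sketch (my assumption, not part of this patch), the
> shard count could at least be clamped to the number of hardware threads.
> Note that std::thread::hardware_concurrency() reports the total hardware
> thread count, not how many cores are currently idle, so this only
> partially addresses the concern above:
>
>   #include <algorithm>
>   #include <thread>
>
>   // Hypothetical helper: cap the shard count at the number of hardware
>   // threads the implementation reports.
>   static int chooseNumShards(int MaxShards = 16) {
>     unsigned HW = std::thread::hardware_concurrency(); // may be 0 if unknown
>     return HW == 0 ? MaxShards : std::min<int>(MaxShards, static_cast<int>(HW));
>   }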
>
> Here are perf results. The probabilistic algorithm completed the same
> task in 5.227 seconds, so this algorithm is slower than that.
>
>   Before:
>   
>      36095.759481 task-clock (msec)         #    5.539 CPUs utilized            ( +-  0.83% )
>           191,033 context-switches          #    0.005 M/sec                    ( +-  0.22% )
>             8,194 cpu-migrations            #    0.227 K/sec                    ( +- 12.24% )
>         2,342,017 page-faults               #    0.065 M/sec                    ( +-  0.06% )
>    99,758,779,851 cycles                    #    2.764 GHz                      ( +-  0.79% )
>    80,526,137,412 stalled-cycles-frontend   #   80.72% frontend cycles idle     ( +-  0.95% )
>   <not supported> stalled-cycles-backend
>    46,308,518,501 instructions              #    0.46  insns per cycle
>                                             #    1.74  stalled cycles per insn  ( +-  0.12% )
>     8,962,860,074 branches                  #  248.308 M/sec                    ( +-  0.17% )
>       149,264,611 branch-misses             #    1.67% of all branches          ( +-  0.06% )
>   
>       6.517101649 seconds time elapsed                                          ( +-  0.42% )
>   
>   After:
>   
>      45346.098328 task-clock (msec)         #    8.002 CPUs utilized            ( +-  0.77% )
>           165,487 context-switches          #    0.004 M/sec                    ( +-  0.24% )
>             7,455 cpu-migrations            #    0.164 K/sec                    ( +- 11.13% )
>         2,347,870 page-faults               #    0.052 M/sec                    ( +-  0.84% )
>   125,725,992,168 cycles                    #    2.773 GHz                      ( +-  0.76% )
>    96,550,047,016 stalled-cycles-frontend   #   76.79% frontend cycles idle     ( +-  0.89% )
>   <not supported> stalled-cycles-backend
>    79,847,589,597 instructions              #    0.64  insns per cycle
>                                             #    1.21  stalled cycles per insn  ( +-  0.22% )
>    13,569,202,477 branches                  #  299.236 M/sec                    ( +-  0.28% )
>       200,343,507 branch-misses             #    1.48% of all branches          ( +-  0.16% )
>   
>       5.666585908 seconds time elapsed                                          ( +-  0.67% )
>
> To conclude, I lean towards the probabilistic algorithm if we can
> make its output deterministic, since it's faster in any situation
> (except for pathological inputs in which our assumption that most
> duplicated strings are spread across inputs doesn't hold).
>
>
> https://reviews.llvm.org/D27152
>
> Files:
>   ELF/InputSection.h
>   ELF/OutputSections.cpp
>   ELF/OutputSections.h
>
> Index: ELF/OutputSections.h
> ===================================================================
> --- ELF/OutputSections.h
> +++ ELF/OutputSections.h
> @@ -16,12 +16,14 @@
>  #include "lld/Core/LLVM.h"
>  #include "llvm/MC/StringTableBuilder.h"
>  #include "llvm/Object/ELF.h"
> +#include <functional>
>  
>  namespace lld {
>  namespace elf {
>  
>  class SymbolBody;
>  struct EhSectionPiece;
> +struct SectionPiece;
>  template <class ELFT> class EhInputSection;
>  template <class ELFT> class InputSection;
>  template <class ELFT> class InputSectionBase;
> @@ -142,9 +144,12 @@
>  private:
>    void finalizeTailMerge();
>    void finalizeNoTailMerge();
> +  void forEachPiece(
> +      std::function<void(SectionPiece &Piece, llvm::CachedHashStringRef S)> Fn);
>  
>    llvm::StringTableBuilder Builder;
>    std::vector<MergeInputSection<ELFT> *> Sections;
> +  size_t StringAlignment;
>  };
>  
>  struct CieRecord {
> Index: ELF/OutputSections.cpp
> ===================================================================
> --- ELF/OutputSections.cpp
> +++ ELF/OutputSections.cpp
> @@ -21,6 +21,7 @@
>  #include "llvm/Support/MD5.h"
>  #include "llvm/Support/MathExtras.h"
>  #include "llvm/Support/SHA1.h"
> +#include <atomic>
>  
>  using namespace llvm;
>  using namespace llvm::dwarf;
> @@ -470,10 +471,21 @@
>  MergeOutputSection<ELFT>::MergeOutputSection(StringRef Name, uint32_t Type,
>                                               uintX_t Flags, uintX_t Alignment)
>      : OutputSectionBase(Name, Type, Flags),
> -      Builder(StringTableBuilder::RAW, Alignment) {}
> +      Builder(StringTableBuilder::RAW, Alignment), StringAlignment(Alignment) {
> +  assert(Alignment != 0 && isPowerOf2_64(Alignment));
> +}
>  
>  template <class ELFT> void MergeOutputSection<ELFT>::writeTo(uint8_t *Buf) {
> -  Builder.write(Buf);
> +  if (shouldTailMerge()) {
> +    Builder.write(Buf);
> +    return;
> +  }
> +
> +  // Builder is not used for sharded string table construction.
> +  forEachPiece([&](SectionPiece &Piece, CachedHashStringRef S) {
> +    if (Piece.First)
> +      memcpy(Buf + Piece.OutputOff, S.val().data(), S.size());
> +  });
>  }
>  
>  template <class ELFT>
> @@ -524,11 +536,78 @@
>    this->Size = Builder.getSize();
>  }
>  
> +static size_t align2(size_t Val, size_t Alignment) {
> +  return (Val + Alignment - 1) & ~(Alignment - 1);
> +}
> +
> +// Call Fn for each section piece.
> +template <class ELFT>
> +void MergeOutputSection<ELFT>::forEachPiece(
> +    std::function<void(SectionPiece &Piece, CachedHashStringRef S)> Fn) {
> +  for (MergeInputSection<ELFT> *Sec : Sections)
> +    for (size_t I = 0, E = Sec->Pieces.size(); I != E; ++I)
> +      if (Sec->Pieces[I].Live)
> +        Fn(Sec->Pieces[I], Sec->getData(I));
> +}
> +
> +// Split a vector Vec into smaller vectors.
> +template <class T>
> +static std::vector<std::vector<T>> split(std::vector<T> Vec, size_t NumShards) {
> +  std::vector<std::vector<T>> Ret(NumShards);
> +  size_t I = 0;
> +  for (T &Elem : Vec)
> +    Ret[I++ % NumShards].push_back(Elem);
> +  return Ret;
> +}
> +
>  template <class ELFT> void MergeOutputSection<ELFT>::finalize() {
> -  if (shouldTailMerge())
> +  if (shouldTailMerge()) {
>      finalizeTailMerge();
> -  else
> -    finalizeNoTailMerge();
> +    return;
> +  }
> +
> +  const int NumShards = 16;
> +  DenseMap<CachedHashStringRef, size_t> OffsetMap[NumShards];
> +  size_t ShardSize[NumShards];
> +
> +  // Construct NumShards number of string tables in parallel.
> +  parallel_for(0, NumShards, [&](int Idx) {
> +    size_t Offset = 0;
> +    forEachPiece([&](SectionPiece &Piece, CachedHashStringRef S) {
> +      if (S.hash() % NumShards != Idx)
> +        return;
> +
> +      size_t Off = align2(Offset, StringAlignment);
> +      auto P = OffsetMap[Idx].insert({S, Off});
> +      if (P.second) {
> +        Piece.First = true;
> +        Piece.OutputOff = Off;
> +        Offset = Off + S.size();
> +      } else {
> +        Piece.OutputOff = P.first->second;
> +      }
> +    });
> +    ShardSize[Idx] = Offset;
> +  });
> +
> +  // Piece.OutputOff was set independently, so we need to fix it.
> +  // First, we compute starting offset in the string table for each shard.
> +  size_t ShardOffset[NumShards];
> +  ShardOffset[0] = 0;
> +  for (int I = 1; I != NumShards; ++I)
> +    ShardOffset[I] = ShardOffset[I - 1] + ShardSize[I - 1];
> +
> +  // Add a shard starting offset to each section piece.
> +  parallel_for_each(Sections.begin(), Sections.end(),
> +                    [&](MergeInputSection<ELFT> *Sec) {
> +                      for (size_t I = 0, E = Sec->Pieces.size(); I != E; ++I)
> +                        if (Sec->Pieces[I].Live)
> +                          Sec->Pieces[I].OutputOff +=
> +                              ShardOffset[Sec->getData(I).hash() % NumShards];
> +                    });
> +
> +  // Set the size of this output section.
> +  this->Size = ShardOffset[NumShards - 1] + ShardSize[NumShards - 1];
>  }
>  
>  template <class ELFT>
> Index: ELF/InputSection.h
> ===================================================================
> --- ELF/InputSection.h
> +++ ELF/InputSection.h
> @@ -160,11 +160,13 @@
>  // be found by looking at the next one) and put the hash in a side table.
>  struct SectionPiece {
>    SectionPiece(size_t Off, bool Live = false)
> -      : InputOff(Off), OutputOff(-1), Live(Live || !Config->GcSections) {}
> +      : InputOff(Off), Live(Live || !Config->GcSections), OutputOff(-1),
> +        First(false) {}
>  
> -  size_t InputOff;
> -  ssize_t OutputOff : 8 * sizeof(ssize_t) - 1;
> +  size_t InputOff : 8 * sizeof(size_t) - 1;
>    size_t Live : 1;
> +  ssize_t OutputOff : 8 * sizeof(ssize_t) - 1;
> +  size_t First : 1;
>  };
>  static_assert(sizeof(SectionPiece) == 2 * sizeof(size_t),
>                "SectionPiece is too big");