[PATCH] D27146: Merge strings using a probabilistic algorithm to reduce latency.

Rafael Avila de Espindola via llvm-commits llvm-commits at lists.llvm.org
Wed Dec 7 09:18:24 PST 2016


Sorry for being late to the thread; I was on a business trip last week.

Given that the vast majority of strings come from debug info, I would
like for us to investigate the impact of debug fission before going as
far as implementing probabilistic merging. (With split DWARF, e.g.
compiling with -gsplit-dwarf, most debug strings stay in per-object
.dwo files that the linker never reads.) If debug fission can avoid
the need for copying and merging so many strings, then that is
probably the best thing to do.

Thanks,
Rafael


Rui Ueyama via Phabricator <reviews at reviews.llvm.org> writes:

> ruiu created this revision.
> ruiu added reviewers: rafael, silvas.
> ruiu added a subscriber: llvm-commits.
>
> I'm sending this patch to get feedback. I haven't convinced even myself
> that this is the right thing to do. But it should be interesting
> to those who want to see what we can do to improve the linker's latency.
>
> String merging is one of the slowest passes in LLD because of the
> sheer number of mergeable strings. For example, Clang with debug info
> contains 30 million mergeable strings (the average length is about 50
> bytes). They need to be uniquified, and the unique strings then need
> to be assigned consecutive offsets in the resulting string table.
>
> Currently, we are using a (single-threaded, regular) dense map for
> string unification. Merging the 30 million strings takes about 2
> seconds on my machine.
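>
> For reference, the current scheme is conceptually just the following
> sequential loop (a simplified sketch, not the actual StringTableBuilder
> code; alignment is omitted, and buildStringTable/AllStrings are
> hypothetical names for illustration):
>
>   #include <string>
>   #include <unordered_map>
>   #include <vector>
>
>   // Assign each unique string the next free offset in the table and
>   // return the total table size.
>   size_t buildStringTable(const std::vector<std::string> &AllStrings) {
>     std::unordered_map<std::string, size_t> OffsetMap;
>     size_t Offset = 0;
>     for (const std::string &S : AllStrings) {
>       auto R = OffsetMap.insert({S, Offset});
>       if (R.second) // first occurrence: S occupies new space
>         Offset += S.size();
>     }
>     return Offset;
>   }
>
> Every string goes through a single hash table, which is the bottleneck
> this patch parallelizes around.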
>
> This patch implements one of my ideas about how to reduce latency by
> parallelizing string merging. The algorithm is probabilistic, meaning
> that although duplicated strings are likely to be merged, merging is
> not guaranteed. As a result, it produces a larger string table, but
> does so quickly. (If you need to optimize for size, you can still
> pass -O2, which enables tail merging.)
>
> Here's how it works.
>
> In the first step, we take 10% of the input string set to create a
> small string table. The resulting string table is very unlikely to
> contain all strings of the entire set, but it is likely to contain
> most of the duplicated strings, because duplicated strings are, by
> definition, repeated many times.
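>
> (Back-of-the-envelope, assuming pieces are spread independently across
> the sample: a string that occurs n times in the inputs is missed by a
> 10% sample with probability about 0.9^n, i.e. roughly 35% for n = 10
> but only about 0.003% for n = 100, so heavily duplicated strings are
> almost always caught in the first step.)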
>
> The second step processes the remaining 90% in parallel. In this
> step, we do not merge strings, so if a string is not in the small
> string table we created in the first step, it is simply appended to
> the end of the string table. This step completes the string table.
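>
> To make the two steps concrete, here is a minimal standalone sketch of
> the idea (hypothetical simplified types and names, one thread per
> shard for simplicity, alignment handling omitted; the actual patch
> below works on SectionPiece and uses parallel_for_each):
>
>   #include <atomic>
>   #include <string>
>   #include <thread>
>   #include <unordered_map>
>   #include <vector>
>
>   // A simplified stand-in for SectionPiece: a string plus the output
>   // offset it is assigned.
>   struct Piece {
>     std::string Str;
>     size_t OutputOff = 0;
>   };
>
>   // Returns the total string table size after assigning offsets.
>   size_t probabilisticMerge(std::vector<std::vector<Piece>> &Shards) {
>     std::unordered_map<std::string, size_t> OffsetMap;
>     std::atomic<size_t> Offset{0};
>     size_t NumShards = Shards.size();
>
>     // Step 1 (sequential): build a small table from ~10% of the
>     // shards. Heavily duplicated strings almost certainly appear here.
>     for (size_t I = 0; I < NumShards / 10; ++I) {
>       for (Piece &P : Shards[I]) {
>         auto R = OffsetMap.insert({P.Str, Offset.load()});
>         if (R.second) // new string: claim fresh space
>           P.OutputOff = Offset.fetch_add(P.Str.size());
>         else          // duplicate: reuse the existing offset
>           P.OutputOff = R.first->second;
>       }
>     }
>
>     // Step 2 (parallel): OffsetMap is now read-only, so threads only
>     // synchronize on the atomic offset counter. Strings that miss the
>     // table are appended without deduplication.
>     std::vector<std::thread> Workers;
>     for (size_t I = NumShards / 10; I < NumShards; ++I)
>       Workers.emplace_back([&Shards, &OffsetMap, &Offset, I] {
>         for (Piece &P : Shards[I]) {
>           auto It = OffsetMap.find(P.Str);
>           P.OutputOff = (It == OffsetMap.end())
>                             ? Offset.fetch_add(P.Str.size())
>                             : It->second;
>         }
>       });
>     for (std::thread &T : Workers)
>       T.join();
>     return Offset.load();
>   }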
>
> Here are some numbers of resulting clang executables:
>
>   Size of .debug_str section:
>   Current            108,049,822   (+0%)
>   Probabilistic      154,089,550   (+42.6%)
>   No string merging  1,591,388,940 (+1372.8%)
>   
>   Size of resulting file:
>   Current            1,440,453,528 (+0%)
>   Probabilistic      1,490,597,448 (+3.5%)
>   No string merging  2,945,020,808 (+104.5%)
>
> The probabilistic algorithm produces a larger string table, but it is
> still much smaller than the one produced without string merging.
> Compared to the entire executable size, the loss is only 3.5%.
>
> Here is a speedup in latency:
>
>   Before:
>   
>      36098.025468 task-clock (msec)         #    5.256 CPUs utilized            ( +-  0.95% )
>           190,770 context-switches          #    0.005 M/sec                    ( +-  0.25% )
>             7,609 cpu-migrations            #    0.211 K/sec                    ( +- 11.40% )
>         2,378,416 page-faults               #    0.066 M/sec                    ( +-  0.07% )
>    99,645,202,279 cycles                    #    2.760 GHz                      ( +-  0.94% )
>    81,128,226,367 stalled-cycles-frontend   #   81.42% frontend cycles idle     ( +-  1.10% )
>   <not supported> stalled-cycles-backend
>    45,662,681,567 instructions              #    0.46  insns per cycle
>                                             #    1.78  stalled cycles per insn  ( +-  0.14% )
>     8,864,616,311 branches                  #  245.571 M/sec                    ( +-  0.22% )
>       146,360,227 branch-misses             #    1.65% of all branches          ( +-  0.06% )
>   
>       6.868559257 seconds time elapsed                                          ( +-  0.50% )
>   
>   After:
>   
>      36905.733802 task-clock (msec)         #    7.061 CPUs utilized            ( +-  0.84% )
>           159,813 context-switches          #    0.004 M/sec                    ( +-  0.24% )
>             8,079 cpu-migrations            #    0.219 K/sec                    ( +- 12.67% )
>         2,296,298 page-faults               #    0.062 M/sec                    ( +-  0.21% )
>   102,178,380,224 cycles                    #    2.769 GHz                      ( +-  0.83% )
>    83,846,653,367 stalled-cycles-frontend   #   82.06% frontend cycles idle     ( +-  0.96% )
>   <not supported> stalled-cycles-backend
>    46,138,345,206 instructions              #    0.45  insns per cycle
>                                             #    1.82  stalled cycles per insn  ( +-  0.15% )
>     8,824,763,690 branches                  #  239.116 M/sec                    ( +-  0.24% )
>       142,482,338 branch-misses             #    1.61% of all branches          ( +-  0.05% )
>   
>       5.227024403 seconds time elapsed                                          ( +-  0.43% )
>
> In terms of latency, this algorithm is a clear win.
>
> With these results, I have a feeling that this algorithm could be a
> reasonable addition to LLD. For only a few percent loss in size, it
> reduces latency by about 25%, so it might be a good option for daily
> edit-build-test cycles. (By contrast, disabling string merging
> entirely with -O0 creates 2x larger executables, which is sometimes
> inconvenient even for a daily development cycle.) You can still pass
> -O2 to produce production binaries.
>
> I have another idea to reduce string merging latency, so I'll
> implement that later for comparison.
>
>
> https://reviews.llvm.org/D27146
>
> Files:
>   ELF/InputSection.h
>   ELF/OutputSections.cpp
>   ELF/OutputSections.h
>
> Index: ELF/OutputSections.h
> ===================================================================
> --- ELF/OutputSections.h
> +++ ELF/OutputSections.h
> @@ -16,12 +16,14 @@
>  #include "lld/Core/LLVM.h"
>  #include "llvm/MC/StringTableBuilder.h"
>  #include "llvm/Object/ELF.h"
> +#include <functional>
>  
>  namespace lld {
>  namespace elf {
>  
>  class SymbolBody;
>  struct EhSectionPiece;
> +struct SectionPiece;
>  template <class ELFT> class EhInputSection;
>  template <class ELFT> class InputSection;
>  template <class ELFT> class InputSectionBase;
> @@ -142,9 +144,13 @@
>  private:
>    void finalizeTailMerge();
>    void finalizeNoTailMerge();
> +  void forEachPiece(
> +      ArrayRef<MergeInputSection<ELFT> *> Sections,
> +      std::function<void(SectionPiece &Piece, llvm::CachedHashStringRef S)> Fn);
>  
>    llvm::StringTableBuilder Builder;
>    std::vector<MergeInputSection<ELFT> *> Sections;
> +  size_t StringAlignment;
>  };
>  
>  struct CieRecord {
> Index: ELF/OutputSections.cpp
> ===================================================================
> --- ELF/OutputSections.cpp
> +++ ELF/OutputSections.cpp
> @@ -21,6 +21,7 @@
>  #include "llvm/Support/MD5.h"
>  #include "llvm/Support/MathExtras.h"
>  #include "llvm/Support/SHA1.h"
> +#include <atomic>
>  
>  using namespace llvm;
>  using namespace llvm::dwarf;
> @@ -470,10 +471,26 @@
>  MergeOutputSection<ELFT>::MergeOutputSection(StringRef Name, uint32_t Type,
>                                               uintX_t Flags, uintX_t Alignment)
>      : OutputSectionBase(Name, Type, Flags),
> -      Builder(StringTableBuilder::RAW, Alignment) {}
> +      Builder(StringTableBuilder::RAW, Alignment), StringAlignment(Alignment) {
> +  assert(Alignment != 0 && isPowerOf2_64(Alignment));
> +}
>  
>  template <class ELFT> void MergeOutputSection<ELFT>::writeTo(uint8_t *Buf) {
> -  Builder.write(Buf);
> +  if (shouldTailMerge()) {
> +    Builder.write(Buf);
> +    return;
> +  }
> +
> +  // Builder is not used for probabilistic string merging.
> +  for (MergeInputSection<ELFT> *Sec : Sections) {
> +    for (size_t I = 0, E = Sec->Pieces.size(); I != E; ++I) {
> +      SectionPiece &Piece = Sec->Pieces[I];
> +      if (!Piece.Live || !Piece.First)
> +        continue;
> +      StringRef S = Sec->getData(I).val();
> +      memcpy(Buf + Piece.OutputOff, S.data(), S.size());
> +    }
> +  }
>  }
>  
>  template <class ELFT>
> @@ -524,11 +541,103 @@
>    this->Size = Builder.getSize();
>  }
>  
> +static size_t align2(size_t Val, size_t Alignment) {
> +  return (Val + Alignment - 1) & ~(Alignment - 1);
> +}
> +
> +// Call Fn for each section piece.
> +template <class ELFT>
> +void MergeOutputSection<ELFT>::forEachPiece(
> +    ArrayRef<MergeInputSection<ELFT> *> Sections,
> +    std::function<void(SectionPiece &Piece, CachedHashStringRef S)> Fn) {
> +  for (MergeInputSection<ELFT> *Sec : Sections)
> +    for (size_t I = 0, E = Sec->Pieces.size(); I != E; ++I)
> +      if (Sec->Pieces[I].Live)
> +        Fn(Sec->Pieces[I], Sec->getData(I));
> +}
> +
> +// Split a vector Vec into smaller vectors.
> +template <class T>
> +static std::vector<std::vector<T>> split(std::vector<T> Vec, size_t NumShards) {
> +  std::vector<std::vector<T>> Ret(NumShards);
> +  size_t I = 0;
> +  for (T &Elem : Vec)
> +    Ret[I++ % NumShards].push_back(Elem);
> +  return Ret;
> +}
> +
>  template <class ELFT> void MergeOutputSection<ELFT>::finalize() {
> -  if (shouldTailMerge())
> +  if (shouldTailMerge()) {
>      finalizeTailMerge();
> -  else
> -    finalizeNoTailMerge();
> +    return;
> +  }
> +
> +  // This implements a probabilistic string merging algorithm.
> +  //
> +  // In this function, we merge identical strings and assign contiguous
> +  // offsets to unique strings to create a string table. This is one of the
> +  // most time-consuming passes in LLD because of the sheer number of strings.
> +  // When we link large programs such as Clang with debug info, we need to
> +  // merge thousands of sections containing millions of string pieces.
> +  // On my Xeon 2.8 GHz machine, merging 30 million strings (average length
> +  // is about 50 bytes) using a single-threaded hash table takes about 2
> +  // seconds.
> +  //
> +  // The probabilistic algorithm improves the latency to 300 milliseconds at
> +  // the cost of some output size. In other words, this algorithm is faster
> +  // but produces a larger string table. If you need to optimize for size,
> +  // you should pass -O2 to LLD.
> +  //
> +  // Here's how it works. In the first step, we take 10% of the input string
> +  // set to create a small string table. The resulting string table is very
> +  // unlikely to contain all strings of the entire set, but it is likely to
> +  // contain most of the duplicated strings, because duplicated strings are
> +  // repeated many times.
> +  //
> +  // The second step processes the remaining 90% in parallel. In this step,
> +  // we do not merge strings, so if a string is not in the small string
> +  // table we created in the first step, it is simply appended to the end
> +  // of the string table. This step completes the string table.
> +
> +  DenseMap<CachedHashStringRef, size_t> OffsetMap;
> +  std::atomic<size_t> Offset;
> +  Offset.store(0);
> +
> +  size_t NumShards = 100;
> +  std::vector<std::vector<MergeInputSection<ELFT> *>> Shards =
> +      split(Sections, NumShards);
> +
> +  // Step 1: construct a small string table
> +  for (size_t I = 0; I < NumShards / 10; ++I) {
> +    forEachPiece(Shards[I], [&](SectionPiece &Piece, CachedHashStringRef S) {
> +      auto P = OffsetMap.insert({S, Offset.load()});
> +      if (P.second) {
> +        Piece.First = true;
> +        Piece.OutputOff = Offset.load();
> +        Offset.store(align2(Offset.load() + S.size(), StringAlignment));
> +      } else {
> +        Piece.OutputOff = P.first->second;
> +      }
> +    });
> +  }
> +
> +  // Step 2: append remaining strings
> +  parallel_for_each(
> +      Shards.begin() + NumShards / 10, Shards.end(),
> +      [&](ArrayRef<MergeInputSection<ELFT> *> Sections) {
> +        forEachPiece(Sections, [&](SectionPiece &Piece, CachedHashStringRef S) {
> +          auto It = OffsetMap.find(S);
> +          if (It == OffsetMap.end()) {
> +            Piece.First = true;
> +            size_t Size = align2(S.size(), StringAlignment);
> +            Piece.OutputOff = Offset.fetch_add(Size);
> +          } else {
> +            Piece.OutputOff = It->second;
> +          }
> +        });
> +      });
> +
> +  this->Size = Offset.load();
>  }
>  
>  template <class ELFT>
> Index: ELF/InputSection.h
> ===================================================================
> --- ELF/InputSection.h
> +++ ELF/InputSection.h
> @@ -160,11 +160,13 @@
>  // be found by looking at the next one) and put the hash in a side table.
>  struct SectionPiece {
>    SectionPiece(size_t Off, bool Live = false)
> -      : InputOff(Off), OutputOff(-1), Live(Live || !Config->GcSections) {}
> +      : InputOff(Off), Live(Live || !Config->GcSections), OutputOff(-1),
> +        First(false) {}
>  
> -  size_t InputOff;
> -  ssize_t OutputOff : 8 * sizeof(ssize_t) - 1;
> +  size_t InputOff : 8 * sizeof(size_t) - 1;
>    size_t Live : 1;
> +  ssize_t OutputOff : 8 * sizeof(ssize_t) - 1;
> +  size_t First : 1;
>  };
>  static_assert(sizeof(SectionPiece) == 2 * sizeof(size_t),
>                "SectionPiece is too big");

