[lld] r288373 - Parallelize ICF to make LLD's ICF really fast.

Wed Dec 7 10:32:25 PST 2016

Thank you for testing this. This algorithm is lock-free (and even
wait-free), so it should scale well.

On Wed, Dec 7, 2016 at 10:22 AM, Rafael Avila de Espindola <
rafael.espindola at gmail.com> wrote:

>
> Impressive.
>
> In single-thread mode the link of chromium goes from 4.140444332 to
> 4.244436379 seconds.
>
> With multiple threads and 2 cores available, the difference is already
> from 4.161336277 to 3.340861790.
>
> With 4 cores it goes from 4.092804448 to 2.921354616 and with 8 from
> 4.047472207 to 2.754971966.
>
> Cheers,
> Rafael
>
>
> Rui Ueyama via llvm-commits <llvm-commits at lists.llvm.org> writes:
>
> > Author: ruiu
> > Date: Thu Dec  1 11:09:04 2016
> > New Revision: 288373
> >
> > URL: http://llvm.org/viewvc/llvm-project?rev=288373&view=rev
> > Log:
> > Parallelize ICF to make LLD's ICF really fast.
> >
> > ICF is short for Identical Code Folding. It is a size optimization that
> > identifies two or more functions that happen to have the same contents
> > and merges them. It usually reduces output size by a few percent.
> >
> > ICF is slow because it is a computationally intensive process. I tried
> > to parallelize it before but failed because I couldn't make a
> > parallelized version produce consistent outputs. Although it didn't
> > create broken executables, every invocation of the linker generated
> > slightly different output, and I couldn't figure out why.
> >
> > I think I now understand what was going on, and I also came up with a
> > simple algorithm to fix it. This patch implements that algorithm.
> >
> > The result is very exciting. Chromium, for example, has 780,662 input
> > sections, of which 20,774 are reducible by ICF. LLD previously took
> > 7.980 seconds for ICF. Now it finishes in 1.065 seconds.
> >
> > As a result, LLD can now link a Chromium binary (output size 1.59 GB)
> > in 10.28 seconds on my machine with ICF enabled. Compared to gold
> > which takes 40.94 seconds to do the same thing, this is an amazing
> > number.
> >
> > From here, I'll describe what we are doing for ICF, what the previous
> > problem was, and what I did in this patch.
> >
> > In ICF, two sections are considered identical if they have the same
> > section flags, section data, and relocations. Relocations are tricky,
> > because two relocations are considered the same if they have the same
> > relocation type, values, and if they point to the same section _in
> > terms of ICF_.
> >
> > Here is an example. If foo and bar defined below are compiled to the
> > same machine instructions, ICF can (and should) merge the two,
> > although their relocations point to each other.
> >
> >   void foo() { bar(); }
> >   void bar() { foo(); }
> >
> > This is not an easy problem to solve.
> >
> > What we are doing in LLD is a kind of coloring algorithm. We repeatedly
> > assign different colors to non-identical sections, and sections that
> > have the same color when the algorithm terminates are considered
> > identical. Here are the details:
> >
> >   1. First, we color all sections using hash values of their section
> >   types, section contents, and numbers of relocations. At this moment,
> >   relocation targets are not taken into account. We just give
> >   different colors to sections that obviously differ.
> >
> >   2. Next, for each color C, we visit sections having color C to see
> >   if their relocations are the same. Relocations are considered equal
> >   if their targets have the same color. We then recolor sections that
> >   have different relocation targets in new colors.
> >
> >   3. If we recolor some section in step 2, relocations that were
> >   previously pointing to the same color targets may now be pointing to
> >   different colors. Therefore, repeat 2 until a convergence is
> >   obtained.
> >
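The three steps above can be sketched on a toy model. This is only an illustration under stated assumptions, not the code in ICF.cpp: `ToySection` and `colorSections` are hypothetical names, relocation targets are plain indices, and a global signature map stands in for the per-color splitting that the patch does with std::stable_partition. It runs sequentially.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical toy section: Hash stands in for the hash of flags, contents,
// and relocation count; Targets holds relocation targets as section indices.
struct ToySection {
  uint32_t Hash;
  std::vector<size_t> Targets;
};

// Returns one color per section. Sections with equal colors are considered
// identical in this toy model.
std::vector<uint32_t> colorSections(const std::vector<ToySection> &Secs) {
  // Step 1: initial colors come from the hash alone.
  std::vector<uint32_t> Color(Secs.size());
  std::map<uint32_t, uint32_t> HashToColor;
  for (size_t I = 0; I < Secs.size(); ++I) {
    uint32_t Fresh = static_cast<uint32_t>(HashToColor.size());
    Color[I] = HashToColor.emplace(Secs[I].Hash, Fresh).first->second;
  }
  size_t NumColors = HashToColor.size();

  for (;;) {
    // Step 2: a section's signature is its current color plus the current
    // colors of its relocation targets. Equal signatures share a new color.
    using Signature = std::pair<uint32_t, std::vector<uint32_t>>;
    std::map<Signature, uint32_t> SigToColor;
    std::vector<uint32_t> Next(Secs.size());
    for (size_t I = 0; I < Secs.size(); ++I) {
      Signature Sig;
      Sig.first = Color[I];
      for (size_t T : Secs[I].Targets) {
        assert(T < Secs.size() && "relocation target out of range");
        Sig.second.push_back(Color[T]);
      }
      uint32_t Fresh = static_cast<uint32_t>(SigToColor.size());
      Next[I] = SigToColor.emplace(std::move(Sig), Fresh).first->second;
    }
    // Step 3: each round can only split colors, never merge them, so an
    // unchanged color count means no split happened and we have converged.
    bool Converged = SigToColor.size() == NumColors;
    NumColors = SigToColor.size();
    Color = std::move(Next);
    if (Converged)
      return Color;
  }
}
```

With foo and bar from the example modeled as two sections that share a hash and point at each other, both keep the same color, while a third section with the same hash that points at a differently colored target is split off in the first round.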
> > Step 2 is a heavy operation. For Chromium, the first iteration of step
> > 2 takes 2.882 seconds, and the second iteration takes 1.038 seconds,
> > and in total it needs 23 iterations.
> >
> > Parallelizing step 1 is easy because we can color each section
> > independently. This patch does that.
> >
> > Parallelizing step 2 is tricky. We could work on each color
> > independently, but we cannot recolor sections in place, because that
> > would break the invariant that two possibly-identical sections must
> > have the same color at any moment.
> >
> > Consider sections S1, S2, S3, S4 in the same color C, where S1 and S2
> > are identical, S3 and S4 are identical, but S2 and S3 are not. Thread
> > A is about to recolor S1 and S2 to C'. After thread A recolors S1 to
> > C', but before it recolors S2 to C', another thread B might observe S1
> > and S2. Thread B will then conclude that S1 and S2 are different, and
> > it will wrongly split its own sections into smaller groups.
> > Over-splitting doesn't produce broken results, but it loses a chance to
> > merge some identical sections. That was the cause of the nondeterminism.
> >
> > To fix the problem, I made sections have two colors, namely a current
> > color and a next color. At the beginning of each iteration, both colors
> > are the same. Each thread reads from the current color and writes to the
> > next color. In this way, we can prevent threads from reading partial
> > results. After each iteration, we flip current and next.
> >
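A minimal sketch of the two-color idea, assuming a toy section type. `ToySection`, `sameTargets`, `recolor`, and `publish` are hypothetical names chosen for illustration; only the `Cnt % 2` and `(Cnt + 1) % 2` slot indexing is taken from the patch. The sketch is sequential, but the point is that comparisons touch only the current slot and recoloring touches only the next slot, so a reader never sees a half-finished recoloring.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy section with two color slots and relocation targets
// given as indices into the section vector.
struct ToySection {
  uint32_t Color[2];
  std::vector<size_t> Targets;
};

// Compares the relocation targets of sections A and B during round Cnt,
// reading only the slot that no one writes in that round.
bool sameTargets(const std::vector<ToySection> &Secs, size_t A, size_t B,
                 int Cnt) {
  int Cur = Cnt % 2;
  const std::vector<size_t> &TA = Secs[A].Targets;
  const std::vector<size_t> &TB = Secs[B].Targets;
  if (TA.size() != TB.size())
    return false;
  for (size_t I = 0; I < TA.size(); ++I)
    if (Secs[TA[I]].Color[Cur] != Secs[TB[I]].Color[Cur])
      return false;
  return true;
}

// Recolors sections [Begin, End) during round Cnt by writing only to the
// "next" slot, which no comparison reads in this round.
void recolor(std::vector<ToySection> &Secs, size_t Begin, size_t End,
             uint32_t NewColor, int Cnt) {
  assert(Begin <= End && End <= Secs.size());
  int Next = (Cnt + 1) % 2;
  for (size_t I = Begin; I < End; ++I)
    Secs[I].Color[Next] = NewColor;
}

// After round Cnt has finished, copies the new colors into the old slot so
// that both slots agree again before the roles flip for round Cnt + 1.
void publish(std::vector<ToySection> &Secs, size_t Begin, size_t End,
             int Cnt) {
  assert(Begin <= End && End <= Secs.size());
  for (size_t I = Begin; I < End; ++I)
    Secs[I].Color[Cnt % 2] = Secs[I].Color[(Cnt + 1) % 2];
}
```

In round 0, a call to `recolor` leaves every result of `sameTargets` for that round unchanged. The new color becomes visible only in round 1, after `publish` has run.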
> > This is a very simple solution and is implemented in less than 50
> > lines of code.
> >
> > I tested this patch with Chromium and confirmed that this parallelized
> > ICF produces output identical to that of the non-parallelized one.
> >
> > Differential Revision: https://reviews.llvm.org/D27247
> >
> > Modified:
> >     lld/trunk/ELF/ICF.cpp
> >     lld/trunk/ELF/InputSection.h
> >
> > Modified: lld/trunk/ELF/ICF.cpp
> > URL: http://llvm.org/viewvc/llvm-project/lld/trunk/ELF/ICF.cpp?rev=288373&r1=288372&r2=288373&view=diff
> > ==============================================================================
> > --- lld/trunk/ELF/ICF.cpp (original)
> > +++ lld/trunk/ELF/ICF.cpp Thu Dec  1 11:09:04 2016
> > @@ -59,10 +59,12 @@
> >  #include "Config.h"
> >  #include "SymbolTable.h"
> >
> > +#include "lld/Core/Parallel.h"
> >  #include "llvm/ADT/Hashing.h"
> >  #include "llvm/Object/ELF.h"
> >  #include "llvm/Support/ELF.h"
> >  #include <algorithm>
> > +#include <mutex>
> >
> >  using namespace lld;
> >  using namespace lld::elf;
> > @@ -95,16 +97,16 @@ private:
> >
> >    std::vector<InputSection<ELFT> *> Sections;
> >    std::vector<Range> Ranges;
> > +  std::mutex Mu;
> >
> > -  // The main loop is repeated until we get a convergence.
> > -  bool Repeat = false; // If Repeat is true, we need to repeat.
> > -  int Cnt = 0;         // Counter for the main loop.
> > +  uint32_t NextId = 1;
> > +  int Cnt = 0;
> >  };
> >  }
> >
> >  // Returns a hash value for S. Note that the information about
> >  // relocation targets is not included in the hash value.
> > -template <class ELFT> static uint64_t getHash(InputSection<ELFT> *S) {
> > +template <class ELFT> static uint32_t getHash(InputSection<ELFT> *S) {
> >    return hash_combine(S->Flags, S->getSize(), S->NumRelocations);
> >  }
> >
> > @@ -128,33 +130,54 @@ template <class ELFT> void ICF<ELFT>::se
> >    // issue in practice because the number of the distinct sections in
> >    // [R.Begin, R.End] is usually very small.
> >    while (R->End - R->Begin > 1) {
> > +    size_t Begin = R->Begin;
> > +    size_t End = R->End;
> > +
> >      // Divide range R into two. Let Mid be the start index of the
> >      // second group.
> >      auto Bound = std::stable_partition(
> > -        Sections.begin() + R->Begin + 1, Sections.begin() + R->End,
> > +        Sections.begin() + Begin + 1, Sections.begin() + End,
> >          [&](InputSection<ELFT> *S) {
> >            if (Constant)
> > -            return equalsConstant(Sections[R->Begin], S);
> > -          return equalsVariable(Sections[R->Begin], S);
> > +            return equalsConstant(Sections[Begin], S);
> > +          return equalsVariable(Sections[Begin], S);
> >          });
> >      size_t Mid = Bound - Sections.begin();
> >
> > -    if (Mid == R->End)
> > +    if (Mid == End)
> >        return;
> >
> > -    // Now we split [R.Begin, R.End) into [R.Begin, Mid) and [Mid, R.End).
> > -    if (Mid - R->Begin > 1)
> > -      Ranges.push_back({R->Begin, Mid});
> > -    R->Begin = Mid;
> > -
> > -    // Update GroupIds for the new group members. We use the index of
> > -    // the group first member as a group ID because that is unique.
> > -    for (size_t I = Mid; I < R->End; ++I)
> > -      Sections[I]->GroupId = Mid;
> > -
> > -    // Since we have split a group, we need to repeat the main loop
> > -    // later to obtain a convergence. Remember that.
> > -    Repeat = true;
> > +    // Now we split [Begin, End) into [Begin, Mid) and [Mid, End).
> > +    uint32_t Id;
> > +    Range *NewRange;
> > +    {
> > +      std::lock_guard<std::mutex> Lock(Mu);
> > +      Ranges.push_back({Mid, End});
> > +      NewRange = &Ranges.back();
> > +      Id = NextId++;
> > +    }
> > +    R->End = Mid;
> > +
> > +    // Update GroupIds for the new group members.
> > +    //
> > +    // Note on GroupId[0] and GroupId[1]: we have two storages for
> > +    // group IDs. At the beginning of each iteration of the main loop,
> > +    // both have the same ID. GroupId[0] contains the current ID, and
> > +    // GroupId[1] contains the next ID which will be used in the next
> > +    // iteration.
> > +    //
> > +    // Recall that other threads may be working on other ranges. They
> > +    // may be reading group IDs that we are about to update. We cannot
> > +    // update group IDs in place because it breaks the invariance that
> > +    // all sections in the same group must have the same ID. In other
> > +    // words, the following for loop is not an atomic operation, and
> > +    // that is observable from other threads.
> > +    //
> > +    // By writing new IDs to write-only places, we can keep the invariance.
> > +    for (size_t I = Mid; I < End; ++I)
> > +      Sections[I]->GroupId[(Cnt + 1) % 2] = Id;
> > +
> > +    R = NewRange;
> >    }
> >  }
> >
> > @@ -211,7 +234,16 @@ bool ICF<ELFT>::variableEq(const InputSe
> >      auto *Y = dyn_cast<InputSection<ELFT>>(DB->Section);
> >      if (!X || !Y)
> >        return false;
> > -    return X->GroupId != 0 && X->GroupId == Y->GroupId;
> > +    if (X->GroupId[Cnt % 2] == 0)
> > +      return false;
> > +
> > +    // Performance hack for single-thread. If no other threads are
> > +    // running, we can safely read next GroupIDs as there is no race
> > +    // condition. This optimization may reduce the number of
> > +    // iterations of the main loop because we can see results of the
> > +    // same iteration.
> > +    size_t Idx = (Config->Threads ? Cnt : Cnt + 1) % 2;
> > +    return X->GroupId[Idx] == Y->GroupId[Idx];
> >    };
> >
> >    return std::equal(RelsA.begin(), RelsA.end(), RelsB.begin(), Eq);
> > @@ -226,6 +258,14 @@ bool ICF<ELFT>::equalsVariable(const Inp
> >    return variableEq(A, A->rels(), B, B->rels());
> >  }
> >
> > +template <class IterTy, class FuncTy>
> > +static void foreach(IterTy Begin, IterTy End, FuncTy Fn) {
> > +  if (Config->Threads)
> > +    parallel_for_each(Begin, End, Fn);
> > +  else
> > +    std::for_each(Begin, End, Fn);
> > +}
> > +
> >  // The main function of ICF.
> >  template <class ELFT> void ICF<ELFT>::run() {
> >    // Collect sections to merge.
> > @@ -239,14 +279,14 @@ template <class ELFT> void ICF<ELFT>::ru
> >    // guaranteed) to have the same static contents in terms of ICF.
> >    for (InputSection<ELFT> *S : Sections)
> >      // Set MSB to 1 to avoid collisions with non-hash IDs.
> > -    S->GroupId = getHash(S) | (uint64_t(1) << 63);
> > +    S->GroupId[0] = S->GroupId[1] = getHash(S) | (1 << 31);
> >
> >    // From now on, sections in Sections are ordered so that sections in
> >    // the same group are consecutive in the vector.
> >    std::stable_sort(Sections.begin(), Sections.end(),
> >                     [](InputSection<ELFT> *A, InputSection<ELFT> *B) {
> > -                     if (A->GroupId != B->GroupId)
> > -                       return A->GroupId < B->GroupId;
> > +                     if (A->GroupId[0] != B->GroupId[0])
> > +                       return A->GroupId[0] < B->GroupId[0];
> >                       // Within a group, put the highest alignment
> >                       // requirement first, so that's the one we'll keep.
> >                       return B->Alignment < A->Alignment;
> > @@ -260,25 +300,37 @@ template <class ELFT> void ICF<ELFT>::ru
> >    for (size_t I = 0, E = Sections.size(); I < E - 1;) {
> >      // Let J be the first index whose element has a different ID.
> >      size_t J = I + 1;
> > -    while (J < E && Sections[I]->GroupId == Sections[J]->GroupId)
> > +    while (J < E && Sections[I]->GroupId[0] == Sections[J]->GroupId[0])
> >        ++J;
> >      if (J - I > 1)
> >        Ranges.push_back({I, J});
> >      I = J;
> >    }
> >
> > +  // This function copies new GroupIds from former write-only space to
> > +  // former read-only space, so that we can flip GroupId[0] and GroupId[1].
> > +  // Note that new GroupIds are always be added to end of Ranges.
> > +  auto Copy = [&](Range &R) {
> > +    for (size_t I = R.Begin; I < R.End; ++I)
> > +      Sections[I]->GroupId[Cnt % 2] = Sections[I]->GroupId[(Cnt + 1) % 2];
> > +  };
> > +
> >    // Compare static contents and assign unique IDs for each static content.
> > -  std::for_each(Ranges.begin(), Ranges.end(),
> > -                [&](Range &R) { segregate(&R, true); });
> > +  auto End = Ranges.end();
> > +  foreach(Ranges.begin(), End, [&](Range &R) { segregate(&R, true); });
> > +  foreach(End, Ranges.end(), Copy);
> >    ++Cnt;
> >
> >    // Split groups by comparing relocations until convergence is obtained.
> > -  do {
> > -    Repeat = false;
> > -    std::for_each(Ranges.begin(), Ranges.end(),
> > -                  [&](Range &R) { segregate(&R, false); });
> > +  for (;;) {
> > +    auto End = Ranges.end();
> > +    foreach(Ranges.begin(), End, [&](Range &R) { segregate(&R, false); });
> > +    foreach(End, Ranges.end(), Copy);
> >      ++Cnt;
> > -  } while (Repeat);
> > +
> > +    if (End == Ranges.end())
> > +      break;
> > +  }
> >
> >    log("ICF needed " + Twine(Cnt) + " iterations");
> >
> >
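The splitting loop in segregate() uses the first element of a range as a pivot and moves everything equal to it to the front with std::stable_partition. Here is a minimal standalone sketch of that step, assuming a plain vector and a caller-supplied equality function. `splitRange` is a hypothetical helper. Unlike the patch, it returns every resulting range (singletons included), takes no lock, and assigns no IDs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Reorders Items[Begin, End) so that elements equal under Eq become
// adjacent, and returns the half-open index range of each run. Each pass
// compares against the first element of the not-yet-split remainder.
template <class T, class EqFn>
std::vector<std::pair<size_t, size_t>>
splitRange(std::vector<T> &Items, size_t Begin, size_t End, EqFn Eq) {
  assert(Begin <= End && End <= Items.size());
  std::vector<std::pair<size_t, size_t>> Out;
  while (Begin < End) {
    // Items[Begin] is outside the partitioned range, so it stays in place
    // while elements equal to it are moved right behind it, in order.
    auto Bound = std::stable_partition(
        Items.begin() + Begin + 1, Items.begin() + End,
        [&](const T &X) { return Eq(Items[Begin], X); });
    size_t Mid = Bound - Items.begin();
    Out.push_back({Begin, Mid});
    Begin = Mid; // Continue with whatever did not match the pivot.
  }
  return Out;
}
```

For example, splitting {3, 1, 3, 2, 1} with integer equality reorders it to {3, 3, 1, 1, 2} and yields the ranges [0, 2), [2, 4), and [4, 5). Because the partition is stable, the relative order inside each run is preserved, which is what lets the earlier alignment-based sort order survive the split.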
> > Modified: lld/trunk/ELF/InputSection.h
> > URL: http://llvm.org/viewvc/llvm-project/lld/trunk/ELF/InputSection.h?rev=288373&r1=288372&r2=288373&view=diff
> > ==============================================================================
> > --- lld/trunk/ELF/InputSection.h (original)
> > +++ lld/trunk/ELF/InputSection.h Thu Dec  1 11:09:04 2016
> > @@ -289,7 +289,7 @@ public:
> >    void relocateNonAlloc(uint8_t *Buf, llvm::ArrayRef<RelTy> Rels);
> >
> >    // Used by ICF.
> > -  uint64_t GroupId = 0;
> > +  uint32_t GroupId[2] = {0, 0};
> >
> >    // Called by ICF to merge two input sections.
> >    void replace(InputSection<ELFT> *Other);
> >
> >
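The hunk in run() that sets the MSB keeps hash-derived IDs apart from the IDs handed out by the NextId counter. A minimal sketch of that scheme follows, with hypothetical helper names (`tagHashId`, `isHashId`, `allocateId`). It assumes that fewer than 2^31 counter IDs are ever allocated. It also uses an unsigned operand for the shift, so the shift is done in unsigned arithmetic and never touches the sign bit of an int.

```cpp
#include <cassert>
#include <cstdint>

// Tags a hash-derived ID by forcing its top bit to 1.
constexpr uint32_t tagHashId(uint32_t Hash) {
  return Hash | (uint32_t(1) << 31);
}

// Returns true if Id has the top bit set, i.e. it looks like a tagged hash
// rather than a counter value.
constexpr bool isHashId(uint32_t Id) { return (Id >> 31) != 0; }

// Hands out sequential IDs from the caller's counter. They stay distinct
// from tagged hash IDs only while they fit in the low 31 bits.
inline uint32_t allocateId(uint32_t &NextId) {
  assert(NextId < (uint32_t(1) << 31) && "counter IDs must keep the top bit clear");
  return NextId++;
}
```

Starting the counter at 1, as the patch does, also keeps 0 free, which variableEq() treats as "no group".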