<div dir="ltr">Thank you for testing this. This algorithm is lock-free (and even wait-free), so it should scale well.</div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Dec 7, 2016 at 10:22 AM, Rafael Avila de Espindola <span dir="ltr"><<a href="mailto:rafael.espindola@gmail.com" target="_blank">rafael.espindola@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>

Impressive.<br>

<br>

In single-thread mode the link of chromium goes from 4.140444332 to<br>

4.244436379 seconds.<br>

<br>

With multiple threads and 2 cores available, the time already drops<br>

from 4.161336277 to 3.340861790.<br>

<br>

With 4 cores it goes from 4.092804448 to 2.921354616 and with 8 from<br>

4.047472207 to 2.754971966.<br>

<br>

Cheers,<br>

Rafael<br>

<div class="HOEnZb"><div class="h5"><br>

<br>

Rui Ueyama via llvm-commits <<a href="mailto:llvm-commits@lists.llvm.org">llvm-commits@lists.llvm.org</a>> writes:<br>

<br>

> Author: ruiu<br>

> Date: Thu Dec  1 11:09:04 2016<br>

> New Revision: 288373<br>

><br>

> URL: <a href="http://llvm.org/viewvc/llvm-project?rev=288373&view=rev" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-project?rev=288373&view=rev</a><br>

> Log:<br>

> Parallelize ICF to make LLD's ICF really fast.<br>

><br>

> ICF is short for Identical Code Folding. It is a size optimization to<br>

> identify two or more functions that happen to have the same contents<br>

> and merge them. It usually reduces output size by a few percent.<br>

><br>

> ICF is slow because it is a computationally intensive process. I tried<br>

> to parallelize it before but failed because I couldn't make a<br>

> parallelized version produce consistent outputs. Although it didn't<br>

> create broken executables, every invocation of the linker generated<br>

> slightly different output, and I couldn't figure out why.<br>

><br>

> I think I now understand what was going on, and also came up with a<br>

> simple algorithm to fix it. This patch is the result.<br>

><br>

> The result is very exciting. Chromium for example has 780,662 input<br>

> sections, of which 20,774 are reducible by ICF. LLD previously took<br>

> 7.980 seconds for ICF. Now it finishes in 1.065 seconds.<br>

><br>

> As a result, LLD can now link a Chromium binary (output size 1.59 GB)<br>

> in 10.28 seconds on my machine with ICF enabled. Compared to gold<br>

> which takes 40.94 seconds to do the same thing, this is an amazing<br>

> number.<br>

><br>

> From here, I'll describe what we are doing for ICF, what the<br>

> previous problem was, and what I did in this patch.<br>

><br>

> In ICF, two sections are considered identical if they have the same<br>

> section flags, section data, and relocations. Relocations are tricky,<br>

> because two relocations are considered the same if they have the same<br>

> relocation type, values, and if they point to the same section _in<br>

> terms of ICF_.<br>

><br>

> Here is an example. If foo and bar defined below are compiled to the<br>

> same machine instructions, ICF can (and should) merge the two,<br>

> although their relocations point to each other.<br>

><br>

>   void foo() { bar(); }<br>

>   void bar() { foo(); }<br>

><br>

> This is not an easy problem to solve.<br>

><br>

> What we are doing in LLD is some sort of coloring algorithm. We color<br>

> non-identical sections using different colors repeatedly, and sections<br>

> in the same color when the algorithm terminates are considered<br>

> identical. Here are the details:<br>

><br>

>   1. First, we color all sections using hash values of their section<br>

>   types, section contents, and numbers of relocations. At this moment,<br>

>   relocation targets are not taken into account. We just color<br>

>   sections that apparently differ in different colors.<br>

><br>

>   2. Next, for each color C, we visit sections having color C to see<br>

>   if their relocations are the same. Relocations are considered equal<br>

>   if their targets have the same color. We then recolor sections that<br>

>   have different relocation targets in new colors.<br>

><br>

>   3. If we recolor some section in step 2, relocations that were<br>

>   previously pointing to the same color targets may now be pointing to<br>

>   different colors. Therefore, we repeat step 2 until convergence is<br>

>   reached.<br>

><br>
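Editorial sketch (not code from the patch): the three steps above can be read as a partition refinement. The following minimal, single-threaded C++ illustration assumes a hypothetical Section struct and a hypothetical colorSections helper, and it treats a precomputed StaticHash as a stand-in for comparing flags, contents, and relocation counts. LLD's real implementation works on sorted ranges of InputSection objects instead.<br>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for an input section.
struct Section {
  uint32_t StaticHash;         // assumed hash of flags, contents, relocation count
  std::vector<size_t> Targets; // indices of the sections its relocations point to
};

// Returns one color per section; equal colors mean "considered identical".
std::vector<uint32_t> colorSections(const std::vector<Section> &Secs) {
  // Step 1: color by static hash only, ignoring relocation targets.
  std::vector<uint32_t> Color(Secs.size());
  std::map<uint32_t, uint32_t> Initial;
  for (size_t I = 0; I < Secs.size(); ++I) {
    uint32_t Fresh = static_cast<uint32_t>(Initial.size());
    Color[I] = Initial.emplace(Secs[I].StaticHash, Fresh).first->second;
  }
  size_t NumColors = Initial.size();

  // Steps 2 and 3: split colors by target colors until nothing changes.
  for (;;) {
    std::map<std::pair<uint32_t, std::vector<uint32_t>>, uint32_t> Split;
    std::vector<uint32_t> NewColor(Secs.size());
    for (size_t I = 0; I < Secs.size(); ++I) {
      std::vector<uint32_t> TargetColors;
      for (size_t T : Secs[I].Targets) {
        assert(T < Secs.size());
        TargetColors.push_back(Color[T]);
      }
      uint32_t Fresh = static_cast<uint32_t>(Split.size());
      auto Key = std::make_pair(Color[I], std::move(TargetColors));
      NewColor[I] = Split.emplace(std::move(Key), Fresh).first->second;
    }
    Color = std::move(NewColor);
    // Colors only ever split, so an unchanged count means convergence.
    if (Split.size() == NumColors)
      return Color;
    NumColors = Split.size();
  }
}
```

With the foo/bar pair from the example above, both sections keep the same color, because in every round their relocation targets share a color.<br>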

> Step 2 is a heavy operation. For Chromium, the first iteration of step<br>

> 2 takes 2.882 seconds, and the second iteration takes 1.038 seconds,<br>

> and in total it needs 23 iterations.<br>

><br>

> Parallelizing step 1 is easy because we can color each section<br>

> independently. This patch does that.<br>

><br>

> Parallelizing step 2 is tricky. We could work on each color<br>

> independently, but we cannot recolor sections in place, because that<br>

> would break the invariant that two possibly-identical sections must<br>

> have the same color at any moment.<br>

><br>

> Consider sections S1, S2, S3, S4 in the same color C, where S1 and S2<br>

> are identical, S3 and S4 are identical, but S2 and S3 are not. Thread<br>

> A is about to recolor S1 and S2 in C'. After thread A recolors S1 in<br>

> C', but before it recolors S2 in C', another thread B might observe S1 and<br>

> S2. Then thread B will conclude that S1 and S2 are different, and it<br>

> will wrongly split thread B's sections into smaller groups. Over-splitting<br>

> doesn't produce broken results, but it loses a chance to<br>

> merge some identical sections. That was the cause of the nondeterminism.<br>

><br>

> To fix the problem, I made sections have two colors, namely current<br>

> color and next color. At the beginning of each iteration, both colors<br>

> are the same. Each thread reads from current color and writes to next<br>

> color. In this way, we can prevent threads from reading partial<br>

> results. After each iteration, we flip current and next.<br>

><br>

> This is a very simple solution and is implemented in less than 50<br>

> lines of code.<br>

><br>
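Editorial sketch (not code from the patch): the current/next idea can be illustrated with two color buffers that swap roles each round. The Section struct, sameSignature, and colorParallel below are hypothetical names, StaticHash is an assumed stand-in for the static comparison, and the quadratic search for a representative is chosen for brevity rather than speed. Each thread reads only the current buffer and writes only its own slots in the next buffer, so no thread can see a half-updated group. Unlike the patch, this sketch rewrites every next color in each round, so the two buffers need not start the round equal.<br>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical, simplified stand-in for an input section.
struct Section {
  uint32_t StaticHash;         // assumed hash of flags, contents, relocation count
  std::vector<size_t> Targets; // indices of the sections its relocations point to
};

// True if A and B have the same color and same-colored targets under Cur.
static bool sameSignature(const std::vector<Section> &Secs,
                          const std::vector<uint32_t> &Cur, size_t A, size_t B) {
  if (Cur[A] != Cur[B] || Secs[A].Targets.size() != Secs[B].Targets.size())
    return false;
  for (size_t K = 0; K < Secs[A].Targets.size(); ++K)
    if (Cur[Secs[A].Targets[K]] != Cur[Secs[B].Targets[K]])
      return false;
  return true;
}

// A color is the index of the first section in its group.
std::vector<uint32_t> colorParallel(const std::vector<Section> &Secs,
                                    unsigned NumThreads) {
  size_t N = Secs.size();
  NumThreads = std::max(1u, NumThreads);
  std::vector<uint32_t> Color[2] = {std::vector<uint32_t>(N),
                                    std::vector<uint32_t>(N)};
  for (size_t I = 0; I < N; ++I) {
    for (size_t T : Secs[I].Targets)
      assert(T < N);
    size_t J = 0;
    while (Secs[J].StaticHash != Secs[I].StaticHash)
      ++J; // stops at J == I at the latest
    Color[0][I] = Color[1][I] = static_cast<uint32_t>(J);
  }
  for (unsigned Cnt = 0;; ++Cnt) {
    const std::vector<uint32_t> &Cur = Color[Cnt % 2]; // read-only this round
    std::vector<uint32_t> &Next = Color[(Cnt + 1) % 2]; // write-only this round
    auto Work = [&](size_t Begin, size_t End) {
      for (size_t I = Begin; I < End; ++I) {
        size_t J = 0;
        while (!sameSignature(Secs, Cur, I, J))
          ++J; // stops at J == I at the latest
        Next[I] = static_cast<uint32_t>(J);
      }
    };
    std::vector<std::thread> Pool;
    size_t Chunk = (N + NumThreads - 1) / NumThreads;
    for (size_t B = 0; B < N; B += Chunk)
      Pool.emplace_back(Work, B, std::min(N, B + Chunk));
    for (std::thread &T : Pool)
      T.join();
    if (Next == Cur) // no group was split: converged
      return Next;
  }
}
```

Because every read goes to the buffer that nobody writes during the round, the result does not depend on thread scheduling, so calling colorParallel with one thread or with several yields the same colors.<br>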

> I tested this patch with Chromium and confirmed that this parallelized<br>

> ICF produces the same output as the non-parallelized one.<br>

><br>

> Differential Revision: <a href="https://reviews.llvm.org/D27247" rel="noreferrer" target="_blank">https://reviews.llvm.org/D27247</a><br>

><br>

> Modified:<br>

>     lld/trunk/ELF/ICF.cpp<br>

>     lld/trunk/ELF/InputSection.h<br>

><br>

> Modified: lld/trunk/ELF/ICF.cpp<br>

> URL: <a href="http://llvm.org/viewvc/llvm-project/lld/trunk/ELF/ICF.cpp?rev=288373&r1=288372&r2=288373&view=diff" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-project/lld/trunk/ELF/ICF.cpp?rev=288373&r1=288372&r2=288373&view=diff</a><br>

> ==============================================================================<br>

> --- lld/trunk/ELF/ICF.cpp (original)<br>

> +++ lld/trunk/ELF/ICF.cpp Thu Dec  1 11:09:04 2016<br>

> @@ -59,10 +59,12 @@<br>

>  #include "Config.h"<br>

>  #include "SymbolTable.h"<br>

><br>

> +#include "lld/Core/Parallel.h"<br>

>  #include "llvm/ADT/Hashing.h"<br>

>  #include "llvm/Object/ELF.h"<br>

>  #include "llvm/Support/ELF.h"<br>

>  #include <algorithm><br>

> +#include <mutex><br>

><br>

>  using namespace lld;<br>

>  using namespace lld::elf;<br>

> @@ -95,16 +97,16 @@ private:<br>

><br>

>    std::vector<InputSection<ELFT> *> Sections;<br>

>    std::vector<Range> Ranges;<br>

> +  std::mutex Mu;<br>

><br>

> -  // The main loop is repeated until we get a convergence.<br>

> -  bool Repeat = false; // If Repeat is true, we need to repeat.<br>

> -  int Cnt = 0;         // Counter for the main loop.<br>

> +  uint32_t NextId = 1;<br>

> +  int Cnt = 0;<br>

>  };<br>

>  }<br>

><br>

>  // Returns a hash value for S. Note that the information about<br>

>  // relocation targets is not included in the hash value.<br>

> -template <class ELFT> static uint64_t getHash(InputSection<ELFT> *S) {<br>

> +template <class ELFT> static uint32_t getHash(InputSection<ELFT> *S) {<br>

>    return hash_combine(S->Flags, S->getSize(), S->NumRelocations);<br>

>  }<br>

><br>

> @@ -128,33 +130,54 @@ template <class ELFT> void ICF<ELFT>::se<br>

>    // issue in practice because the number of the distinct sections in<br>

>    // [R.Begin, R.End] is usually very small.<br>

>    while (R->End - R->Begin > 1) {<br>

> +    size_t Begin = R->Begin;<br>

> +    size_t End = R->End;<br>

> +<br>

>      // Divide range R into two. Let Mid be the start index of the<br>

>      // second group.<br>

>      auto Bound = std::stable_partition(<br>

> -        Sections.begin() + R->Begin + 1, Sections.begin() + R->End,<br>

> +        Sections.begin() + Begin + 1, Sections.begin() + End,<br>

>          [&](InputSection<ELFT> *S) {<br>

>            if (Constant)<br>

> -            return equalsConstant(Sections[R->Begin], S);<br>

> -          return equalsVariable(Sections[R->Begin], S);<br>

> +            return equalsConstant(Sections[Begin], S);<br>

> +          return equalsVariable(Sections[Begin], S);<br>

>          });<br>

>      size_t Mid = Bound - Sections.begin();<br>

><br>

> -    if (Mid == R->End)<br>

> +    if (Mid == End)<br>

>        return;<br>

><br>

> -    // Now we split [R.Begin, R.End) into [R.Begin, Mid) and [Mid, R.End).<br>

> -    if (Mid - R->Begin > 1)<br>

> -      Ranges.push_back({R->Begin, Mid});<br>

> -    R->Begin = Mid;<br>

> -<br>

> -    // Update GroupIds for the new group members. We use the index of<br>

> -    // the group first member as a group ID because that is unique.<br>

> -    for (size_t I = Mid; I < R->End; ++I)<br>

> -      Sections[I]->GroupId = Mid;<br>

> -<br>

> -    // Since we have split a group, we need to repeat the main loop<br>

> -    // later to obtain a convergence. Remember that.<br>

> -    Repeat = true;<br>

> +    // Now we split [Begin, End) into [Begin, Mid) and [Mid, End).<br>

> +    uint32_t Id;<br>

> +    Range *NewRange;<br>

> +    {<br>

> +      std::lock_guard<std::mutex> Lock(Mu);<br>

> +      Ranges.push_back({Mid, End});<br>

> +      NewRange = &Ranges.back();<br>

> +      Id = NextId++;<br>

> +    }<br>

> +    R->End = Mid;<br>

> +<br>

> +    // Update GroupIds for the new group members.<br>

> +    //<br>

> +    // Note on GroupId[0] and GroupId[1]: we have two storages for<br>

> +    // group IDs. At the beginning of each iteration of the main loop,<br>

> +    // both have the same ID. GroupId[0] contains the current ID, and<br>

> +    // GroupId[1] contains the next ID which will be used in the next<br>

> +    // iteration.<br>

> +    //<br>

> +    // Recall that other threads may be working on other ranges. They<br>

> +    // may be reading group IDs that we are about to update. We cannot<br>

> +    // update group IDs in place because it breaks the invariance that<br>

> +    // all sections in the same group must have the same ID. In other<br>

> +    // words, the following for loop is not an atomic operation, and<br>

> +    // that is observable from other threads.<br>

> +    //<br>

> +    // By writing new IDs to write-only places, we can keep the invariance.<br>

> +    for (size_t I = Mid; I < End; ++I)<br>

> +      Sections[I]->GroupId[(Cnt + 1) % 2] = Id;<br>

> +<br>

> +    R = NewRange;<br>

>    }<br>

>  }<br>

><br>

> @@ -211,7 +234,16 @@ bool ICF<ELFT>::variableEq(const InputSe<br>

>      auto *Y = dyn_cast<InputSection<ELFT>>(DB->Section);<br>

>      if (!X || !Y)<br>

>        return false;<br>

> -    return X->GroupId != 0 && X->GroupId == Y->GroupId;<br>

> +    if (X->GroupId[Cnt % 2] == 0)<br>

> +      return false;<br>

> +<br>

> +    // Performance hack for single-thread. If no other threads are<br>

> +    // running, we can safely read next GroupIDs as there is no race<br>

> +    // condition. This optimization may reduce the number of<br>

> +    // iterations of the main loop because we can see results of the<br>

> +    // same iteration.<br>

> +    size_t Idx = (Config->Threads ? Cnt : Cnt + 1) % 2;<br>

> +    return X->GroupId[Idx] == Y->GroupId[Idx];<br>

>    };<br>

><br>

>    return std::equal(RelsA.begin(), RelsA.end(), RelsB.begin(), Eq);<br>

> @@ -226,6 +258,14 @@ bool ICF<ELFT>::equalsVariable(const Inp<br>

>    return variableEq(A, A->rels(), B, B->rels());<br>

>  }<br>

><br>

> +template <class IterTy, class FuncTy><br>

> +static void foreach(IterTy Begin, IterTy End, FuncTy Fn) {<br>

> +  if (Config->Threads)<br>

> +    parallel_for_each(Begin, End, Fn);<br>

> +  else<br>

> +    std::for_each(Begin, End, Fn);<br>

> +}<br>

> +<br>

>  // The main function of ICF.<br>

>  template <class ELFT> void ICF<ELFT>::run() {<br>

>    // Collect sections to merge.<br>

> @@ -239,14 +279,14 @@ template <class ELFT> void ICF<ELFT>::ru<br>

>    // guaranteed) to have the same static contents in terms of ICF.<br>

>    for (InputSection<ELFT> *S : Sections)<br>

>      // Set MSB to 1 to avoid collisions with non-hash IDs.<br>

> -    S->GroupId = getHash(S) | (uint64_t(1) << 63);<br>

> +    S->GroupId[0] = S->GroupId[1] = getHash(S) | (1 << 31);<br>

><br>

>    // From now on, sections in Sections are ordered so that sections in<br>

>    // the same group are consecutive in the vector.<br>

>    std::stable_sort(Sections.begin(), Sections.end(),<br>

>                     [](InputSection<ELFT> *A, InputSection<ELFT> *B) {<br>

> -                     if (A->GroupId != B->GroupId)<br>

> -                       return A->GroupId < B->GroupId;<br>

> +                     if (A->GroupId[0] != B->GroupId[0])<br>

> +                       return A->GroupId[0] < B->GroupId[0];<br>

>                       // Within a group, put the highest alignment<br>

>                       // requirement first, so that's the one we'll keep.<br>

>                       return B->Alignment < A->Alignment;<br>

> @@ -260,25 +300,37 @@ template <class ELFT> void ICF<ELFT>::ru<br>

>    for (size_t I = 0, E = Sections.size(); I < E - 1;) {<br>

>      // Let J be the first index whose element has a different ID.<br>

>      size_t J = I + 1;<br>

> -    while (J < E && Sections[I]->GroupId == Sections[J]->GroupId)<br>

> +    while (J < E && Sections[I]->GroupId[0] == Sections[J]->GroupId[0])<br>

>        ++J;<br>

>      if (J - I > 1)<br>

>        Ranges.push_back({I, J});<br>

>      I = J;<br>

>    }<br>

><br>

> +  // This function copies new GroupIds from former write-only space to<br>

> +  // former read-only space, so that we can flip GroupId[0] and GroupId[1].<br>

> +  // Note that new GroupIds are always added to the end of Ranges.<br>

> +  auto Copy = [&](Range &R) {<br>

> +    for (size_t I = R.Begin; I < R.End; ++I)<br>

> +      Sections[I]->GroupId[Cnt % 2] = Sections[I]->GroupId[(Cnt + 1) % 2];<br>

> +  };<br>

> +<br>

>    // Compare static contents and assign unique IDs for each static content.<br>

> -  std::for_each(Ranges.begin(), Ranges.end(),<br>

> -                [&](Range &R) { segregate(&R, true); });<br>

> +  auto End = Ranges.end();<br>

> +  foreach(Ranges.begin(), End, [&](Range &R) { segregate(&R, true); });<br>

> +  foreach(End, Ranges.end(), Copy);<br>

>    ++Cnt;<br>

><br>

>    // Split groups by comparing relocations until convergence is obtained.<br>

> -  do {<br>

> -    Repeat = false;<br>

> -    std::for_each(Ranges.begin(), Ranges.end(),<br>

> -                  [&](Range &R) { segregate(&R, false); });<br>

> +  for (;;) {<br>

> +    auto End = Ranges.end();<br>

> +    foreach(Ranges.begin(), End, [&](Range &R) { segregate(&R, false); });<br>

> +    foreach(End, Ranges.end(), Copy);<br>

>      ++Cnt;<br>

> -  } while (Repeat);<br>

> +<br>

> +    if (End == Ranges.end())<br>

> +      break;<br>

> +  }<br>

><br>

>    log("ICF needed " + Twine(Cnt) + " iterations");<br>

><br>

><br>

> Modified: lld/trunk/ELF/InputSection.h<br>

> URL: <a href="http://llvm.org/viewvc/llvm-project/lld/trunk/ELF/InputSection.h?rev=288373&r1=288372&r2=288373&view=diff" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-project/lld/trunk/ELF/InputSection.h?rev=288373&r1=288372&r2=288373&view=diff</a><br>

> ==============================================================================<br>

> --- lld/trunk/ELF/InputSection.h (original)<br>

> +++ lld/trunk/ELF/InputSection.h Thu Dec  1 11:09:04 2016<br>

> @@ -289,7 +289,7 @@ public:<br>

>    void relocateNonAlloc(uint8_t *Buf, llvm::ArrayRef<RelTy> Rels);<br>

><br>

>    // Used by ICF.<br>

> -  uint64_t GroupId = 0;<br>

> +  uint32_t GroupId[2] = {0, 0};<br>

><br>

>    // Called by ICF to merge two input sections.<br>

>    void replace(InputSection<ELFT> *Other);<br>

><br>

><br>

> _______________________________________________<br>

> llvm-commits mailing list<br>

> <a href="mailto:llvm-commits@lists.llvm.org">llvm-commits@lists.llvm.org</a><br>

> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits</a><br>

</div></div></blockquote></div><br></div>