<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Sat, Nov 26, 2016 at 5:09 PM, Sean Silva <span dir="ltr"><<a href="mailto:chisophugis@gmail.com" target="_blank">chisophugis@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>Looking at the perf output is interesting:</div><div><br></div><div>Before:        2,528,130 page-faults               #    0.082 M/sec                    ( +-  0.47% )</div><div>After:        2,372,483 page-faults               #    0.067 M/sec                    ( +-  0.36% )</div><div><br></div><div>Observation: Page faults decreased by over 5% with this change.<br></div><div>The only thing I can think of that could cause this is that less overall memory is being allocated from the operating system somehow (maybe malloc can reuse buffers better when this is done in parallel?).</div><div><br></div><div><br></div><div>Before:</div><span class=""><div><br></div><div>  67,352,463,373 stalled-cycles-frontend   #   78.94% frontend cycles idle     ( +-  3.06% )<br></div></span><span class=""><div>  44,295,945,493 instructions              #    0.52  insns per cycle<br>                                           #    1.52  stalled cycles per insn  ( +-  0.44% )<br></div></span><div>After:</div><span class=""><div><br></div><div>  79,294,670,125 stalled-cycles-frontend   #   80.59% frontend cycles idle     ( +-  3.03% )<br></div></span><span class=""><div><div>  46,274,151,813 instructions              #    0.47  insns per cycle</div><div>                                           #    1.71  stalled cycles per insn  ( +-  0.47% )</div></div><div><br></div></span><div>Observation: LLD is getting very poor processor utilization. The CPU is spending spends most of its time stalled.</div></div></blockquote><div><br></div><div>We probably need a function that returns the number of idle cores instead of the number of existing cores, to enable/disable threading. Even if our parallel algorithm achieves the same performance as a non-parallel algorithm in theory, there's costs involving thread creation, coordination, etc. in reality. If all CPUs are busy, we should run parallel for loops using a single thread instead of multiple threads.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><span class="HOEnZb"><font color="#888888"><div><br></div><div><br></div><div>-- Sean Silva</div></font></span><div><div class="h5"><br><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Nov 25, 2016 at 12:05 PM, Rui Ueyama via llvm-commits <span dir="ltr"><<a href="mailto:llvm-commits@lists.llvm.org" target="_blank">llvm-commits@lists.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Author: ruiu<br>

Date: Fri Nov 25 14:05:08 2016<br>

New Revision: 287946<br>

<br>

URL: <a href="http://llvm.org/viewvc/llvm-project?rev=287946&view=rev" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-pr<wbr>oject?rev=287946&view=rev</a><br>

Log:<br>

Parallelize uncompress() and splitIntoPieces().<br>

<br>

Uncompressing section contents and spliting mergeable section contents<br>

into smaller chunks are heavy tasks. They scan entire section contents<br>

and do CPU-intensive tasks such as uncompressing zlib-compressed data<br>

or computing a hash value for each section piece.<br>

<br>

Luckily, these tasks are independent to each other, so we can do that<br>

in parallel_for_each. The number of input sections is large (as opposed<br>

to the number of output sections), so there's a large parallelism here.<br>

<br>

Actually the current design to call uncompress() and splitIntoPieces()<br>

in batch was chosen with doing this in mind. Basically what we need to<br>

do here is to replace `for` with `parallel_for_each`.<br>

<br>

It seems this patch improves latency significantly if linked programs<br>

contain debug info (which in turn contain lots of mergeable strings.)<br>

For example, the latency to link Clang (debug build) improved by 20% on<br>

my machine as shown below. Note that ld.gold took 19.2 seconds to do<br>

the same thing.<br>

<br>

Before:<br>

    30801.782712 task-clock (msec)         #    3.652 CPUs utilized            ( +-  2.59% )<br>

         104,084 context-switches          #    0.003 M/sec                    ( +-  1.02% )<br>

           5,063 cpu-migrations            #    0.164 K/sec                    ( +- 13.66% )<br>

       2,528,130 page-faults               #    0.082 M/sec                    ( +-  0.47% )<br>

  85,317,809,130 cycles                    #    2.770 GHz                      ( +-  2.62% )<br>

  67,352,463,373 stalled-cycles-frontend   #   78.94% frontend cycles idle     ( +-  3.06% )<br>

 <not supported> stalled-cycles-backend<br>

  44,295,945,493 instructions              #    0.52  insns per cycle<br>

                                           #    1.52  stalled cycles per insn  ( +-  0.44% )<br>

   8,572,384,877 branches                  #  278.308 M/sec                    ( +-  0.66% )<br>

     141,806,726 branch-misses             #    1.65% of all branches          ( +-  0.13% )<br>

<br>

     8.433424003 seconds time elapsed                                          ( +-  1.20% )<br>

<br>

After:<br>

    35523.764575 task-clock (msec)         #    5.265 CPUs utilized            ( +-  2.67% )<br>

         159,107 context-switches          #    0.004 M/sec                    ( +-  0.48% )<br>

           8,123 cpu-migrations            #    0.229 K/sec                    ( +- 23.34% )<br>

       2,372,483 page-faults               #    0.067 M/sec                    ( +-  0.36% )<br>

  98,395,342,152 cycles                    #    2.770 GHz                      ( +-  2.62% )<br>

  79,294,670,125 stalled-cycles-frontend   #   80.59% frontend cycles idle     ( +-  3.03% )<br>

 <not supported> stalled-cycles-backend<br>

  46,274,151,813 instructions              #    0.47  insns per cycle<br>

                                           #    1.71  stalled cycles per insn  ( +-  0.47% )<br>

   8,987,621,670 branches                  #  253.003 M/sec                    ( +-  0.60% )<br>

     148,900,624 branch-misses             #    1.66% of all branches          ( +-  0.27% )<br>

<br>

     6.747548004 seconds time elapsed                                          ( +-  0.40% )<br>

<br>

Modified:<br>

    lld/trunk/ELF/Driver.cpp<br>

    lld/trunk/ELF/InputSection.cpp<br>

<br>

Modified: lld/trunk/ELF/Driver.cpp<br>

URL: <a href="http://llvm.org/viewvc/llvm-project/lld/trunk/ELF/Driver.cpp?rev=287946&r1=287945&r2=287946&view=diff" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-pr<wbr>oject/lld/trunk/ELF/Driver.cpp<wbr>?rev=287946&r1=287945&r2=28794<wbr>6&view=diff</a><br>

==============================<wbr>==============================<wbr>==================<br>

--- lld/trunk/ELF/Driver.cpp (original)<br>

+++ lld/trunk/ELF/Driver.cpp Fri Nov 25 14:05:08 2016<br>

@@ -20,6 +20,7 @@<br>

 #include "Target.h"<br>

 #include "Writer.h"<br>

 #include "lld/Config/Version.h"<br>

+#include "lld/Core/Parallel.h"<br>

 #include "lld/Driver/Driver.h"<br>

 #include "llvm/ADT/StringExtras.h"<br>

 #include "llvm/ADT/StringSwitch.h"<br>

@@ -800,14 +801,15 @@ template <class ELFT> void LinkerDriver:<br>

<br>

   // MergeInputSection::splitIntoPi<wbr>eces needs to be called before<br>

   // any call of MergeInputSection::getOffset. Do that.<br>

-  for (InputSectionBase<ELFT> *S : Symtab.Sections) {<br>

-    if (!S->Live)<br>

-      continue;<br>

-    if (S->Compressed)<br>

-      S->uncompress();<br>

-    if (auto *MS = dyn_cast<MergeInputSection<ELF<wbr>T>>(S))<br>

-      MS->splitIntoPieces();<br>

-  }<br>

+  parallel_for_each(Symtab.Secti<wbr>ons.begin(), Symtab.Sections.end(),<br>

+                    [](InputSectionBase<ELFT> *S) {<br>

+                      if (!S->Live)<br>

+                        return;<br>

+                      if (S->Compressed)<br>

+                        S->uncompress();<br>

+                      if (auto *MS = dyn_cast<MergeInputSection<ELF<wbr>T>>(S))<br>

+                        MS->splitIntoPieces();<br>

+                    });<br>

<br>

   // Write the result to the file.<br>

   writeResult<ELFT>();<br>

<br>

Modified: lld/trunk/ELF/InputSection.cpp<br>

URL: <a href="http://llvm.org/viewvc/llvm-project/lld/trunk/ELF/InputSection.cpp?rev=287946&r1=287945&r2=287946&view=diff" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-pr<wbr>oject/lld/trunk/ELF/InputSecti<wbr>on.cpp?rev=287946&r1=287945&<wbr>r2=287946&view=diff</a><br>

==============================<wbr>==============================<wbr>==================<br>

--- lld/trunk/ELF/InputSection.cpp (original)<br>

+++ lld/trunk/ELF/InputSection.cpp Fri Nov 25 14:05:08 2016<br>

@@ -22,6 +22,7 @@<br>

<br>

 #include "llvm/Support/Compression.h"<br>

 #include "llvm/Support/Endian.h"<br>

+#include <mutex><br>

<br>

 using namespace llvm;<br>

 using namespace llvm::ELF;<br>

@@ -160,6 +161,8 @@ InputSectionBase<ELFT>::getRaw<wbr>Compressed<br>

   return {Data.slice(sizeof(*Hdr)), read64be(Hdr->Size)};<br>

 }<br>

<br>

+// Uncompress section contents. Note that this function is called<br>

+// from parallel_for_each, so it must be thread-safe.<br>

 template <class ELFT> void InputSectionBase<ELFT>::uncomp<wbr>ress() {<br>

   if (!zlib::isAvailable())<br>

     fatal(toString(this) +<br>

@@ -179,7 +182,12 @@ template <class ELFT> void InputSectionB<br>

     std::tie(Buf, Size) = getRawCompressedData(Data);<br>

<br>

   // Uncompress Buf.<br>

-  char *OutputBuf = BAlloc.Allocate<char>(Size);<br>

+  char *OutputBuf;<br>

+  {<br>

+    static std::mutex Mu;<br>

+    std::lock_guard<std::mutex> Lock(Mu);<br>

+    OutputBuf = BAlloc.Allocate<char>(Size);<br>

+  }<br>

   if (zlib::uncompress(toStringRef(<wbr>Buf), OutputBuf, Size) != zlib::StatusOK)<br>

     fatal(toString(this) + ": error while uncompressing section");<br>

   Data = ArrayRef<uint8_t>((uint8_t *)OutputBuf, Size);<br>

@@ -746,6 +754,12 @@ MergeInputSection<ELFT>::Merge<wbr>InputSecti<br>

                                            StringRef Name)<br>

     : InputSectionBase<ELFT>(F, Header, Name, InputSectionBase<ELFT>::Merge) {}<br>

<br>

+// This function is called after we obtain a complete list of input sections<br>

+// that need to be linked. This is responsible to split section contents<br>

+// into small chunks for further processing.<br>

+//<br>

+// Note that this function is called from parallel_for_each. This must be<br>

+// thread-safe (i.e. no memory allocation from the pools).<br>

 template <class ELFT> void MergeInputSection<ELFT>::split<wbr>IntoPieces() {<br>

   ArrayRef<uint8_t> Data = this->Data;<br>

   uintX_t EntSize = this->Entsize;<br>

<br>

<br>

______________________________<wbr>_________________<br>

llvm-commits mailing list<br>

<a href="mailto:llvm-commits@lists.llvm.org" target="_blank">llvm-commits@lists.llvm.org</a><br>

<a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/<wbr>mailman/listinfo/llvm-commits</a><br>

</blockquote></div><br></div></div></div></div>

</blockquote></div><br></div></div>