<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Thu, Mar 12, 2015 at 9:49 AM, Rafael Espíndola <span dir="ltr"><<a href="mailto:rafael.espindola@gmail.com" target="_blank">rafael.espindola@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">I tried benchmarking it on linux by linking clang Release+asserts (but<br>

lld itself with no asserts). The first things I noticed were:<br>

<br>

missing options:<br>

<br>

warning: ignoring unknown argument: --no-add-needed<br>

warning: ignoring unknown argument: -O3<br>

warning: ignoring unknown argument: --gc-sections<br>

<br>

I just removed them from the command line.<br>

<br>

Looks like --hash-style=gnu and --build-id are just ignored, so I<br>

removed them too.<br>

<br>

Looks like --strip-all is ignored, so I removed and ran strip manually.<br>

<br>

Looks like .note.GNU-stack is incorrectly added, neither gold nor<br>

bfd.ld adds it for clang.<br>

<br>

Looks like .gnu.version and .gnu.version_r are not implemented.<br>

<br>

Curiously lld produces a tiny got.dyn (0x0000a0 bytes), not sure why<br>

it is not included in .got.<br>

<br>

Gold produces a .data.rel.ro.local. lld produces a .data.rel.local.<br>

bfd puts everything in .data.rel. I have to research a bit to find out<br>

what this is. For now I just added the sizes into a single entry.<br>

<br>

.eh_frame_hdr is effectively empty on lld. I removed --eh-frame-hdr<br>

from the command line.<br>

<br>

With all that, the sections that increased in size the most when using lld were:<br>

<br>

.rodata: 9 449 278 bytes bigger<br>

.eh_frame: 438 376 bytes bigger<br>

.comment: 77 797 bytes bigger<br>

.data.rel.ro: 48 056 bytes bigger<br>

<br>

The comment section is bigger because it has multiple copies of<br>

<br>

clang version 3.7.0 (trunk 232021) (llvm/trunk 232027)<br>

<br>

The lack of duplicate entry merging would also explain the size<br>

difference of .rodata and .eh_frame. No idea why .data.rel.ro is<br>

bigger.<br>

<br>

So, with the big warning that both linkers are not doing exactly the<br>

same thing, the performance numbers I got were:<br>

<br>

lld:<br>

<br>

<br>

       1961.842991      task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.04% )<br>

             1,152      context-switches          #    0.587 K/sec<br>

                 0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )<br>

           199,310      page-faults               #    0.102 M/sec                    ( +-  0.00% )<br>

     5,893,291,145      cycles                    #    3.004 GHz                      ( +-  0.03% )<br>

     3,329,741,079      stalled-cycles-frontend   #   56.50% frontend cycles idle     ( +-  0.05% )<br>

   <not supported>      stalled-cycles-backend<br>

     6,255,727,902      instructions              #    1.06  insns per cycle<br>

                                                  #    0.53  stalled cycles per insn  ( +-  0.01% )<br>

     1,295,893,191      branches                  #  660.549 M/sec                    ( +-  0.01% )<br>

        26,760,734      branch-misses             #    2.07% of all branches          ( +-  0.01% )<br>

<br>

       1.963705923 seconds time elapsed                                          ( +-  0.04% )<br>

<br>

gold:<br>

<br>

        990.708786      task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.06% )<br>

                 0      context-switches          #    0.000 K/sec<br>

                 0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )<br>

            77,840      page-faults               #    0.079 M/sec<br>

     2,976,552,629      cycles                    #    3.004 GHz                      ( +-  0.02% )<br>

     1,384,720,988      stalled-cycles-frontend   #   46.52% frontend cycles idle     ( +-  0.04% )<br>

   <not supported>      stalled-cycles-backend<br>

     4,105,948,264      instructions              #    1.38  insns per cycle<br>

                                                  #    0.34  stalled cycles per insn  ( +-  0.00% )<br>

       868,894,366      branches                  #  877.043 M/sec                    ( +-  0.00% )<br>

        15,426,051      branch-misses             #    1.78% of all branches          ( +-  0.01% )<br>

<br>

       0.991619294 seconds time elapsed                                          ( +-  0.06% )<br>

<br>

<br>

The biggest difference that shows up is that lld has 1,152 context<br>

switches, but the cpu utilization is still < 1. Maybe there is just a<br>

threading bug somewhere?<br></blockquote><div><br></div><div>The implementation of the threading class inside LLD differs between Windows and other platforms. On Windows, it's just a wrapper around Microsoft's ConcRT threading library. On other platforms, we have a simple implementation that mimics it. So, first of all, I can't speak to the numbers measured on Unix. (I haven't tested that.)</div><div><br></div><div>But 1,152 context switches is a small number, I guess? It's unlikely that so few context switches would make LLD two times slower than gold. I believe the bottleneck is something else. I don't think anyone has really optimized the ELF reader, passes, and writers, so there may be some bad code there, but I probably shouldn't guess and should measure instead.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

From your description, we build a hash of symbols in an archive and<br>

for each undefined symbol in the overall link check if it is there. It<br>

would probably be more efficient to walk the symbols defined in an<br>

archive and check if it is needed by the overall link status, no? That<br>

would save building a hash table for each archive.<br>
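A minimal sketch of the two lookup strategies contrasted above, under simplifying assumptions: an archive is modeled as a mapping from member name to the symbols it defines, the link state is just a set of undefined symbol names, and pulling in a member does not introduce new undefined symbols. The function names and data layout are hypothetical and are not taken from lld or gold.<br>

```python
def members_via_archive_hash(archive_symbols, undefined):
    """Build a per-archive hash table, then probe it once per undefined symbol."""
    table = {}
    for member, symbols in archive_symbols.items():
        for name in symbols:
            table.setdefault(name, member)  # first definition in the archive wins
    return {table[name] for name in undefined if name in table}


def members_via_archive_walk(archive_symbols, undefined):
    """Walk the archive's symbols once and probe the link's undefined set instead."""
    needed = set()
    claimed = set()  # undefined symbols already satisfied by an earlier member
    for member, symbols in archive_symbols.items():
        for name in symbols:
            if name in undefined and name not in claimed:
                claimed.add(name)
                needed.add(member)
    return needed


# Hypothetical archive contents and link state, for illustration only.
archive = {"a.o": ["foo", "bar"], "b.o": ["baz"], "c.o": ["foo", "qux"]}
undefined = {"foo", "qux", "missing"}
```

Both functions select the same members here; the second avoids allocating a table per archive and relies only on the undefined set the link already maintains.<br>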

<br>

One big difference of how lld works is the atom model. It basically<br>

creates one Atom per symbol. That is inherently more work than what is<br>

done by gold. IMHO it confuses what atoms are with one way to<br>

specify atoms in the object files.<br>

<br>

It would be interesting to define an atom as the smallest thing that<br>

cannot be split. It could still have multiple symbols in it for<br>

example, and there would be no such thing as an AbsoluteAtom, just an<br>

AbsoluteSymbol. In this model, the MachO reader would use symbols to<br>

create atoms, but that is just one way to do it. The elf reader would<br>

create 1 atom per regular section and special case SHF_MERGE and<br>

.eh_frame (but we should really fix this one in LLVM too).<br>

<br>

The atoms created in this way (for ELF at least) would be freely<br>

movable, further reducing the cost.<br></blockquote><div><br></div><div>I think I agree. Or, at least, the term "atom" is odd because it's not atomic. What we call an atom is a symbol with its associated data. Usually all atoms created from the same section are linked or excluded as a group; it is the section that is indivisible (atomic).</div><div><br></div><div>We don't have a notion of sections in the resolver. Many linker features are defined in terms of sections, so handling them in the atom model requires something less than straightforward. (For example, we copy section attributes to atoms so that they are preserved during the linking process, which means every atom created from the same section carries its own copy of exactly the same values.)</div></div></div></div>
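<div>To make the attribute-copying point concrete, here is a rough sketch that contrasts one atom per symbol with one atom per section. All class and function names (Section, SymbolAtom, SectionAtom, atoms_per_symbol, atoms_per_section) are hypothetical illustrations, not LLD's actual classes, and the flag names are placeholders.</div>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    name: str
    flags: frozenset   # placeholder attribute names, e.g. {"ALLOC", "WRITE"}
    symbols: tuple     # names of the symbols defined in this section


@dataclass(frozen=True)
class SymbolAtom:
    """Per-symbol model: each atom carries its own copy of the section attributes."""
    symbol: str
    section_name: str
    flags: frozenset


@dataclass(frozen=True)
class SectionAtom:
    """Per-section model: one atom holds the section and all of its symbols."""
    section: Section


def atoms_per_symbol(sections):
    # One atom per symbol; section attributes are repeated on every atom.
    return [SymbolAtom(sym, sec.name, sec.flags)
            for sec in sections for sym in sec.symbols]


def atoms_per_section(sections):
    # One atom per section; attributes are stored once.
    return [SectionAtom(sec) for sec in sections]


# Hypothetical input, for illustration only.
sections = [
    Section(".text.f", frozenset({"ALLOC", "EXEC"}), ("f", "f_helper")),
    Section(".data", frozenset({"ALLOC", "WRITE"}), ("x", "y", "z")),
]
```

<div>With this input the per-symbol model produces five atoms that repeat two distinct attribute sets, while the per-section model produces two atoms. A real reader would still need special handling for mergeable sections and .eh_frame, which this sketch leaves out.</div>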