[LLVMdev] On LLD performance

Rafael Espíndola rafael.espindola at gmail.com
Thu Mar 12 09:49:19 PDT 2015


I tried benchmarking it on Linux by linking a Release+asserts clang (with
lld itself built with no asserts). The first things I noticed were:

missing options:

warning: ignoring unknown argument: --no-add-needed
warning: ignoring unknown argument: -O3
warning: ignoring unknown argument: --gc-sections

I just removed them from the command line.

Looks like --hash-style=gnu and --build-id are just ignored, so I
removed them too.

Looks like --strip-all is ignored, so I removed it and ran strip manually.

Looks like .note.GNU-stack is incorrectly added; neither gold nor
bfd.ld adds it when linking clang.

Looks like .gnu.version and .gnu.version_r are not implemented.

Curiously, lld produces a tiny got.dyn (0x0000a0 bytes); I am not sure
why it is not included in .got.

Gold produces a .data.rel.ro.local; lld produces a .data.rel.local;
bfd puts everything in .data.rel. I have to research a bit to find out
what these are. For now I just added the sizes into a single entry.

.eh_frame_hdr is effectively empty with lld, so I removed --eh-frame-hdr
from the command line.

With all that, the sections that increased in size the most when using lld were:

.rodata:       9,449,278 bytes bigger
.eh_frame:       438,376 bytes bigger
.comment:         77,797 bytes bigger
.data.rel.ro:     48,056 bytes bigger

The .comment section is bigger because it contains multiple copies of

clang version 3.7.0 (trunk 232021) (llvm/trunk 232027)

The lack of duplicate entry merging would also explain the size
difference in .rodata and .eh_frame. I have no idea why .data.rel.ro is
bigger.
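
For illustration, here is a minimal C++ sketch of the kind of merging a
linker applies to mergeable string data such as .comment (this is not
lld's actual code; the class and its members are made up):

  #include <cstddef>
  #include <string>
  #include <unordered_map>
  #include <vector>

  // Duplicate strings from all inputs collapse to a single copy in the
  // output; add() returns the offset of the (possibly shared) copy.
  class MergedStringSection {
    std::unordered_map<std::string, size_t> Offsets; // string -> offset
    std::vector<char> Contents;                      // merged output bytes

  public:
    size_t add(const std::string &S) {
      auto It = Offsets.find(S);
      if (It != Offsets.end())
        return It->second; // duplicate: reuse the existing copy
      size_t Off = Contents.size();
      Contents.insert(Contents.end(), S.begin(), S.end());
      Contents.push_back('\0');
      Offsets.emplace(S, Off);
      return Off;
    }
    size_t size() const { return Contents.size(); }
  };

With something like this, the version string that every input object
contributes to .comment is stored once instead of once per object.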

So, with the big caveat that the two linkers are not doing exactly the
same thing, the performance numbers I got were:

lld:

       1961.842991      task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.04% )
             1,152      context-switches          #    0.587 K/sec
                 0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
           199,310      page-faults               #    0.102 M/sec                    ( +-  0.00% )
     5,893,291,145      cycles                    #    3.004 GHz                      ( +-  0.03% )
     3,329,741,079      stalled-cycles-frontend   #   56.50% frontend cycles idle     ( +-  0.05% )
   <not supported>      stalled-cycles-backend
     6,255,727,902      instructions              #    1.06  insns per cycle
                                                  #    0.53  stalled cycles per insn  ( +-  0.01% )
     1,295,893,191      branches                  #  660.549 M/sec                    ( +-  0.01% )
        26,760,734      branch-misses             #    2.07% of all branches          ( +-  0.01% )

       1.963705923 seconds time elapsed                                               ( +-  0.04% )

gold:

        990.708786      task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.06% )
                 0      context-switches          #    0.000 K/sec
                 0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
            77,840      page-faults               #    0.079 M/sec
     2,976,552,629      cycles                    #    3.004 GHz                      ( +-  0.02% )
     1,384,720,988      stalled-cycles-frontend   #   46.52% frontend cycles idle     ( +-  0.04% )
   <not supported>      stalled-cycles-backend
     4,105,948,264      instructions              #    1.38  insns per cycle
                                                  #    0.34  stalled cycles per insn  ( +-  0.00% )
       868,894,366      branches                  #  877.043 M/sec                    ( +-  0.00% )
        15,426,051      branch-misses             #    1.78% of all branches          ( +-  0.01% )

       0.991619294 seconds time elapsed                                               ( +-  0.06% )


Overall, lld takes about twice as long as gold (1.96 vs 0.99 seconds
elapsed) while executing roughly 1.5x as many instructions at a lower
IPC (1.06 vs 1.38 insns per cycle). The biggest qualitative difference
is that lld has 1,152 context switches, yet CPU utilization is still
below 1 (1961.8 ms of task-clock over 1963.7 ms elapsed is 0.999 CPUs),
which suggests the threads are not actually running in parallel. Maybe
there is just a threading bug somewhere?

From your description, we build a hash table of the symbols defined in
an archive and, for each undefined symbol in the overall link, check
whether it is there. It would probably be more efficient to walk the
symbols defined in the archive and check whether each one is needed by
the overall link, no? That would save building a hash table for each
archive.
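
To make that concrete, here is a rough C++ sketch of the two strategies.
The types are entirely hypothetical, and it ignores member deduplication
and the fixpoint iteration a real linker needs (pulling in a member can
add new undefined symbols):

  #include <string>
  #include <unordered_map>
  #include <unordered_set>
  #include <vector>

  struct Archive {
    // Hypothetical representation: each member lists its defined symbols.
    std::vector<std::vector<std::string>> MemberDefs;
  };

  // Strategy described above: build a hash table of the archive's defined
  // symbols, then probe it once per undefined symbol in the link.
  std::vector<size_t>
  selectByHashingArchive(const Archive &A,
                         const std::unordered_set<std::string> &Undefs) {
    std::unordered_map<std::string, size_t> DefToMember;
    for (size_t I = 0; I != A.MemberDefs.size(); ++I)
      for (const std::string &Sym : A.MemberDefs[I])
        DefToMember.emplace(Sym, I);
    std::vector<size_t> Selected;
    for (const std::string &U : Undefs) {
      auto It = DefToMember.find(U);
      if (It != DefToMember.end())
        Selected.push_back(It->second);
    }
    return Selected;
  }

  // Suggested alternative: walk the symbols defined in the archive and
  // probe the link's existing undefined-symbol set; no per-archive table.
  std::vector<size_t>
  selectByWalkingArchive(const Archive &A,
                         const std::unordered_set<std::string> &Undefs) {
    std::vector<size_t> Selected;
    for (size_t I = 0; I != A.MemberDefs.size(); ++I)
      for (const std::string &Sym : A.MemberDefs[I])
        if (Undefs.count(Sym)) {
          Selected.push_back(I); // one hit is enough to pull the member in
          break;
        }
    return Selected;
  }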

One big difference in how lld works is the atom model. It basically
creates one Atom per symbol, which is inherently more work than what
gold does. IMHO this conflates what atoms are with one particular way
of specifying them in the object files.

It would be interesting to define an atom as the smallest thing that
cannot be split. It could still have multiple symbols in it, for
example, and there would be no such thing as an AbsoluteAtom, just an
AbsoluteSymbol. In this model, the Mach-O reader would use symbols to
create atoms, but that is just one way to do it. The ELF reader would
create one atom per regular section and special-case SHF_MERGE and
.eh_frame (but we should really fix the .eh_frame situation in LLVM too).

The atoms created in this way (for ELF at least) would be freely
movable, further reducing the cost.
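
A hypothetical sketch of that model (illustrative names only, not lld's
actual classes):

  #include <cstdint>
  #include <string>
  #include <vector>

  // Several symbols can point into one atom, at different offsets.
  struct Symbol {
    std::string Name;
    uint64_t OffsetInAtom;
  };

  // An atom is the smallest indivisible chunk of the output image,
  // e.g. the bytes of one regular ELF section.
  struct Atom {
    std::vector<uint8_t> Contents;
    std::vector<Symbol> Symbols; // all symbols defined inside it
  };

  // No AbsoluteAtom: an absolute symbol is just a name and a value,
  // with no contents to place in the image.
  struct AbsoluteSymbol {
    std::string Name;
    uint64_t Value;
  };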

Cheers,
Rafael


