[LLVMdev] On LLD performance
Shankar Easwaran
shankare at codeaurora.org
Fri Mar 13 09:38:41 PDT 2015
Rafael,
This is very good information and extremely useful.
On 3/12/2015 11:49 AM, Rafael Espíndola wrote:
> I tried benchmarking it on linux by linking clang Release+asserts (but
> lld itself with no asserts). The first things I noticed were:
>
> missing options:
>
> warning: ignoring unknown argument: --no-add-needed
> warning: ignoring unknown argument: -O3
> warning: ignoring unknown argument: --gc-sections
>
> I just removed them from the command line.
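For anyone repeating this kind of A/B setup, here is a rough sketch of
how the link step can be captured and re-driven (the object list is a
placeholder, and how lld gets selected depends on how it is installed
in your tree):

    # Print the link command the clang driver would run, without running it:
    clang++ -### -o clang <objects and libraries...>
    # The same link with gold instead of the default linker:
    clang++ -fuse-ld=gold -o clang-gold <objects and libraries...>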
>
> Looks like --hash-style=gnu and --build-id are just ignored, so I
> removed them too.
>
> Looks like --strip-all is ignored, so I removed and ran strip manually.
>
> Looks like .note.GNU-stack is incorrectly added, neither gold nor
> bfd.ld adds it for clang.
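For reference, a quick way to check that on each output (readelf is
from binutils; the output file names here are placeholders):

    readelf -S -W clang-lld  | grep GNU-stack
    readelf -S -W clang-gold | grep GNU-stack
    # The executable-stack decision itself shows up as the GNU_STACK
    # program header:
    readelf -l -W clang-lld | grep GNU_STACK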
>
> Looks like .gnu.version and .gnu.version_r are not implemented.
>
> Curiously lld produces a tiny got.dyn (0x0000a0 bytes), not sure why
> it is not included in .got.
I have a fix for this. Will merge it.
>
> Gold produces a .data.rel.ro.local. lld produces a .data.rel.local.
> bfd puts everything in .data.rel. I have to research a bit to find out
> what this is. For now I just added the sizes into a single entry.
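One place to look while researching that: GNU ld (bfd) dumps its
built-in linker script with --verbose, and searching it shows which
.data.rel* input-section patterns it maps into which output section:

    ld --verbose | grep -n 'data\.rel'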
>
> .eh_frame_hdr is effectively empty on lld. I removed --eh-frame-hdr
> from the command line.
>
> With all that, the sections that increased in size the most when using lld were:
>
> .rodata: 9 449 278 bytes bigger
> .eh_frame: 438 376 bytes bigger
> .comment: 77 797 bytes bigger
> .data.rel.ro: 48 056 bytes bigger
Did you try --merge-strings with lld? (--gc-sections was also dropped
from the command line above.)
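For reference, a rough sketch of how this kind of per-section size
comparison can be reproduced (output file names are placeholders; size,
awk and join are standard binutils/coreutils tools):

    # "size -A" prints one "name size addr" line per section, in decimal.
    size -A clang-lld  | awk 'NR > 2 && NF == 3 { print $1, $2 }' | sort > lld.sizes
    size -A clang-gold | awk 'NR > 2 && NF == 3 { print $1, $2 }' | sort > gold.sizes
    # Columns: section, lld size, gold size, difference (lld - gold).
    join lld.sizes gold.sizes | \
        awk '{ printf "%-24s %12d %12d %12d\n", $1, $2, $3, $2 - $3 }' | sort -k4 -nr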
>
> The comment section is bigger because it has multiple copies of
>
> clang version 3.7.0 (trunk 232021) (llvm/trunk 232027)
>
> The lack of duplicate entry merging would also explain the size
> difference of .rodata and .eh_frame. No idea why .data.rel.ro is
> bigger.
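A quick way to see the duplication (the binary and object names are
placeholders):

    # Literal strings that appear more than once in the lld output:
    strings -a clang-lld | sort | uniq -d | head
    # Clang typically emits string literals into .rodata.str* input
    # sections with the SHF_MERGE|SHF_STRINGS ("MS") flags, which is
    # what a linker keys off to deduplicate them:
    readelf -S -W some_object.o | grep '\.rodata\.str'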
>
> So, with the big warning that both linkers are not doing exactly the
> same thing, the performance numbers I got were:
>
> lld:
>
>
>      1961.842991  task-clock (msec)        #  0.999 CPUs utilized            ( +-  0.04% )
>            1,152  context-switches         #  0.587 K/sec
>                0  cpu-migrations           #  0.000 K/sec                    ( +-100.00% )
>          199,310  page-faults              #  0.102 M/sec                    ( +-  0.00% )
>    5,893,291,145  cycles                   #  3.004 GHz                      ( +-  0.03% )
>    3,329,741,079  stalled-cycles-frontend  # 56.50% frontend cycles idle     ( +-  0.05% )
>  <not supported>  stalled-cycles-backend
>    6,255,727,902  instructions             #  1.06  insns per cycle
>                                            #  0.53  stalled cycles per insn  ( +-  0.01% )
>    1,295,893,191  branches                 # 660.549 M/sec                   ( +-  0.01% )
>       26,760,734  branch-misses            #  2.07% of all branches          ( +-  0.01% )
>
>      1.963705923 seconds time elapsed                                        ( +-  0.04% )
>
> gold:
>
>       990.708786  task-clock (msec)        #  0.999 CPUs utilized            ( +-  0.06% )
>                0  context-switches         #  0.000 K/sec
>                0  cpu-migrations           #  0.000 K/sec                    ( +-100.00% )
>           77,840  page-faults              #  0.079 M/sec
>    2,976,552,629  cycles                   #  3.004 GHz                      ( +-  0.02% )
>    1,384,720,988  stalled-cycles-frontend  # 46.52% frontend cycles idle     ( +-  0.04% )
>  <not supported>  stalled-cycles-backend
>    4,105,948,264  instructions             #  1.38  insns per cycle
>                                            #  0.34  stalled cycles per insn  ( +-  0.00% )
>      868,894,366  branches                 # 877.043 M/sec                   ( +-  0.00% )
>       15,426,051  branch-misses            #  1.78% of all branches          ( +-  0.01% )
>
>      0.991619294 seconds time elapsed                                        ( +-  0.06% )
>
>
> The biggest difference that shows up is that lld has 1,152 context
> switches, but the cpu utilization is still < 1. Maybe there is just a
> threading bug somewhere?
lld apparently is highly multithreaded, but I see your point. Maybe
running this exercise on /dev/shm would show more CPU utilization?
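Something along these lines (paths and link commands are placeholders;
perf's -r flag repeats the run and reports the variance, matching the
"+-" columns above):

    # Copy the link inputs to a tmpfs so disk I/O is out of the picture,
    # then time each link a few times:
    cp -r <link inputs> /dev/shm/lld-bench && cd /dev/shm/lld-bench
    perf stat -r 5 <link command using lld>
    perf stat -r 5 <link command using gold>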
Shankar Easwaran
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by the Linux Foundation