[LLVMdev] On LLD performance
Shankar Easwaran
shankare at codeaurora.org
Fri Mar 13 09:38:41 PDT 2015
Rafael,
This is very good information and extremely useful.
On 3/12/2015 11:49 AM, Rafael Espíndola wrote:
> I tried benchmarking it on linux by linking clang Release+asserts (but
> lld itself with no asserts). The first things I noticed were:
>
> missing options:
>
> warning: ignoring unknown argument: --no-add-needed
> warning: ignoring unknown argument: -O3
> warning: ignoring unknown argument: --gc-sections
>
> I just removed them from the command line.
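For anyone repeating this kind of A/B setup, here is a rough sketch of
how the link step can be captured and re-driven (the object list is a
placeholder, and how lld gets selected depends on how it is installed
in your tree):

    # Print the link command the clang driver would run, without running it:
    clang++ -### -o clang <objects and libraries...>
    # The same link with gold instead of the default linker:
    clang++ -fuse-ld=gold -o clang-gold <objects and libraries...>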
>
> Looks like --hash-style=gnu and --build-id are just ignored, so I
> removed them too.
>
> Looks like --strip-all is ignored, so I removed and ran strip manually.
>
> Looks like .note.GNU-stack is incorrectly added, neither gold nor
> bfd.ld adds it for clang.
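For reference, a quick way to check that on each output (readelf is
from binutils; the output file names here are placeholders):

    readelf -S -W clang-lld  | grep GNU-stack
    readelf -S -W clang-gold | grep GNU-stack
    # The executable-stack decision itself shows up as the GNU_STACK
    # program header:
    readelf -l -W clang-lld | grep GNU_STACK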
>
> Looks like .gnu.version and .gnu.version_r are not implemented.
>
> Curiously lld produces a tiny got.dyn (0x0000a0 bytes), not sure why
> it is not included in .got.
I have a fix for this. Will merge it.
>
> Gold produces a .data.rel.ro.local. lld produces a .data.rel.local.
> bfd puts everything in .data.rel. I have to research a bit to find out
> what this is. For now I just added the sizes into a single entry.
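One place to look while researching that: GNU ld (bfd) dumps its
built-in linker script with --verbose, and searching it shows which
.data.rel* input-section patterns it maps into which output section:

    ld --verbose | grep -n 'data\.rel'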
>
> .eh_frame_hdr is effectively empty on lld. I removed --eh-frame-hdr
> from the command line.
>
> With all that, the sections that increased in size the most when using lld were:
>
> .rodata: 9 449 278 bytes bigger
> .eh_frame: 438 376 bytes bigger
> .comment: 77 797 bytes bigger
> .data.rel.ro: 48 056 bytes bigger
Did you try --merge-strings with lld? (--gc-sections was also dropped
from the command line above.)
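For reference, a rough sketch of how this kind of per-section size
comparison can be reproduced (output file names are placeholders; size,
awk and join are standard binutils/coreutils tools):

    # "size -A" prints one "name size addr" line per section, in decimal.
    size -A clang-lld  | awk 'NR > 2 && NF == 3 { print $1, $2 }' | sort > lld.sizes
    size -A clang-gold | awk 'NR > 2 && NF == 3 { print $1, $2 }' | sort > gold.sizes
    # Columns: section, lld size, gold size, difference (lld - gold).
    join lld.sizes gold.sizes | \
        awk '{ printf "%-24s %12d %12d %12d\n", $1, $2, $3, $2 - $3 }' | sort -k4 -nr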
>
> The comment section is bigger because it has multiple copies of
>
> clang version 3.7.0 (trunk 232021) (llvm/trunk 232027)
>
> The lack of duplicate entry merging would also explain the size
> difference of .rodata and .eh_frame. No idea why .data.rel.ro is
> bigger.
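A quick way to see the duplication (the binary and object names are
placeholders):

    # Literal strings that appear more than once in the lld output:
    strings -a clang-lld | sort | uniq -d | head
    # Clang typically emits string literals into .rodata.str* input
    # sections with the SHF_MERGE|SHF_STRINGS ("MS") flags, which is
    # what a linker keys off to deduplicate them:
    readelf -S -W some_object.o | grep '\.rodata\.str'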
>
> So, with the big warning that both linkers are not doing exactly the
> same thing, the performance numbers I got were:
>
> lld:
>
>
>      1961.842991  task-clock (msec)        #  0.999 CPUs utilized            ( +-  0.04% )
>            1,152  context-switches         #  0.587 K/sec
>                0  cpu-migrations           #  0.000 K/sec                    ( +-100.00% )
>          199,310  page-faults              #  0.102 M/sec                    ( +-  0.00% )
>    5,893,291,145  cycles                   #  3.004 GHz                      ( +-  0.03% )
>    3,329,741,079  stalled-cycles-frontend  # 56.50% frontend cycles idle     ( +-  0.05% )
>  <not supported>  stalled-cycles-backend
>    6,255,727,902  instructions             #  1.06  insns per cycle
>                                            #  0.53  stalled cycles per insn  ( +-  0.01% )
>    1,295,893,191  branches                 # 660.549 M/sec                   ( +-  0.01% )
>       26,760,734  branch-misses            #  2.07% of all branches          ( +-  0.01% )
>
>      1.963705923 seconds time elapsed                                        ( +-  0.04% )
>
> gold:
>
>       990.708786  task-clock (msec)        #  0.999 CPUs utilized            ( +-  0.06% )
>                0  context-switches         #  0.000 K/sec
>                0  cpu-migrations           #  0.000 K/sec                    ( +-100.00% )
>           77,840  page-faults              #  0.079 M/sec
>    2,976,552,629  cycles                   #  3.004 GHz                      ( +-  0.02% )
>    1,384,720,988  stalled-cycles-frontend  # 46.52% frontend cycles idle     ( +-  0.04% )
>  <not supported>  stalled-cycles-backend
>    4,105,948,264  instructions             #  1.38  insns per cycle
>                                            #  0.34  stalled cycles per insn  ( +-  0.00% )
>      868,894,366  branches                 # 877.043 M/sec                   ( +-  0.00% )
>       15,426,051  branch-misses            #  1.78% of all branches          ( +-  0.01% )
>
>      0.991619294 seconds time elapsed                                        ( +-  0.06% )
>
>
> The biggest difference that shows up is that lld has 1,152 context
> switches, but the cpu utilization is still < 1. Maybe there is just a
> threading bug somewhere?
lld apparently is highly multithreaded, but I see your point. Maybe
running this exercise on /dev/shm would show more CPU utilization?
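Something along these lines (paths and link commands are placeholders;
perf's -r flag repeats the run and reports the variance, matching the
"+-" columns above):

    # Copy the link inputs to a tmpfs so disk I/O is out of the picture,
    # then time each link a few times:
    cp -r <link inputs> /dev/shm/lld-bench && cd /dev/shm/lld-bench
    perf stat -r 5 <link command using lld>
    perf stat -r 5 <link command using gold>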
Shankar Easwaran
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by the Linux Foundation