[llvm-dev] A couple metrics of LLD/ELF's performance

Sun Nov 27 03:23:17 PST 2016

These numbers were collected on Rafael's clang-fsds test case (with -O3
and --gc-sections removed from the link), using a command like:
```
sudo perf record --event=cache-misses --call-graph=dwarf -- \
  /home/sean/pg/llvm/release/bin/ld.lld @response.txt -o /tmp/t --no-threads
```

And then
```
sudo perf report --no-children --sort dso,srcfile
```

One annoying thing about these numbers from perf is that they usually don't
sum to 100%, so just treat them as relative to each other. Overall I'm not
very happy with perf; I don't fully trust its output.
Also, keep in mind that clang-fsds doesn't have debug info, so the heavy
string handling costs don't show up in this profile.

--event=cycles
This is the perf default and correlates with overall runtime. One
interesting thing this shows is that LLD is currently quite bottlenecked on
the kernel.
https://reviews.llvm.org/P7944

The other metrics below are harder to improve: doing so will require
macro-scale changes to our data structures and IO. We should be aware of
them so that we don't get stuck in a local optimum of performance.

--event=cache-misses
I believe these are L2 misses. getOffset shows up here quite a bit.
This metric is useful because L2 is core-private (my CPU is an i7-6700HQ,
but this applies to all recent big Intel cores): an access that hits in L2
doesn't contend with other cores, while one that misses goes on to the
shared L3 cache. So misses here are where cores start to feel each other's
presence.
https://reviews.llvm.org/P7943

--event=LLC-load-misses
These are misses in the last level cache (LLC), i.e. times that we have to
go to DRAM (SLOOOW).
The getVA codepath shows up strongly, and we also see the memcpy into the
output. We may want to consider a nontemporal memcpy to at least avoid
polluting the cache.
These misses contend on the DRAM bus (although it may currently be
underutilized, in which case adding more parallelism will help keep it
busy, but only up to a point).
https://reviews.llvm.org/P7947
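
To make the nontemporal memcpy idea concrete, here is a minimal sketch of
what such a copy could look like on x86 with SSE2. copyNonTemporal is a
hypothetical helper, not something that exists in LLD, and I have not
measured whether it would actually help; on targets without SSE2 it just
falls back to plain memcpy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper (not part of LLD). Copies Size bytes from Src to Dst,
// using streaming stores for the 16-byte-aligned bulk of the destination so
// that the written data is not pulled into the cache.
inline void copyNonTemporal(void *Dst, const void *Src, std::size_t Size) {
  assert((Dst && Src) || Size == 0);
  auto *D = static_cast<unsigned char *>(Dst);
  const auto *S = static_cast<const unsigned char *>(Src);
#if defined(__SSE2__)
  // _mm_stream_si128 needs a 16-byte-aligned destination, so copy a short
  // head with plain memcpy until D is aligned.
  std::size_t Misalign = reinterpret_cast<std::uintptr_t>(D) & 15;
  std::size_t Head = Misalign ? 16 - Misalign : 0;
  if (Head > Size)
    Head = Size;
  std::memcpy(D, S, Head);
  D += Head;
  S += Head;
  Size -= Head;

  // Stream the bulk 16 bytes at a time. The source may be unaligned.
  for (; Size >= 16; D += 16, S += 16, Size -= 16) {
    __m128i V = _mm_loadu_si128(reinterpret_cast<const __m128i *>(S));
    _mm_stream_si128(reinterpret_cast<__m128i *>(D), V);
  }
  // Order the streaming stores before any later stores.
  _mm_sfence();
#endif
  // Tail bytes (or the whole copy when SSE2 is not available).
  std::memcpy(D, S, Size);
}
```

Streaming stores only make sense if the destination is not read again
soon, which should be the case for section contents written into the
output, but that is an assumption that would need to be checked with the
LLC-load-misses profile before and after.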

--event=dTLB-load-misses
These are dTLB misses for loads. On my machine, this corresponds to any
time that the hardware page table walker kicks in; see
https://github.com/torvalds/linux/blob/f92b7604149a55cb601fc0b52911b1e11f0f2514/arch/x86/events/intel/core.c#L434
Here we also see the getVA codepath (which is basically doing a random
lookup into a huge hash table, so it will dTLB miss) and the memcpy into
the output.
https://reviews.llvm.org/P7945
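
The random-lookup effect is easy to reproduce outside of LLD. The sketch
below is a toy stand-in, not LLD code: walk and compareWalks are
hypothetical names, and a flat array indexed in shuffled order plays the
role of the hash table. It reads every entry once in sequential order and
once in random order and times both.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

struct WalkResult {
  std::uint64_t Sum; // sum of all entries read, to keep the loop alive
  double Seconds;    // wall-clock time of the read loop
};

// Reads Table[Order[i]] for every i and times the loop.
inline WalkResult walk(const std::vector<std::uint64_t> &Table,
                       const std::vector<std::uint32_t> &Order) {
  auto Start = std::chrono::steady_clock::now();
  std::uint64_t Sum = 0;
  for (std::uint32_t Idx : Order)
    Sum += Table[Idx];
  std::chrono::duration<double> Elapsed =
      std::chrono::steady_clock::now() - Start;
  return {Sum, Elapsed.count()};
}

// Returns {sequential walk, random walk} over a table of NumEntries values.
inline std::pair<WalkResult, WalkResult> compareWalks(std::size_t NumEntries) {
  std::vector<std::uint64_t> Table(NumEntries);
  std::iota(Table.begin(), Table.end(), std::uint64_t{1});

  std::vector<std::uint32_t> Order(NumEntries);
  std::iota(Order.begin(), Order.end(), 0u);
  WalkResult Sequential = walk(Table, Order);

  // Same set of indices, visited in a fixed pseudo-random order.
  std::mt19937 Rng(12345);
  std::shuffle(Order.begin(), Order.end(), Rng);
  WalkResult Random = walk(Table, Order);
  return {Sequential, Random};
}
```

With a table much larger than the TLB reach, I would expect the random walk
to be noticeably slower, but the wall-clock difference mixes cache misses
and TLB misses. Running a driver for this under
perf stat --event=dTLB-load-misses is the way to see the TLB part on its
own.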

--event=minor-faults
This metric essentially shows where new pages of memory are touched for
the first time, so that the kernel has to either allocate a page or do a
page table fixup.
The memcpy into the output is a huge part of this. Unsurprisingly, there
are also lots of minor faults as malloc gets memory from the kernel.
https://reviews.llvm.org/P7946
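
For anyone who wants to see this outside of perf, here is a small sketch
that counts minor faults around a first touch of freshly mapped memory
using getrusage. The helper names are hypothetical, and an anonymous
mapping is used as a stand-in for the output buffer, so the numbers are
only indicative of what happens during the real memcpy.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

// Hypothetical helper: minor faults taken by this process so far.
inline long minorFaultsSoFar() {
  struct rusage RU {};
  getrusage(RUSAGE_SELF, &RU);
  return RU.ru_minflt;
}

struct TouchResult {
  long MinorFaults;         // faults taken while touching (-1 if mmap failed)
  std::size_t PagesTouched; // number of pages written to
};

// Maps Bytes of fresh anonymous memory, writes one byte to each page, and
// reports how many minor faults that first touch caused.
inline TouchResult touchFreshPages(std::size_t Bytes) {
  const std::size_t PageSize =
      static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
  void *Mem = mmap(nullptr, Bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (Mem == MAP_FAILED)
    return {-1, 0};

  auto *P = static_cast<volatile unsigned char *>(Mem);
  long Before = minorFaultsSoFar();
  std::size_t Pages = 0;
  for (std::size_t Off = 0; Off < Bytes; Off += PageSize, ++Pages)
    P[Off] = 1; // the first write to a page is where the fault happens
  long After = minorFaultsSoFar();

  munmap(Mem, Bytes);
  return {After - Before, Pages};
}
```

The fault count depends on kernel settings: with transparent huge pages
the kernel can satisfy many 4 KiB pages with a single fault, so the count
may be far lower than PagesTouched.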

-- Sean Silva