[llvm-dev] RFC: EfficiencySanitizer working set tool
Derek Bruening via llvm-dev
llvm-dev at lists.llvm.org
Tue Apr 19 22:44:01 PDT 2016
Please reference the prior RFC on EfficiencySanitizer. This is one of the
performance analysis tools we would like to build under the
Knowing the working set size at periodic points during a given
application's execution helps to understand its cache behavior, how its
behavior changes over time, how its performance might vary on different
hardware configurations, and how best to direct performance improvement
efforts. For example, knowing whether the working set is close to fitting
in current L3 caches or is many times larger can help determine which
efforts are most likely to bear fruit.
The typical approach to acquire working set information is to collect a
memory reference profile and feed it to a cache simulator, looking for the
point at which the miss rate drops to zero. However, this incurs
prohibitively high runtime overhead. Our goal is to build a fast shadow
memory-based working set measurement tool.
We focus on the data working set size, partly because it is typically more
important and partly because to measure the instruction working set with
compiler instrumentation is much more difficult to implement, requiring
insertion at a later and more challenging point in the compilation pipeline.
Because we want to know how the program’s working set affects the cache
behavior, we define the working set as the set of cache lines that are
referenced by the program during a certain period of time.
Using the definition above, we are able to use a single shadow bit to
represent whether a cache line is referenced or not. If the cache line
size is 64 bytes, we will use one bit to represent a 64 byte cache line via
a 512 byte to 1 byte shadow mapping.
For every memory reference, we insert code to set the shadow bits
corresponding to the cache line(s) touched by that reference. The
resulting shadow memory bitmap describes the program’s working set.
There are two optimizations that can be added:
1) Add a shadow bit check before writing each bit to reduce the number of
memory stores. In our experience this is always a win with shadow memory
2) Analyze the memory references within a basic block to coalesce multiple
memory references into a single shadow update.
We will divide the program execution into regular intervals and record a
snapshot of the working set at each interval boundary. An adaptive
strategy can keep the number of snapshots bounded across varying total
execution time by combining adjacent snapshots via logical or. When a
snapshot is recorded, the shadow memory is cleared so that the next
snapshot starts with a blank slate. If we use a 64 byte to 1 byte shadow
mapping, we can use the upper bits to store up to 8 consecutive snapshots
in the shadow memory itself by shifting rather than clearing the shadow
memory on a snapshot.
Recording the snapshot will use optimizations to avoid storing for the
entire address space: only mapped regions will be saved.
The report from the tool to the user at the end of execution would
essentially be a histogram with time on the x axis and the cache lines
touched on the y axis. The units of time could be selected by the user:
cpu time, wall clock time, memory access count, etc. The unit determines
the method of snapshot sampling, whether performance counters for the
memory access count (or instrumentation to increment a global counter) or
an itimer. We can record callstacks with each snapshot as well to help
give an indication of what the program is doing at that point in time, to
help the user understand the program phases.
We expect the time overhead of the tool to be well under the 5x
EfficiencySanitizer ceiling; presumably it should be under 3x.
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the llvm-dev