[llvm-dev] RFC: EfficiencySanitizer working set tool

Sean Silva via llvm-dev llvm-dev at lists.llvm.org
Tue Apr 19 23:48:37 PDT 2016


On Tue, Apr 19, 2016 at 10:44 PM, Derek Bruening via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> Please reference the prior RFC on EfficiencySanitizer.  This is one of the
> performance analysis tools we would like to build under the
> EfficiencySanitizer umbrella.
>
> ====================
> Motivation
> ====================
>
> Knowing the working set size at periodic points during a given
> application's execution helps to understand its cache behavior, how its
> behavior changes over time, how its performance might vary on different
> hardware configurations, and how best to direct performance improvement
> efforts.  For example, knowing whether the working set is close to fitting
> in current L3 caches or is many times larger can help determine which
> efforts are most likely to bear fruit.
>
> The typical approach to acquire working set information is to collect a
> memory reference profile and feed it to a cache simulator, looking for the
> point at which the miss rate drops to zero.  However, this incurs
> prohibitively high runtime overhead.  Our goal is to build a fast shadow
> memory-based working set measurement tool.
>
> ====================
> Approach
> ====================
>
> We focus on the data working set size, partly because it is typically more
> important and partly because to measure the instruction working set with
> compiler instrumentation is much more difficult to implement, requiring
> insertion at a later and more challenging point in the compilation pipeline.
>
> Because we want to know how the program’s working set affects the cache
> behavior, we define the working set as the set of cache lines that are
> referenced by the program during a certain period of time.
>
> Using the definition above, we are able to use a single shadow bit to
> represent whether a cache line is referenced or not.  If the cache line
> size is 64 bytes, we will use one bit to represent a 64 byte cache line via
> a 512 byte to 1 byte shadow mapping.
>
> For every memory reference, we insert code to set the shadow bits
> corresponding to the cache line(s) touched by that reference.  The
> resulting shadow memory bitmap describes the program’s working set.
>
> There are two optimizations that can be added:
>
> 1) Add a shadow bit check before writing each bit to reduce the number of
> memory stores.  In our experience this is always a win with shadow memory
> tools.
>
> 2) Analyze the memory references within a basic block to coalesce multiple
> memory references into a single shadow update.
>

Will this fire often? Very few memory references are known to the compiler
to be >=cacheline aligned, so I'm not seeing how the compiler will prove
that two memory references definitely land on the same cacheline.

-- Sean Silva


> We will divide the program execution into regular intervals and record a
> snapshot of the working set at each interval boundary.  An adaptive
> strategy can keep the number of snapshots bounded across varying total
> execution time by combining adjacent snapshots via logical or.  When a
> snapshot is recorded, the shadow memory is cleared so that the next
> snapshot starts with a blank slate.  If we use a 64 byte to 1 byte shadow
> mapping, we can use the upper bits to store up to 8 consecutive snapshots
> in the shadow memory itself by shifting rather than clearing the shadow
> memory on a snapshot.
>
> Recording the snapshot will use optimizations to avoid storing for the
> entire address space: only mapped regions will be saved.
>
> The report from the tool to the user at the end of execution would
> essentially be a histogram with time on the x axis and the cache lines
> touched on the y axis.  The units of time could be selected by the user:
> cpu time, wall clock time, memory access count, etc.  The unit determines
> the method of snapshot sampling, whether performance counters for the
> memory access count (or instrumentation to increment a global counter) or
> an itimer.  We can record callstacks with each snapshot as well to help
> give an indication of what the program is doing at that point in time, to
> help the user understand the program phases.
>
> We expect the time overhead of the tool to be well under the 5x
> EfficiencySanitizer ceiling; presumably it should be under 3x.
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160419/78fc935a/attachment.html>


More information about the llvm-dev mailing list