[llvm-dev] RFC: EfficiencySanitizer working set tool

Wed Apr 20 09:09:13 PDT 2016

On Wed, Apr 20, 2016 at 7:44 AM, Derek Bruening via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> Please reference the prior RFC on EfficiencySanitizer.  This is one of the
> performance analysis tools we would like to build under the
> EfficiencySanitizer umbrella.
>
> ====================
> Motivation
> ====================
>
> Knowing the working set size at periodic points during a given application's
> execution helps to understand its cache behavior, how its behavior changes
> over time, how its performance might vary on different hardware
> configurations, and how best to direct performance improvement efforts.  For
> example, knowing whether the working set is close to fitting in current L3
> caches or is many times larger can help determine which efforts are most
> likely to bear fruit.
>
> The typical approach to acquire working set information is to collect a
> memory reference profile and feed it to a cache simulator, looking for the
> point at which the miss rate drops to zero.  However, this incurs
> prohibitively high runtime overhead.  Our goal is to build a fast shadow
> memory-based working set measurement tool.
>
> ====================
> Approach
> ====================
>
> We focus on the data working set size, partly because it is typically more
> important and partly because to measure the instruction working set with
> compiler instrumentation is much more difficult to implement, requiring
> insertion at a later and more challenging point in the compilation pipeline.
>
> Because we want to know how the program’s working set affects the cache
> behavior, we define the working set as the set of cache lines that are
> referenced by the program during a certain period of time.
>
> Using the definition above, we are able to use a single shadow bit to
> represent whether a cache line is referenced or not.  If the cache line size
> is 64 bytes, we will use one bit to represent a 64 byte cache line via a 512
> byte to 1 byte shadow mapping.
I've raised the concern about problems with this approach in
multithreaded environment, but this particular tool is a good example,
so we can discuss those problems here.
Your approach suggests storing the state of 512 cache lines in a
single shadow cache line. So, if I'm understanding the algorithm
correctly, parallel accesses to 512 adjacent cache lines from
different CPUs will cause unnecessary contention on that shadow cache
line, which will presumably degrade the application's performance.
Also note that because we don't have atomic bitwise operations,
updates of shadow bytes will require a CAS loop.
I'm not sure about the performance impact of the above issues, but
there might be a tradeoff between the shadow memory scale and the
slowdown.

> For every memory reference, we insert code to set the shadow bits
> corresponding to the cache line(s) touched by that reference.  The resulting
> shadow memory bitmap describes the program’s working set.
>
> There are two optimizations that can be added:
>
> 1) Add a shadow bit check before writing each bit to reduce the number of
> memory stores.  In our experience this is always a win with shadow memory
> tools.
>
> 2) Analyze the memory references within a basic block to coalesce multiple
> memory references into a single shadow update.
>
> We will divide the program execution into regular intervals and record a
> snapshot of the working set at each interval boundary.  An adaptive strategy
> can keep the number of snapshots bounded across varying total execution time
> by combining adjacent snapshots via logical or.  When a snapshot is
> recorded, the shadow memory is cleared so that the next snapshot starts with
> a blank slate.  If we use a 64 byte to 1 byte shadow mapping, we can use the
> upper bits to store up to 8 consecutive snapshots in the shadow memory
> itself by shifting rather than clearing the shadow memory on a snapshot.
>
> Recording the snapshot will use optimizations to avoid storing for the
> entire address space: only mapped regions will be saved.
>
> The report from the tool to the user at the end of execution would
> essentially be a histogram with time on the x axis and the cache lines
> touched on the y axis.  The units of time could be selected by the user: cpu
> time, wall clock time, memory access count, etc.  The unit determines the
> method of snapshot sampling, whether performance counters for the memory
> access count (or instrumentation to increment a global counter) or an
> itimer.  We can record callstacks with each snapshot as well to help give an
> indication of what the program is doing at that point in time, to help the
> user understand the program phases.
>
> We expect the time overhead of the tool to be well under the 5x
> EfficiencySanitizer ceiling; presumably it should be under 3x.
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>


-- 
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Matthew Scott Sucherman, Paul Terence Manicle
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg