[llvm-dev] RFC: EfficiencySanitizer working set tool

Sean Silva via llvm-dev llvm-dev at lists.llvm.org
Wed Apr 20 20:39:38 PDT 2016


On Wed, Apr 20, 2016 at 5:51 PM, Qin Zhao <zhaoqin at google.com> wrote:

>
> On Apr 20, 2016 8:00 PM, "Sean Silva" <chisophugis at gmail.com> wrote:
> >
> >
> >
> > On Wed, Apr 20, 2016 at 8:14 AM, Qin Zhao <zhaoqin at google.com> wrote:
> >>
> >>
> >>
> >> On Wed, Apr 20, 2016 at 2:48 AM, Sean Silva <chisophugis at gmail.com>
> wrote:
> >>>
> >>>
> >>>
> >>> On Tue, Apr 19, 2016 at 10:44 PM, Derek Bruening via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
> >>>>
> >>>> Please reference the prior RFC on EfficiencySanitizer.  This is one
> of the performance analysis tools we would like to build under the
> EfficiencySanitizer umbrella.
> >>>>
> >>>> ====================
> >>>> Motivation
> >>>> ====================
> >>>>
> >>>> Knowing the working set size at periodic points during a given
> application's execution helps to understand its cache behavior, how its
> behavior changes over time, how its performance might vary on different
> hardware configurations, and how best to direct performance improvement
> efforts.  For example, knowing whether the working set is close to fitting
> in current L3 caches or is many times larger can help determine which
> efforts are most likely to bear fruit.
> >>>>
> >>>> The typical approach to acquire working set information is to collect
> a memory reference profile and feed it to a cache simulator, looking for
> the point at which the miss rate drops to zero.  However, this incurs
> prohibitively high runtime overhead.  Our goal is to build a fast shadow
> memory-based working set measurement tool.
> >>>>
> >>>> ====================
> >>>> Approach
> >>>> ====================
> >>>>
> >>>> We focus on the data working set size, partly because it is typically
> more important and partly because measuring the instruction working set
> with compiler instrumentation is much more difficult, requiring
> instrumentation at a later and more challenging point in the compilation
> pipeline.
> >>>>
> >>>> Because we want to know how the program’s working set affects the
> cache behavior, we define the working set as the set of cache lines that
> are referenced by the program during a certain period of time.
> >>>>
> >>>> Using the definition above, we are able to use a single shadow bit to
> represent whether a cache line is referenced or not.  If the cache line
> size is 64 bytes, we will use one bit per 64-byte cache line via a
> 512-byte-to-1-byte shadow mapping.
> >>>>
> >>>> For every memory reference, we insert code to set the shadow bits
> corresponding to the cache line(s) touched by that reference.  The
> resulting shadow memory bitmap describes the program’s working set.
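> >>>>
> >>>> As a concrete illustration, the inserted code might look roughly like
> this (a minimal sketch; kShadowOffset and the exact mapping function are
> hypothetical placeholders, not a final scheme):
> >>>>
> >>>>   // 512-byte-to-1-byte mapping: each shadow byte covers eight
> >>>>   // 64-byte cache lines; a bit selects the line within the group.
> >>>>   typedef unsigned long uptr;
> >>>>   extern const uptr kShadowOffset;  // hypothetical mapping offset
> >>>>
> >>>>   void markCacheLine(uptr Addr) {
> >>>>     unsigned char *Shadow = (unsigned char *)((Addr >> 9) + kShadowOffset);
> >>>>     *Shadow |= 1 << ((Addr >> 6) & 7);  // mark the line holding Addr
> >>>>   }
> >>>>
> >>>> A reference that straddles a cache line boundary would set the bits
> for both lines it touches.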
> >>>>
> >>>> There are two optimizations that can be added:
> >>>>
> >>>> 1) Add a shadow bit check before writing each bit to reduce the
> number of memory stores.  In our experience this is always a win with
> shadow memory tools.  (A sketch of this check follows the list.)
> >>>>
> >>>> 2) Analyze the memory references within a basic block to coalesce
> multiple memory references into a single shadow update.
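> >>>>
> >>>> For optimization 1, the C-level equivalent of the inserted code would
> be roughly (a sketch, not the actual instrumentation sequence):
> >>>>
> >>>>   // check before writing: once a line's bit is set, later references
> >>>>   // to it cost a load plus a predictable branch instead of a store
> >>>>   void markCacheLineChecked(unsigned char *Shadow, unsigned char Mask) {
> >>>>     if (!(*Shadow & Mask))
> >>>>       *Shadow |= Mask;
> >>>>   }
> >>>>
> >>>> Note that the branch splits the original basic block, which comes up
> again in the discussion below.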
> >>>
> >>>
> >>> Will this fire often? Very few memory references are known to the
> compiler to be >=cacheline aligned, so I'm not seeing how the compiler will
> prove that two memory references definitely land on the same cacheline.
> >>>
> >>> -- Sean Silva
> >>
> >>
> >> I am not sure what you mean by "fire often".
> >> This is an optimization performed during the instrumentation pass.
> >
> >
> > I mean, I don't expect many opportunities during instrumentation where
> you can "coalesce multiple memory references into a single shadow update".
> The examples you give below are good, but I would have described them
> differently: "example 1" does not involve coalescing the store to shadow,
> and "example 2" involves multiple shadow updates (which store is coalesced
> into which "single shadow update"?).
> >
>
> Example 1 allows us to coalesce up to 4 consecutive cache lines' shadow
> updates into one, since we operate at byte granularity.  Though there
> might be few such opportunities.
>
Ah, that makes perfect sense (can e.g. just do an unaligned 4-byte
operation).
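
Something like this, I imagine (a rough sketch, assuming the 64 byte to 1
byte mapping, a target that tolerates unaligned stores, and the same
hypothetical kShadowOffset as before):

  extern const unsigned long kShadowOffset;  // hypothetical

  // one unaligned 4-byte store marks 4 consecutive cache lines, since
  // consecutive lines map to consecutive shadow bytes in this mapping
  void markFourLines(unsigned long Addr) {
    unsigned char *Shadow = (unsigned char *)((Addr >> 6) + kShadowOffset);
    *(unsigned int *)Shadow |= 0x01010101U;  // bit 0 of each of 4 bytes
  }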

-- Sean Silva

> You are right, example 2 is more of skipping than coalescing.
>
> >>
> >> If you are asking how many opportunities there are, we do not know,
> because we have not yet implemented it.  And we probably won't implement
> it until we have the basic instrumentation in place and have evaluated
> its performance and results.
> >>
> >> But I think there are opportunities:
> >> Example 1: if two references access addresses Addr and Addr+64 (the
> cache line size), then we only need one shadow address calculation, for
> the first one; the second one's shadow address can be obtained by adding 1.
> >> Example 2: if three references are Addr, Addr+16, and Addr+32, we can
> skip instrumentation for Addr+16 without proving which cache line it
> belongs to: its line must be the same as that of Addr or Addr+32, since
> the two bracketing references are less than 64 bytes apart.  (See the
> sketch below.)
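> >>
> >> A sketch of that check (a hypothetical helper; offsets are relative to
> a common base, and access sizes are ignored for simplicity):
> >>
> >>   // Mid's cache line must equal Lo's or Hi's whenever the two
> >>   // bracketing references are less than one 64-byte line apart
> >>   bool canSkip(long Lo, long Mid, long Hi) {
> >>     return Lo <= Mid && Mid <= Hi && Hi - Lo < 64;
> >>   }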
> >
> >
> >
> > Thanks, these are good examples, although I wonder what it would take
> for the regular passes in the optimizer to optimize these opportunities
> appropriately (if they don't already).
> >
>
> My experience is more with dynamic instrumentation, so when I think about
> instrumentation, I think less about later optimization passes.
>
> The check-and-write optimization we added may split the original basic
> block into multiple basic blocks, which may make it harder for an
> optimizer to operate on the instrumented code.
>
> For example 1, I think an optimizer might not know how to coalesce those
> shadow updates.  It is likely that I am wrong; we need to implement this
> and see.
>
> > -- Sean Silva
> >
> >>
> >>
> >> -- Qin
> >>
> >>>
> >>>
> >>>>
> >>>> We will divide the program execution into regular intervals and
> record a snapshot of the working set at each interval boundary.  An
> adaptive strategy can keep the number of snapshots bounded across varying
> total execution time by combining adjacent snapshots via logical or.  When
> a snapshot is recorded, the shadow memory is cleared so that the next
> snapshot starts with a blank slate.  If we use a 64 byte to 1 byte shadow
> mapping, we can use the upper bits to store up to 8 consecutive snapshots
> in the shadow memory itself by shifting rather than clearing the shadow
> memory on a snapshot.
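> >>>>
> >>>> For the 64 byte to 1 byte mapping, the snapshot step might look like
> this (a sketch; the real tool would walk only the mapped shadow regions):
> >>>>
> >>>>   // bit 0 is the live working-set bit; bits 1-7 hold the previous 7
> >>>>   // snapshots.  Shifting retires the oldest snapshot and clears
> >>>>   // bit 0 to start the next interval with a blank slate.
> >>>>   void takeSnapshot(unsigned char *Begin, unsigned char *End) {
> >>>>     for (unsigned char *P = Begin; P != End; ++P)
> >>>>       *P <<= 1;
> >>>>   }
> >>>>
> >>>> Merging two adjacent snapshots for the adaptive strategy is then a
> byte-wise logical or of the corresponding snapshot bits.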
> >>>>
> >>>> Recording the snapshot will use optimizations to avoid storing for
> the entire address space: only mapped regions will be saved.
> >>>>
> >>>> The report from the tool to the user at the end of execution would
> essentially be a histogram with time on the x axis and the cache lines
> touched on the y axis.  The unit of time could be selected by the user:
> cpu time, wall clock time, memory access count, etc.  The unit determines
> the method of snapshot sampling: performance counters (or instrumentation
> that increments a global counter) for the memory access count, or an
> itimer for the time-based units.  We can also record callstacks with each
> snapshot to give an indication of what the program is doing at that point
> in time, helping the user understand the program's phases.
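> >>>>
> >>>> For the timer-based units, the sampling hook might be set up roughly
> like this (a sketch; making the snapshot work safe to run from a signal
> handler is glossed over):
> >>>>
> >>>>   #include <csignal>
> >>>>   #include <sys/time.h>
> >>>>
> >>>>   extern void takeSnapshotFromSignal();  // hypothetical snapshot hook
> >>>>
> >>>>   static void onTimer(int) { takeSnapshotFromSignal(); }
> >>>>
> >>>>   void startSampling(long Ms) {
> >>>>     signal(SIGALRM, onTimer);  // ITIMER_VIRTUAL/SIGVTALRM for cpu time
> >>>>     itimerval TV = {};
> >>>>     TV.it_interval.tv_sec = Ms / 1000;
> >>>>     TV.it_interval.tv_usec = (Ms % 1000) * 1000;
> >>>>     TV.it_value = TV.it_interval;
> >>>>     setitimer(ITIMER_REAL, &TV, nullptr);
> >>>>   }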
> >>>>
> >>>> We expect the time overhead of the tool to be well under the 5x
> EfficiencySanitizer ceiling; presumably it should be under 3x.