[llvm-dev] RFC: EfficiencySanitizer Cache Fragmentation tool

Fri Apr 22 17:18:03 PDT 2016

----- Original Message -----

> From: "Hal Finkel via llvm-dev" <llvm-dev at lists.llvm.org>
> To: "Qin Zhao" <zhaoqin at google.com>
> Cc: "llvm-dev" <llvm-dev at lists.llvm.org>
> Sent: Friday, April 22, 2016 7:13:38 PM
> Subject: Re: [llvm-dev] RFC: EfficiencySanitizer Cache Fragmentation
> tool

> ----- Original Message -----

> > From: "Qin Zhao via llvm-dev" <llvm-dev at lists.llvm.org>
> 
> > To: "llvm-dev" <llvm-dev at lists.llvm.org>
> 
> > Sent: Friday, April 22, 2016 4:59:00 PM
> 
> > Subject: [llvm-dev] RFC: EfficiencySanitizer Cache Fragmentation
> > tool
> 

> > Please reference the prior RFC on EfficiencySanitizer. This is one
> > of
> > the performance analysis tools we would like to build under the
> > EfficiencySanitizer umbrella.
> 

> > ====================
> 
> > Motivation
> 
> > ====================
> 

> > An application is running sub-optimally if only part of the data
> > brought into the cache is used, which we call cache fragmentation.
> > Knowing the cache fragmentation information during a given
> > application's execution helps developers to understand the
> > application’s cache behavior and how best to direct performance
> > optimization efforts. For example, developers may reorder the
> > struct
> > fields based on the cache fragmentation information and hopefully
> > improve cache hit ration and performance.
> 

> > ====================
> 
> > Approach
> 
> > ====================
> 

> > We focus on two ways to get cache fragmentation information:
> 

> > 1.
> 
> > Struct field access patterns.
> 
> > 2.
> 
> > Heap/global object access patterns.
> 

> > =====================
> 
> > Struct field access patterns
> 
> > =====================
> 

> > 1.
> 
> > Get all the struct type information (e.g., via
> > getIdentifiedStructTypes()), and create a counter for each field of
> > each struct.
> 
> > 2.
> 
> > Instrument each GEP (GetElementPtr) instruction if it operates on a
> > struct type to update the corresponding field counter.
> 
> > 3.
> 
> > At the program exit, filter and sort the struct field reference
> > counters, and print the struct field hotness information for those
> > structs deemed most likely to affect performance. The
> > sorting/filtering metric could include disparity between fields:
> > hot
> > fields interleaved with cold fields, with a total access count high
> > enough to matter.
> 

> > There are a few potential problems with this simple approach:
> 

> > *
> 
> > Overcount: a GEP instruction does not necessarily mean a field
> > access.
> 
> > *
> 
> > Undercount: a GEP instruction may lead to multiple field accesses,
> > especially if the address is passed to another function as an
> > argument.
> 

You can ignore my message; I misread the distinction here. 

-Hal 

> > *
> 

> I can't tell from your description, but it sounds like you might also
> undercount accesses from escaped addresses (because these are later
> indistinguishable from heap accesses)?

> > *
> 

> > *
> 
> > Racy update by multiple threads.
> 

> > We want to keep the instrumentation simple in our initial
> > implementation for both robustness and performance reasons, so we
> > will defer any analysis (e.g., data flow analysis) to later stages.
> > Any suggestions on how to improve the accuracy are welcome.
> 
> I don't understand why you're using a separate mechanism here from
> what is being used for heap accesses. Why don't you use the same
> shadow-memory scheme here as you do for heap accesses (especially
> considering that escaped stack addresses will be counted this way
> anyway), and then upon function exit, collect the counts from the
> local stack? I think the necessary region is:

> [@llvm.frameaddress(0), @llvm.frameaddress(0) +
> @llvm.get.dynamic.area.offset())

> or you can call @llvm.stacksave() upon entry and use that as the base
> offset.

> -Hal

> > There is one simple improvement we may want to explore: the
> > temporal
> > locality of struct field accesses.
> 
> > Two struct fields being hot (i.e., frequently accessed) does not
> > necessarily mean they are accessed together. We want to know the
> > affinity among those struct fields, which could be determined via a
> > sampling approach: track which fields are accessed together during
> > the last period at each sample, and update an affinity table for
> > the
> > final report.
> 

> > We expect the time overhead of the tool to be well under the 5x
> > EfficiencySanitizer ceiling; presumably it should be under 2x.
> 

> > ===========================
> 
> > Heap/global object access patterns
> 
> > ===========================
> 

> > We plan to use shadow memory and sampling to keep track of
> > heap/global object accesses.
> 
> > Shadow memory:
> 
> > We use a 4byte-to-1byte shadow mapping. Each application word is
> > mapped to a
> 

> > *
> 
> > shadow byte, and so a 64-byte cache line is mapped to a 16-byte
> > shadow memory. In each shadow byte, the highest bit is used for
> > indicating whether the corresponding application word is accessed,
> > and the other 7 bits are used as a counter for the hotness of the
> > application word.
> 
> > *
> 
> > Instrumentation: On every memory reference, the instrumented code
> > simply checks if the highest bit is set. If not, the code sets it
> > using an OR operation. We will live with races in updating shadow
> > memory bits.
> 
> > *
> 
> > Sampling: On each sample we scan the shadow memory. If the highest
> > bit of a shadow byte is set, we increment the 7-bit counter (to the
> > maximum of 127; if this is found to be too small we could use
> > separate storage for an additional counter for hot fields).
> 
> > *
> 
> > Memory allocation wrapping: When a heap object is freed, we acquire
> > the callstack and its access pattern in the shadow memory. We may
> > coalesce them based on the allocation/free callstack.
> 

> > The report from the tool to the user at the end of execution would
> > essentially be a list of objects that have some significant
> > fragmented access pattern. We expect the time overhead of the tool
> > to be well under the 5x EfficiencySanitizer ceiling; presumably it
> > should be under 3x.
> 

> > We plan to implement both struct field access tracking and shadow
> > based heap/global object access tracking. In our initial
> > implementation, we plan to provide both results to developers
> > simultaneously.
> 

> > There are a number of alternative approaches that could be
> > explored,
> > including a 4byte:4byte 32-bit hotness counter per 4-byte field, or
> > a 1byte:1bit bitmap for field byte granularity with sampling.
> > Extensions to the proposals above could also be explored in the
> > future, such as combining the struct and shadow modes for better
> > results. Additionally, we may use the 7 shadow bits differently to
> > track temporal locality information instead. Any suggestions are
> > also welcome.
> 

> > -- Qin
> 

> > _______________________________________________
> 
> > LLVM Developers mailing list
> 
> > llvm-dev at lists.llvm.org
> 
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> 

> --

> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory

> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

-- 

Hal Finkel 
Assistant Computational Scientist 
Leadership Computing Facility 
Argonne National Laboratory 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160422/d57f4278/attachment-0001.html>