[llvm-dev] RFC: EfficiencySanitizer Cache Fragmentation tool

Hal Finkel via llvm-dev llvm-dev at lists.llvm.org
Fri Apr 22 22:50:24 PDT 2016


----- Original Message -----
> From: "Hal Finkel" <hfinkel at anl.gov>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "llvm-dev" <llvm-dev at lists.llvm.org>, efficiency-sanitizer at google.com, "Qin Zhao" <zhaoqin at google.com>
> Sent: Friday, April 22, 2016 11:53:24 PM
> Subject: Re: [llvm-dev] RFC: EfficiencySanitizer Cache Fragmentation tool
> 
> ----- Original Message -----
> > From: "Hal Finkel via llvm-dev" <llvm-dev at lists.llvm.org>
> > To: "Qin Zhao" <zhaoqin at google.com>
> > Cc: "llvm-dev" <llvm-dev at lists.llvm.org>,
> > efficiency-sanitizer at google.com
> > Sent: Friday, April 22, 2016 10:43:44 PM
> > Subject: Re: [llvm-dev] RFC: EfficiencySanitizer Cache
> > Fragmentation tool
> > 
> > ----- Original Message -----
> > > From: "Qin Zhao" <zhaoqin at google.com>
> > > To: "Hal Finkel" <hfinkel at anl.gov>, "Derek Bruening"
> > > <bruening at google.com>, efficiency-sanitizer at google.com
> > > Cc: "llvm-dev" <llvm-dev at lists.llvm.org>
> > > Sent: Friday, April 22, 2016 10:26:32 PM
> > > Subject: Re: [llvm-dev] RFC: EfficiencySanitizer Cache
> > > Fragmentation tool
> > > 
> > > 
> > > Thanks for the comments and suggestions. That's a great idea!
> > > 
> > > We actually thought about using each heap object's type information
> > > for more accurate data, and it is definitely in our future plans.
> > > However, there are challenges in doing so. For example, getting
> > > correct type information for each heap object might not be easy,
> > > especially for C programs, and an application may even use a custom
> > > allocator, which makes things harder. Please let me know if there is
> > > a way to get correct type information for heap objects. There are
> > > also other issues, like how to count a union of two struct types.
> > > 
> > > 
> > > I want to collect good-enough information with minimal overhead, so
> > > I need to evaluate the overhead and the quality of the profile for
> > > each approach. Maybe the simple field-counting approach is good
> > > enough; we need to implement and evaluate it.
> > 
> > Good point: at the deallocation site you generally can't see the
> > loads, so you don't have the type information there.
> > 
> > I think the best option here is to put this part of the lowering in
> > the frontend. The frontend can see the pointer type where the delete
> > operator is called, and can generate the necessary 'accumulation'
> > function based on that. This will give a higher-quality result than
> > using IR-level structure fields for two reasons: 1) the frontend
> > knows how to ignore padding, and 2) the frontend knows the proper
> > names of the fields. The frontend can also deal in a reasonable way
> > with unions (although I doubt you'll ever get a great solution
> > here).
> 
> To reiterate a point you made, however, this does not handle cases
> where a custom allocator creates space for some 'tail structure'. It
> is worse than that, though, because it does not do the useful thing
> for virtual base pointers in general. This probably does not matter
> much for many of my use cases (where we have arrays of structures or
> basic types), but in some environments it could be a significant
> issue. I believe that the virtual-base-pointer case is solvable by
> using RTTI (or some similar mechanism).

There's a possibly-elegant way to handle this problem: For classes with a virtual destructor, the code to sum the shadow-memory counts into field counts goes in the destructor. Otherwise, it gets injected in the call to the delete operator (or any explicit destructor call, which should catch a lot of the 'tail structure' cases too).
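
Concretely, a minimal sketch in C++ of what that injected code could look like (the runtime hook and type id here are hypothetical, not actual EfficiencySanitizer names):

  #include <cstddef>

  // Hypothetical runtime hook: fold the shadow counts covering one
  // object into the per-field counters for the given type.
  extern "C" void __esan_accumulate_fields(const void *obj, size_t size,
                                           unsigned type_id);

  constexpr unsigned kBaseTypeId = 1; // assumed compiler-assigned id

  struct Base {
    long id;
    virtual ~Base() {
      // Appended by the instrumentation: because the destructor is
      // virtual, this runs with the correct dynamic type even when the
      // object is deleted through a base-class pointer.
      __esan_accumulate_fields(this, sizeof(Base), kBaseTypeId);
    }
  };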

 -Hal

> I'm not yet sure how tail
> structures could be handled under this scheme.
> 
>  -Hal
> 
> > 
> > Thanks again,
> > Hal
> >  
> > > 
> > > -- Qin
> > > 
> > > On Fri, Apr 22, 2016 at 8:38 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> > > 
> > > From: "Hal Finkel" <hfinkel at anl.gov>
> > > To: "Hal Finkel" <hfinkel at anl.gov>
> > > Cc: "llvm-dev" <llvm-dev at lists.llvm.org>, "Qin Zhao" <zhaoqin at google.com>
> > > Sent: Friday, April 22, 2016 7:18:03 PM
> > > Subject: Re: [llvm-dev] RFC: EfficiencySanitizer Cache Fragmentation tool
> > > 
> > > From: "Hal Finkel via llvm-dev" <llvm-dev at lists.llvm.org>
> > > To: "Qin Zhao" <zhaoqin at google.com>
> > > Cc: "llvm-dev" <llvm-dev at lists.llvm.org>
> > > Sent: Friday, April 22, 2016 7:13:38 PM
> > > Subject: Re: [llvm-dev] RFC: EfficiencySanitizer Cache Fragmentation tool
> > > 
> > > From: "Qin Zhao via llvm-dev" <llvm-dev at lists.llvm.org>
> > > To: "llvm-dev" <llvm-dev at lists.llvm.org>
> > > Sent: Friday, April 22, 2016 4:59:00 PM
> > > Subject: [llvm-dev] RFC: EfficiencySanitizer Cache Fragmentation tool
> > > 
> > > Please reference the prior RFC on EfficiencySanitizer. This is one
> > > of the performance analysis tools we would like to build under the
> > > EfficiencySanitizer umbrella.
> > > 
> > > 
> > > ====================
> > > Motivation
> > > ====================
> > > 
> > > 
> > > An application is running sub-optimally if only part of the data
> > > brought into the cache is used, a problem we call cache
> > > fragmentation. Knowing the cache fragmentation during a given
> > > application's execution helps developers understand the
> > > application's cache behavior and how best to direct performance
> > > optimization efforts. For example, developers may reorder struct
> > > fields based on the cache fragmentation information and hopefully
> > > improve the cache hit ratio and performance.
> > > 
> > > 
> > > ====================
> > > Approach
> > > ====================
> > > 
> > > 
> > > We focus on two ways to get cache fragmentation information:
> > > 
> > >  1. Struct field access patterns.
> > >  2. Heap/global object access patterns.
> > > 
> > > 
> > > 
> > > =====================
> > > Struct field access patterns
> > > =====================
> > > 
> > >  1. Get all the struct type information (e.g., via
> > >     getIdentifiedStructTypes()), and create a counter for each
> > >     field of each struct.
> > > 
> > >  2. Instrument each GEP (GetElementPtr) instruction, if it operates
> > >     on a struct type, to update the corresponding field counter (a
> > >     sketch follows this list).
> > > 
> > >  3. At program exit, filter and sort the struct field reference
> > >     counters, and print the struct field hotness information for
> > >     those structs deemed most likely to affect performance. The
> > >     sorting/filtering metric could include disparity between
> > >     fields: hot fields interleaved with cold fields, with a total
> > >     access count high enough to matter.
> > > 
> > > 
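> > > A minimal sketch, in C++ against the LLVM API, of the GEP
> > > instrumentation in step 2 (the per-struct counter array and this
> > > code are assumptions for illustration, not the actual
> > > EfficiencySanitizer pass):
> > > 
> > >   #include "llvm/IR/Constants.h"
> > >   #include "llvm/IR/GlobalVariable.h"
> > >   #include "llvm/IR/IRBuilder.h"
> > >   #include "llvm/IR/Instructions.h"
> > >   using namespace llvm;
> > > 
> > >   // Bump the counter for the struct field addressed by GEP.
> > >   // Counters is assumed to be a global [N x i64] holding one slot
> > >   // per field of the struct that GEP indexes into.
> > >   static bool instrumentGEP(GetElementPtrInst *GEP,
> > >                             GlobalVariable *Counters) {
> > >     auto *STy = dyn_cast<StructType>(GEP->getSourceElementType());
> > >     if (!STy || STy->isOpaque() || GEP->getNumIndices() < 2)
> > >       return false;
> > >     // A struct field index is always a constant operand.
> > >     auto *FieldIdx = dyn_cast<ConstantInt>(GEP->getOperand(2));
> > >     if (!FieldIdx)
> > >       return false;
> > >     IRBuilder<> B(GEP);
> > >     Value *Slot = B.CreateConstInBoundsGEP2_64(
> > >         Counters->getValueType(), Counters, 0,
> > >         FieldIdx->getZExtValue());
> > >     // A plain, racy increment; see the note on racy updates below.
> > >     Value *Old = B.CreateLoad(B.getInt64Ty(), Slot);
> > >     B.CreateStore(B.CreateAdd(Old, B.getInt64(1)), Slot);
> > >     return true;
> > >   }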
> > > 
> > > There are a few potential problems with this simple approach:
> > > 
> > >  * Overcount: a GEP instruction does not necessarily mean a field
> > >    access.
> > >  * Undercount: a GEP instruction may lead to multiple field
> > >    accesses, especially if the address is passed to another
> > >    function as an argument.
> > > 
> > > 
> > > I changed my mind; don't ignore what I said. But let me propose a
> > > slightly-different design:
> > > 
> > > 1. Just as you've described, collect counts in shadow memory for
> > >    memory accesses.
> > > 2. For each relevant type, emit a function that knows how to
> > >    collect counts from an array of that type into the struct-field
> > >    counters (a sketch appears after the next paragraph).
> > > 3. When memory is freed, call the appropriate function to perform
> > >    this accumulation.
> > > 4. If desired, also collect counts from stack memory (as I
> > >    indicated below), although we'll need to call all of the
> > >    relevant summation functions, and this could get expensive if
> > >    they're not inlined.
> > > 
> > > In this way, we'll have no over-counting / under-counting problems;
> > > plus, this is easily extensible to collecting per-allocation
> > > struct-field access stats (which is often desired). We can get
> > > everything at the same time, and it might actually be less
> > > expensive when running on multiple cores (because there will be
> > > less cache-line contention for the struct-field counters).
> > > 
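> > > A minimal sketch, in C++, of the per-type function from step 2 (all
> > > names here are hypothetical, and the shadow lookup is assumed to
> > > follow the 4-byte-to-1-byte encoding described further below):
> > > 
> > >   #include <cstddef>
> > >   #include <cstdint>
> > > 
> > >   // Hypothetical runtime lookup: the shadow byte for an address.
> > >   extern "C" uint8_t *__esan_shadow(const void *addr);
> > > 
> > >   struct S { int hot; char pad[60]; int cold; };
> > >   extern uint64_t __esan_field_counters_S[2]; // one slot per field
> > > 
> > >   // Emitted by the compiler for type S; called when memory holding
> > >   // an array of n S objects is freed.
> > >   static void __esan_accumulate_S(const S *base, size_t n) {
> > >     for (size_t i = 0; i < n; ++i) {
> > >       // The low 7 bits of a shadow byte count the word's hotness.
> > >       __esan_field_counters_S[0] +=
> > >           *__esan_shadow(&base[i].hot) & 0x7f;
> > >       __esan_field_counters_S[1] +=
> > >           *__esan_shadow(&base[i].cold) & 0x7f;
> > >     }
> > >   }
> > > 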
> > > Thanks again,
> > > Hal
> > > 
> > > You can ignore my message; I misread the distinction here.
> > > 
> > > -Hal
> > > 
> > > I can't tell from your description, but it sounds like you might
> > > also undercount accesses from escaped addresses (because these are
> > > later indistinguishable from heap accesses)?
> > > 
> > >  * Racy update by multiple threads.
> > > 
> > > We want to keep the instrumentation simple in our initial
> > > implementation for both robustness and performance reasons, so we
> > > will defer any analysis (e.g., data-flow analysis) to later stages.
> > > Any suggestions on how to improve the accuracy are welcome.
> > > 
> > > I don't understand why you're using a separate mechanism here from
> > > what is being used for heap accesses. Why don't you use the same
> > > shadow-memory scheme here as you do for heap accesses (especially
> > > considering that escaped stack addresses will be counted this way
> > > anyway), and then, upon function exit, collect the counts from the
> > > local stack? I think the necessary region is:
> > > 
> > >   [@llvm.frameaddress(0), @llvm.frameaddress(0) + @llvm.get.dynamic.area.offset())
> > > 
> > > or you can call @llvm.stacksave() upon entry and use that as the
> > > base offset.
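> > > 
> > > As a minimal sketch (in C++, with a hypothetical runtime entry
> > > point; a real implementation would compute the frame extent from
> > > the intrinsics above rather than from a fixed local):
> > > 
> > >   #include <cstddef>
> > >   #include <cstdio>
> > > 
> > >   // Hypothetical runtime hook, stubbed so the sketch is
> > >   // self-contained: fold the shadow counts for a stack region into
> > >   // the field counters.
> > >   extern "C" void __esan_accumulate_region(const void *base,
> > >                                            size_t size) {
> > >     std::printf("accumulate stack region at %p, size %zu\n", base,
> > >                 size);
> > >   }
> > > 
> > >   void example() {
> > >     int locals[64]; // stack memory whose counts we want to fold in
> > >     locals[0] = 42;
> > >     // Injected epilogue: at the IR level, the region would be
> > >     // [@llvm.frameaddress(0), @llvm.frameaddress(0) +
> > >     //  @llvm.get.dynamic.area.offset()); here we just pass the one
> > >     // local we declared, to show the runtime hand-off.
> > >     __esan_accumulate_region(locals, sizeof(locals));
> > >   }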
> > > 
> > > -Hal
> > > 
> > > There is one simple improvement we may want to explore: the
> > > temporal locality of struct field accesses. Two struct fields being
> > > hot (i.e., frequently accessed) does not necessarily mean they are
> > > accessed together. We want to know the affinity among those struct
> > > fields, which could be determined via a sampling approach: at each
> > > sample, track which fields were accessed together during the last
> > > period, and update an affinity table for the final report.
> > > 
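> > > A minimal sketch, in C++, of such an affinity table (the table size
> > > and the instrumentation hook are assumptions for illustration):
> > > 
> > >   #include <bitset>
> > >   #include <cstdint>
> > > 
> > >   constexpr int kNumFields = 64;          // assumed table size
> > >   static uint64_t affinity[kNumFields][kNumFields]; // co-access counts
> > >   static std::bitset<kNumFields> touched; // set on each field access
> > > 
> > >   // Called at each sampling point: fields touched during the same
> > >   // period are treated as accessed together.
> > >   void on_sample() {
> > >     for (int i = 0; i < kNumFields; ++i)
> > >       for (int j = i + 1; j < kNumFields; ++j)
> > >         if (touched[i] && touched[j])
> > >           ++affinity[i][j];
> > >     touched.reset(); // start the next sampling period
> > >   }
> > > 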
> > > 
> > > We expect the time overhead of the tool to be well under the 5x
> > > EfficiencySanitizer ceiling; presumably it should be under 2x.
> > > 
> > > 
> > > ===========================
> > > Heap/global object access patterns
> > > ===========================
> > > 
> > > 
> > > We plan to use shadow memory and sampling to keep track of
> > > heap/global object accesses.
> > > 
> > >  * Shadow memory: We use a 4-byte-to-1-byte shadow mapping. Each
> > >    application word is mapped to a shadow byte, and so a 64-byte
> > >    cache line is mapped to 16 bytes of shadow memory. In each
> > >    shadow byte, the highest bit indicates whether the corresponding
> > >    application word has been accessed, and the other 7 bits are
> > >    used as a counter for the hotness of the application word.
> > > 
> > >  * Instrumentation: On every memory reference, the instrumented
> > >    code simply checks whether the highest bit is set. If not, the
> > >    code sets it using an OR operation. We will live with races in
> > >    updating shadow memory bits. (A sketch of this check-and-set,
> > >    and of the sampling scan, follows this list.)
> > > 
> > >  * Sampling: On each sample we scan the shadow memory. If the
> > >    highest bit of a shadow byte is set, we increment the 7-bit
> > >    counter (to a maximum of 127; if this is found to be too small,
> > >    we could use separate storage for an additional counter for hot
> > >    fields).
> > > 
> > >  * Memory allocation wrapping: When a heap object is freed, we
> > >    record its callstack and its access pattern from the shadow
> > >    memory. We may coalesce entries based on the allocation/free
> > >    callstack.
> > > 
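> > > A minimal sketch, in C++, of the check-and-set instrumentation and
> > > the sampling scan above (the application-to-shadow mapping is a
> > > placeholder; the real mapping is chosen by the runtime):
> > > 
> > >   #include <cstddef>
> > >   #include <cstdint>
> > > 
> > >   // Placeholder 4-byte-to-1-byte address translation.
> > >   static inline uint8_t *shadow_for(const void *addr) {
> > >     return (uint8_t *)(((uintptr_t)addr >> 2) + 0x100000000000ULL);
> > >   }
> > > 
> > >   // Inlined at every memory reference: mark the word as accessed.
> > >   static inline void on_access(const void *addr) {
> > >     uint8_t *s = shadow_for(addr);
> > >     if (!(*s & 0x80)) // highest bit: accessed since the last sample
> > >       *s |= 0x80;     // racy by design; we tolerate lost updates
> > >   }
> > > 
> > >   // Run over a shadow range at each sampling interval.
> > >   static void sample_scan(uint8_t *shadow, size_t len) {
> > >     for (size_t i = 0; i < len; ++i) {
> > >       if (shadow[i] & 0x80) {
> > >         uint8_t hot = shadow[i] & 0x7f;
> > >         if (hot < 127) ++hot; // saturate the 7-bit hotness counter
> > >         // One plausible policy: writing back the bare counter also
> > >         // clears the accessed bit for the next period.
> > >         shadow[i] = hot;
> > >       }
> > >     }
> > >   }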
> > > 
> > > 
> > > The report from the tool to the user at the end of execution would
> > > essentially be a list of objects that have some significantly
> > > fragmented access pattern. We expect the time overhead of the tool
> > > to be well under the 5x EfficiencySanitizer ceiling; presumably it
> > > should be under 3x.
> > > 
> > > 
> > > We plan to implement both struct field access tracking and
> > > shadow-based heap/global object access tracking. In our initial
> > > implementation, we plan to provide both results to developers
> > > simultaneously.
> > > 
> > > 
> > > There are a number of alternative approaches that could be
> > > explored, including a 4-byte:4-byte mapping with a 32-bit hotness
> > > counter per 4-byte field, or a 1-byte:1-bit bitmap for field byte
> > > granularity with sampling. Extensions to the proposals above could
> > > also be explored in the future, such as combining the struct and
> > > shadow modes for better results. Additionally, we may use the 7
> > > shadow bits differently, to track temporal locality information
> > > instead. Any suggestions are also welcome.
> > > 
> > > -- Qin
> > > 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

