[llvm-dev] RFC: EfficiencySanitizer Cache Fragmentation tool

Fri Apr 22 14:59:00 PDT 2016

Please reference the prior RFC on EfficiencySanitizer.  This is one of the
performance analysis tools we would like to build under the
EfficiencySanitizer umbrella.

====================

Motivation

====================

An application is running sub-optimally if only part of the data brought
into the cache is used, which we call cache fragmentation.  Knowing the
cache fragmentation information during a given application's execution
helps developers to understand the application’s cache behavior and how
best to direct performance optimization efforts. For example, developers
may reorder the struct fields based on the cache fragmentation information
and hopefully improve cache hit ration and performance.

====================

Approach

====================

We focus on two ways to get cache fragmentation information:

   1.

   Struct field access patterns.
   2.

   Heap/global object access patterns.

=====================

Struct field access patterns

=====================

   1.

   Get all the struct type information (e.g., via
   getIdentifiedStructTypes()), and create a counter for each field of each
   struct.
   2.

   Instrument each GEP (GetElementPtr) instruction if it operates on a
   struct type to update the corresponding field counter.
   3.

   At the program exit, filter and sort the struct field reference
   counters, and print the struct field hotness information for those structs
   deemed most likely to affect performance.  The sorting/filtering metric
   could include disparity between fields: hot fields interleaved with cold
   fields, with a total access count high enough to matter.

There are a few potential problems with this simple approach:

   -

   Overcount: a GEP instruction does not necessarily mean a field access.
   -

   Undercount: a GEP instruction may lead to multiple field accesses,
   especially if the address is passed to another function as an argument.
   -

   Racy update by multiple threads.

We want to keep the instrumentation simple in our initial implementation
for both robustness and performance reasons, so we will defer any analysis
(e.g., data flow analysis) to later stages. Any suggestions on how to
improve the accuracy are welcome.

There is one simple improvement we may want to explore: the temporal
locality of struct field accesses.

Two struct fields being hot (i.e., frequently accessed) does not
necessarily mean they are accessed together. We want to know the affinity
among those struct fields, which could be determined via a sampling
approach: track which fields are accessed together during the last period
at each sample, and update an affinity table for the final report.

We expect the time overhead of the tool to be well under the 5x
EfficiencySanitizer ceiling; presumably it should be under 2x.

===========================

Heap/global object access patterns

===========================

We plan to use shadow memory and sampling to keep track of heap/global
object accesses.

Shadow memory:
We use a 4byte-to-1byte shadow mapping.  Each application word is mapped to
a

   -

   shadow byte, and so a 64-byte cache line is mapped to a 16-byte shadow
   memory. In each shadow byte, the highest bit is used for indicating whether
   the corresponding application word is accessed, and the other 7 bits are
   used as a counter for the hotness of the application word.

   -

   Instrumentation:
   On every memory reference, the instrumented code simply checks if the
   highest bit is set. If not, the code sets it using an OR operation. We will
   live with races in updating shadow memory bits.
   -

   Sampling:
   On each sample we scan the shadow memory. If the highest bit of a shadow
   byte is set, we increment the 7-bit counter (to the maximum of 127; if this
   is found to be too small we could use separate storage for an additional
   counter for hot fields).
   -

   Memory allocation wrapping:
   When a heap object is freed, we acquire the callstack and its access
   pattern in the shadow memory. We may coalesce them based on the
   allocation/free callstack.

The report from the tool to the user at the end of execution would
essentially be a list of objects that have some significant fragmented
access pattern. We expect the time overhead of the tool to be well under
the 5x EfficiencySanitizer ceiling; presumably it should be under 3x.

We plan to implement both struct field access tracking and shadow based
heap/global object access tracking. In our initial implementation, we plan
to provide both results to developers simultaneously.

There are a number of alternative approaches that could be explored,
including a 4byte:4byte 32-bit hotness counter per 4-byte field, or a
1byte:1bit bitmap for field byte granularity with sampling.  Extensions to
the proposals above could also be explored in the future, such as combining
the struct and shadow modes for better results. Additionally, we may use
the 7 shadow bits differently to track temporal locality information
instead. Any suggestions are also welcome.

-- Qin
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160422/7f25658f/attachment-0001.html>