<div dir="ltr"><span id="gmail-docs-internal-guid-f41ea31d-3ff9-ba6a-e281-89d6293ce8db"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap">Please reference the prior RFC on EfficiencySanitizer.  This is one of the performance analysis tools we would like to build under the EfficiencySanitizer umbrella.</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap">====================</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap">Motivation</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap">====================</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap">An application is running sub-optimally if only part of the data brought into the cache is used, which we call cache fragmentation.  Knowing the cache fragmentation information during a given application's execution helps developers to understand the application’s cache behavior and how best to direct performance optimization efforts. </span><span style="font-size:14.6667px;font-family:arial;vertical-align:baseline;white-space:pre-wrap">For example, developers may reorder the struct fields based on the cache fragmentation information and hopefully improve cache hit ration and performance.</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap">====================</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap">Approach</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap">====================</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap">We focus on two ways to get cache fragmentation information:</span></p><ol style="margin-top:0pt;margin-bottom:0pt"><li dir="ltr" style="list-style-type:decimal;font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;vertical-align:baseline;white-space:pre-wrap">Struct field access patterns.</span></p></li><li dir="ltr" style="list-style-type:decimal;font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;vertical-align:baseline;white-space:pre-wrap">Heap/global object access patterns.</span></p></li></ol><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap">=====================</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap">Struct field access patterns</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap">=====================</span></p><br><ol style="margin-top:0pt;margin-bottom:0pt"><li dir="ltr" style="list-style-type:decimal;font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;vertical-align:baseline;white-space:pre-wrap">Get all the struct type information (e.g., via getIdentifiedStructTypes()), and create a counter for each field of each struct.</span></p></li><li dir="ltr" style="list-style-type:decimal;font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;vertical-align:baseline;white-space:pre-wrap">Instrument each GEP (GetElementPtr) instruction if it operates on a struct type to update the corresponding field counter.</span></p></li><li dir="ltr" style="list-style-type:decimal;font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;vertical-align:baseline;white-space:pre-wrap">At the program exit, filter and sort the struct field reference counters, and print the struct field hotness information for those structs deemed most likely to affect performance.  The sorting/filtering metric could include disparity between fields: hot fields interleaved with cold fields, with a total access count high enough to matter.</span></p></li></ol><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap">There are a few potential problems with this simple approach:</span></p><ul style="margin-top:0pt;margin-bottom:0pt"><li dir="ltr" style="list-style-type:disc;font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;vertical-align:baseline;white-space:pre-wrap">Overcount: a GEP instruction does not necessarily mean a field access.</span></p></li><li dir="ltr" style="list-style-type:disc;font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;vertical-align:baseline;white-space:pre-wrap">Undercount: a GEP instruction may lead to multiple field accesses, especially if the address is passed to another function as an argument.</span></p></li><li dir="ltr" style="list-style-type:disc;font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;vertical-align:baseline;white-space:pre-wrap">Racy update by multiple threads.</span></p></li></ul><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap">We want to keep the instrumentation simple in our initial implementation for both robustness and performance reasons, so we will defer any analysis (e.g., data flow analysis) to later stages. Any suggestions on how to improve the accuracy are welcome.</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap">There is one simple improvement we may want to explore: the temporal locality of struct field accesses.</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap">Two struct fields being hot (i.e., frequently accessed) does not necessarily mean they are accessed together. We want to know the affinity among those struct fields, which could be determined via a sampling approach: track which fields are accessed together during the last period at each sample, and update an affinity table for the final report.</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap">We expect the time overhead of the tool to be well under the 5x EfficiencySanitizer ceiling; presumably it should be under 2x.</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap">===========================</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap">Heap/global object access patterns</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap">===========================</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap">We plan to use shadow memory and sampling to keep track of heap/global object accesses.</span></p><br><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap;background-color:transparent">Shadow memory:</span><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap;background-color:transparent"><br class="gmail-kix-line-break"></span><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap;background-color:transparent">We use a 4byte-to-1byte shadow mapping.  Each application word is mapped to a </span></span><span id="gmail-docs-internal-guid-f41ea31d-3ff9-f075-571d-cc99b0cb38d0"><ul style="margin-top:0pt;margin-bottom:0pt"><li dir="ltr" style="list-style-type:disc;font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;background-color:transparent"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;vertical-align:baseline;white-space:pre-wrap;background-color:transparent">shadow byte, and so a 64-byte cache line is mapped to a 16-byte shadow memory. In each shadow byte, the highest bit is used for indicating whether the corresponding application word is accessed, and the other 7 bits are used as a counter for the hotness of the application word.</span><span style="font-size:14.6667px;vertical-align:baseline;white-space:pre-wrap;background-color:transparent"><br class="gmail-kix-line-break"><br class="gmail-kix-line-break"></span></p></li><li dir="ltr" style="list-style-type:disc;font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;background-color:transparent"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;vertical-align:baseline;white-space:pre-wrap;background-color:transparent">Instrumentation:</span><span style="font-size:14.6667px;vertical-align:baseline;white-space:pre-wrap;background-color:transparent"><br class="gmail-kix-line-break"></span><span style="font-size:14.6667px;vertical-align:baseline;white-space:pre-wrap;background-color:transparent">On every memory reference, the instrumented code simply checks if the highest bit is set. If not, the code sets it using an OR operation. We will live with races in updating shadow memory bits.</span></p></li><li dir="ltr" style="list-style-type:disc;font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;background-color:transparent"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;vertical-align:baseline;white-space:pre-wrap;background-color:transparent">Sampling:</span><span style="font-size:14.6667px;vertical-align:baseline;white-space:pre-wrap;background-color:transparent"><br class="gmail-kix-line-break"></span><span style="font-size:14.6667px;vertical-align:baseline;white-space:pre-wrap;background-color:transparent">On each sample we scan the shadow memory. If the highest bit of a shadow byte is set, we increment the 7-bit counter (to the maximum of 127; if this is found to be too small we could use separate storage for an additional counter for hot fields).</span></p></li><li dir="ltr" style="list-style-type:disc;font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;background-color:transparent"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;vertical-align:baseline;white-space:pre-wrap;background-color:transparent">Memory allocation wrapping:</span><span style="font-size:14.6667px;vertical-align:baseline;white-space:pre-wrap;background-color:transparent"><br class="gmail-kix-line-break"></span><span style="font-size:14.6667px;vertical-align:baseline;white-space:pre-wrap;background-color:transparent">When a heap object is freed, we acquire the callstack and its access pattern in the shadow memory. We may coalesce them based on the allocation/free callstack.</span></p></li></ul><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap;background-color:transparent">The report from the tool to the user at the end of execution would essentially be a list of objects that have some significant fragmented access pattern. </span><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap">We expect the time overhead of the tool to be well under the 5x EfficiencySanitizer ceiling; presumably it should be under 3x.</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap">We plan to implement both struct field access tracking and shadow based heap/global object access tracking. In our initial implementation, we plan to provide both results to developers simultaneously.</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap">There are a number of alternative approaches that could be explored, including a 4byte:4byte 32-bit hotness counter per 4-byte field, or a 1byte:1bit bitmap for field byte granularity with sampling.  Extensions to the proposals above could also be explored in the future, such as combining the struct and shadow modes for better results. Additionally, we may use the 7 shadow bits differently to track temporal locality information instead. Any suggestions are also welcome.</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap"><br></span></p><p style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:14.6667px;font-family:arial;color:rgb(0,0,0);vertical-align:baseline;white-space:pre-wrap">-- Qin<br></span></p></span></div>