<div dir="ltr"><div dir="ltr"><div>This change is part of a larger system, consisting of a cache prefetch recommender, <a href="https://github.com/google/autofdo">create_llvm_prof</a>, and LLVM.</div><div><br></div><div>A proof-of-concept recommender is <a href="https://github.com/DynamoRIO/dynamorio/blob/master/clients/drcachesim/simulator/cache_miss_analyzer.cpp">DynamoRIO's cache miss analyzer</a>. It processes memory access traces obtained from a running binary and identifies patterns in cache misses. Based on those patterns, it produces a CSV file with recommendations. The expectation is that, by leveraging such recommendations, we can reduce the number of clock cycles spent waiting for data from memory. A microbenchmark based on the DynamoRIO analyzer is available as a <a href="https://goo.gl/6TM2Xp">proof of concept</a>.</div><div><br></div><div>The recommender makes prefetch recommendations in terms of:</div><div><br></div><div>- the binary offset of an instruction with a memory operand;</div><div>- a delta;</div><div>- and a type (nta, t0, t1, t2).</div><div><br></div><div>This means that a prefetch of that type should be inserted right before the instruction at that binary offset, and that the prefetch should target the address that is delta bytes away from the memory address the instruction will access.</div><div><br></div><div>For example, the recommendation:</div><div><br></div><div>0x400ab2,64,nta</div><div><br></div><div>assuming the instruction at 0x400ab2 is:</div><div><br></div><div>movzbl (%rbx,%rdx,1),%edx</div><div><br></div><div>means that the recommender determined it would be beneficial for a prefetchnta instruction to be inserted right before this instruction, as follows:</div><div><br></div><div>prefetchnta 0x40(%rbx,%rdx,1)</div><div>movzbl (%rbx,%rdx,1),%edx</div><div><br></div><div>The workflow for prefetch cache instrumentation is as follows (the proof of concept script details these steps as well):</div><div><br></div><div>1. 
Build the binary, making sure -gmlt and -fdebug-info-for-profiling are passed. The latter option enables the X86DiscriminateMemOps pass, which ensures that instructions with memory operands are uniquely identifiable (the additional debug information increases total binary size by about 2%).</div><div><br></div><div>2. Collect memory traces and run the analysis to obtain recommendations (see the DynamoRIO-based analyzer demo referenced above as a proof of concept).</div><div><br></div><div>3. Use create_llvm_prof to convert the recommendations so that insertion locations are expressed in terms of debug info locations.</div><div><br></div><div>4. Rebuild the binary with the exact same set of arguments used initially, plus -mllvm -prefetch-hints-file=&lt;file&gt;, where &lt;file&gt; is the afdo file obtained in step 3.</div><div><br></div><div>Note that if sample profiling feedback-driven optimization is also desired, it happens before step 1 above. In this case, the sample profile afdo file that was used to produce the binary at step 1 must also be included in step 4.</div><div><br></div><div>The data the compiler needs in order to identify prefetch insertion points is very similar to what is needed for sample profiles. For this reason, and given that the overall approach (memory tracing-based cache recommendation mechanisms) is under active development, we use the afdo format as a syntax for capturing this information. We avoid confusing its semantics with those of sample profile afdo data by feeding the two types of information to the compiler through separate files and compiler flags. Should the approach prove successful, we can investigate improvements to this encoding mechanism.</div><div><br></div><div><a href="https://reviews.llvm.org/D54052">https://reviews.llvm.org/D54052</a><br></div><div><br></div></div></div>
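<div>As an illustration of the recommendation format described earlier (binary offset, delta, type), the following Python sketch parses one recommendation line and renders the prefetch that would precede the instruction. It is not part of the patch or of any of the tools mentioned: the names Recommendation, parse_recommendation and render_prefetch are hypothetical, and it assumes the three-field line shown in the example and an AT&amp;T-syntax memory operand without a displacement.</div>

```python
from typing import NamedTuple

# Hint types listed in the description; each maps to an x86 prefetch mnemonic.
PREFETCH_TYPES = ("nta", "t0", "t1", "t2")


class Recommendation(NamedTuple):
    """Hypothetical container for one recommendation line."""
    offset: int  # binary offset of the instruction with a memory operand
    delta: int   # byte distance from the address that instruction accesses
    kind: str    # one of PREFETCH_TYPES


def parse_recommendation(line: str) -> Recommendation:
    """Parse one 'offset,delta,type' line, e.g. '0x400ab2,64,nta'.

    Assumes exactly three comma-separated fields, as in the example.
    """
    offset_text, delta_text, kind = (field.strip() for field in line.split(","))
    if kind not in PREFETCH_TYPES:
        raise ValueError(f"unknown prefetch type: {kind!r}")
    return Recommendation(int(offset_text, 16), int(delta_text), kind)


def render_prefetch(rec: Recommendation, mem_operand: str) -> str:
    """Render the prefetch for an operand such as '(%rbx,%rdx,1)'.

    Operands that already carry a displacement are out of scope here.
    """
    if not mem_operand.startswith("("):
        raise ValueError("sketch only handles operands without a displacement")
    return f"prefetch{rec.kind} {rec.delta:#x}{mem_operand}"


if __name__ == "__main__":
    rec = parse_recommendation("0x400ab2,64,nta")
    print(render_prefetch(rec, "(%rbx,%rdx,1)"))
```

<div>In the actual workflow the compiler performs the insertion, using the debug-info locations produced by create_llvm_prof; the sketch only shows how the delta becomes the displacement of the prefetch operand.</div>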