<div dir="ltr"><div class="gmail_quote"><div dir="ltr">On Mon, Jul 4, 2016 at 6:27 PM David Chisnall <<a href="mailto:david.chisnall@cl.cam.ac.uk">david.chisnall@cl.cam.ac.uk</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 4 Jul 2016, at 06:50, Dean Michael Berris via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>> wrote:<br>

><br>

> We've looked at the following alternatives, and we're looking to the community for feedback on both the current implementation and these alternatives.<br>

<br>

I don’t think that I’ve yet seen an explanation of why you need the NOPs.  DTrace stopped using them a long time ago, for two reasons:<br>

<br>

1) The increased code size caused a noticeable increase in i-cache misses, even when instrumentation was not actively being used.  This caused a noticeable probe effect (macroscopic observable performance artefacts even when no probes are active) and caused a lot of push-back in adoption.<br>

<br>

2) On all of the architectures where we support DTrace (currently, I believe, x86, x86-64, AArch32, AArch64, MIPS64, and RISC-V) it’s possible to do the same thing by moving one of the instructions in the function prolog into the generated trampoline for the instrumentation.<br>

<br>

I could understand wanting something more like patchpoints if you want to be able to instrument in the middle of a function (along the lines of TESLA or CSI), but if you’re just tracing function entry and exit then it doesn’t seem like the best solution.<br>

<br></blockquote><div><br></div><div>Thanks for the questions, David -- the short version of the answer is that DTrace (last I checked) requires some help from the kernel, while XRay is self-contained in the application.</div><div><br></div><div>All of your points above are valid, and DTrace is a really powerful tool for debugging a lot of performance issues. XRay has a few things that differentiate it from systems like DTrace, though:</div><div><br></div><div>1) Because we insert the instrumentation sleds only in functions that meet certain criteria (i.e. more selectively) instead of instrumenting every function, we pay the cost of dormant instrumentation only in the functions that are instrumented. The front-end changes that add attributes/annotations to force or inhibit instrumentation give control to the application developer and let us limit the cost along a spectrum -- full coverage costs more, selective coverage can be tuned, and explicit annotations provide precise control of the instrumentation.</div><div><br></div><div>2) The cost of the instrumentation at run-time is on the order of 100 cycles for the "null-logging" case (mov + trampoline jump, atomic load and check if not zero). When instrumentation is on, all of its cost is within the process' address space (in-memory log) -- no additional overheads external to the application.</div><div><br></div><div>3) The runtime implementation for logging described in the white paper allows us to balance the coverage (number of instrumentation events we get) against overheads (the amount of resources used in the logging implementation). Because we log only very specific things (function id, TSC deltas in most cases, type of event) and have heuristics to condense the information we keep (e.g. if the time between an entry-exit pair is under epsilon, we can omit the entry entirely), we don't need to be quite as complete when logging and instead move a lot of the logic into reconstruction/analysis of the generated traces.</div><div><br></div><div>There are certainly other approaches to doing selective instrumentation and then externally signalling/trapping (with environment support) when probing. XRay moves the needle towards having the instrumentation, the collection, and even the signalling inside the application. This makes sense if you're deploying the application on a system that doesn't have DTrace and you still want to be able to isolate the costs of instrumentation to just the application.</div><div><br></div><div>I'll admit that, to be able to provide a more complete answer, I'll need to read a lot more about how DTrace manages to keep the costs of probes low enough that it can be turned on dynamically without stopping the process, and without having to intercept more events than actually necessary (i.e. only on certain functions, and only when it's on).</div><div><br></div><div>Does this help?</div><div><br></div><div>Cheers</div></div></div>