[llvm-dev] [XRay] Build instrumented Clang, some analysis results

Wed Jul 20 04:39:26 PDT 2016

> On 20 Jul 2016, at 20:58, C Bergström <cbergstrom at pathscale.com> wrote:
> 
> On Wed, Jul 20, 2016 at 6:26 PM, Dean Michael Berris
> <dean.berris at gmail.com> wrote:
>> 
>>> On 20 Jul 2016, at 20:02, C Bergström <cbergstrom at pathscale.com> wrote:
>>> 
>>> How much is this tied to something specific about Linux or it could be
>>> easily ported to another platform?
>> 
>> Currently, the only Linux-specific part I can remember is getting the CPU frequency (looking at sysfs files). That can be implemented in a platform-agnostic (or at least pluggable and portable) manner.
>> 
>> There are x86'isms and I'm working on understanding how to do this in Aarch64 or ARM.
> 
> Ack - actually x86 probably makes some of this a lot easier. I'm
> recently (frequently) annoyed (as hell) with how AArch64 isn't
> exposing a bunch of basic things that I *want* (demand!) to know.
> 
> For example:
> clock frequency on AArch64 + Linux == forget it. I had to use a
> benchmark in order to basically brute force calculate some processors.
> (They don't hard code it in /proc/cpuinfo or sys as you'd want) and
> I'm really uncertain about what happens if there's stepping involved
> (current AArch64 processors that I'm aware of don't have this feature
> though) /* Maybe Google can help kick the linux devs into accepting
> these patches */
> 

I actually haven't gotten that far yet, to be honest -- I was still just trying to learn how to do the runtime patching and the instrumentation sleds faster. :)

But it is good to know what other kinds of things I might run into when we cross that bridge. :D

> For FBSD and iOS - I don't know how/if they expose this information..
> (Is FBSD ported to AArch64 yet.. ?)

I have no idea about FreeBSD. :/

> 
>> 
>>> 
>>> What's the benefit of this vs a stable and production ready tool like Dtrace?
>>> 
>> 
>> I think I've pointed out the differences in a separate mail (some mail filters may have squashed that response, so apologies if that was missed): http://lists.llvm.org/pipermail/llvm-dev/2016-July/101922.html -- the short version is:
>> 
>> - Dtrace requires kernel-side support.
>> - XRay is completely in-process and controllable by the process through an API (not sure if dtrace is the same).
>> - XRay is selective and configurable by the application developer.
>> - XRay's cost is borne by the application only, and does not require stopping the application.
> 
> Just as you're instrumenting around functions - DTrace can similarly
> inject "probes" (basically the same thing) - The other more common way
> for DTrace to be used is for the application to not be changed and
> it's just profiled. (Ok you must leave SP otherwise it won't work.. so
> for the purist I guess you're relying on applications not to be
> /fully/ optimized.. I forget if DWARF or CFW is required, but I don't
> think so )
> 

XRay doesn't rely on DWARF, and has a separate section for the instrumentation maps. That section can also be removed from the final binary and loaded externally (that feature isn't implemented yet, but I'm working on making that happen). XRay also works even if the frame pointer is omitted, which is a nice property. :)

> Dtrace also doesn't require stopping the application fwiw and you can
> control probably a lot more of what's probed/instrumented. (There's a
> full scripting language in order to control what you instrument
> actually)
> 

I'm aware that Dtrace can do what XRay does and more. I'm not so sure about the technical details of how it does some of it, though -- for example, XRay doesn't sample anything; it logs events for offline reconstruction/analysis.

> I'm not trying to take away from X-Ray, I think profiling is extremely
> important, but I'm just wondering how much (if any) evaluation of
> existing solutions was done.

That's fair -- XRay was developed at Google a long time ago, when Dtrace wasn't available. Our internal implementation has a lot of... internal'isms which integrate well into our... internal stuff. :)

The landscape has changed considerably, though, between when XRay was developed and when we decided to open-source an implementation of it. For example, clang wasn't even on the radar when some of the work on XRay started happening. Certainly there are lots of ways of doing this now, but the goal, at least for XRay, is that we can:

- Widen the set of platforms where we can use it. Linux+x86 is the "feature parity" point for us at least. And we're certainly interested in a lot more platforms now.

- Have better hooks for making it more efficient. LLVM IR and the optimisation pass and analysis infrastructure give us much more leeway to be smarter about certain instrumentation decisions.

- Make it more useful than just what our use-case has been. For example, we do performance analysis on long-running servers and want to be able to turn instrumentation on for only a certain period of time (not for the whole lifetime of the application). The logging implementation we have internally (that we're bringing out in the open) has a lot of cleverness to get the tradeoff between cost and coverage "just right" for our use-cases. We also recognise that there are other cases where getting a full execution trace (not sampled traces, not sampled profiles) makes sense, e.g. for easier performance debugging of compilers, command-line tools, and other classes of applications.

> Maybe the DTrace licensing, CTF dep or
> linux support was a dealbreaker, I just hate to see NIH when there's
> good tools available that cover a significant amount of the needs or
> more.
> 

There are a couple of other things -- like the cost when instrumentation is on. Dtrace currently requires a mode switch when a probe is encountered, which is a non-trivial cost for the kinds of applications we've been debugging. That cost, plus the potential to affect other applications/systems while Dtrace is enabled (and frankly the kinds of things you *can* do with it), makes it very hard to use on at least some of the systems where we've used XRay for debugging.

/me is also not a fan of NIH. :)

> At the end of the day it's probably quite complementary, like all the
> work you do for X-Ray, someone could likely leverage to automatically
> inject DTrace probes and get a lot of the same stuff.

Agreed. Also consider the non-Linux systems, and those that have stricter requirements on resource consumption, etc. :)

> 
> In your response you mentioned "O(100) cycles" - is that 100
> instructions of "skid" between point of measurement? (Seems really
> high for instrumenting, but maybe I'm mistaken..)
> 

That's CPU cycles, and mostly the following:

For entry points:
- calling into a trampoline (relative jump)
- saving register states
- checking a global value if it's null (the logging intercept function pointer)
- loading register states
- returning

For exit points:
- jumping into a trampoline (relative jump)
- saving a couple of registers
- checking a global value if it's null (the logging intercept function pointer)
- loading register states
- returning
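
To make the "check a global, call it if set" part concrete, here's a rough C++ sketch of what the trampoline body boils down to once the registers are saved. All the names here (Handler, LogHandler, trampolineBody, countingHandler) are made up for illustration and are not the names in the patches; the real trampolines are assembly, so this only models the control flow.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical names throughout; this models the control flow only.
enum class EntryType : std::uint8_t { Entry = 0, Exit = 1 };
using Handler = void (*)(std::int32_t FuncId, EntryType Type);

// The global the trampoline checks; null means "logging is off".
std::atomic<Handler> LogHandler{nullptr};

// What happens between saving and restoring registers: load the pointer
// once and call through it only if it is set.
inline void trampolineBody(std::int32_t FuncId, EntryType Type) {
  Handler H = LogHandler.load(std::memory_order_acquire);
  if (H != nullptr)
    H(FuncId, Type);
}

// Example handler that just counts events, standing in for real logging.
std::atomic<std::uint64_t> EventCount{0};
void countingHandler(std::int32_t, EntryType) {
  EventCount.fetch_add(1, std::memory_order_relaxed);
}

// Usage: events are only observed while a handler is installed.
inline std::uint64_t demo() {
  trampolineBody(1, EntryType::Entry); // handler is null: nothing recorded
  assert(EventCount.load() == 0);
  LogHandler.store(countingHandler, std::memory_order_release);
  trampolineBody(1, EntryType::Entry);
  trampolineBody(1, EntryType::Exit);
  LogHandler.store(nullptr, std::memory_order_release);
  trampolineBody(2, EntryType::Entry); // off again: nothing recorded
  return EventCount.load();
}
```

When the pointer is null, the cost is just the jump, the register save/restore, and one load plus a branch, which is where most of those cycles go.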

Now the logging intercept function should be tuned to do as little as possible. The implementation in the patches I mentioned to compiler-rt uses thread_local buffers and attempts to do "as little as possible" to write out the fixed-size log entries. Basically, the cost is that of reading the TSC plus some stores.
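
As a sketch of that idea (not the code in the patches), a thread_local buffer of fixed-size records could look roughly like the following. The field layout, the buffer capacity, and the names Record, ThreadBuffer, readTimestamp and logEvent are all assumptions for illustration; only the 32-byte record size is taken from the numbers mentioned later in this mail. The timestamp uses std::chrono as a portable stand-in for reading the TSC.

```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>

// Hypothetical layout; only the 32-byte total comes from this thread.
struct Record {
  std::uint64_t TSC;    // timestamp
  std::int32_t FuncId;  // which function was entered/exited
  std::uint8_t Type;    // 0 = entry, 1 = exit
  std::uint8_t CPU;     // not tracked in this sketch
  std::uint8_t Padding[18];
};
static_assert(sizeof(Record) == 32, "records are fixed at 32 bytes");

// Portable stand-in for reading the TSC.
inline std::uint64_t readTimestamp() {
  return static_cast<std::uint64_t>(
      std::chrono::steady_clock::now().time_since_epoch().count());
}

struct ThreadBuffer {
  static constexpr std::size_t Capacity = 1024; // arbitrary for the sketch
  std::array<Record, Capacity> Records;
  std::size_t Used = 0;
  std::size_t Flushes = 0;

  // A real implementation would write Records[0..Used) to the log here.
  void flush() {
    ++Flushes;
    Used = 0;
  }
};

// One buffer per thread, so the hot path needs no locking.
thread_local ThreadBuffer Buffer;

inline void logEvent(std::int32_t FuncId, std::uint8_t Type) {
  if (Buffer.Used == ThreadBuffer::Capacity)
    Buffer.flush();
  assert(Buffer.Used < ThreadBuffer::Capacity);
  Record &R = Buffer.Records[Buffer.Used++];
  R.TSC = readTimestamp();
  R.FuncId = FuncId;
  R.Type = Type;
  R.CPU = 0;
}
```

The hot path here is one timestamp read and a handful of stores into memory the thread already owns; the expensive part (writing the buffer out) only happens once per Capacity events.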

> Lastly and again just side comments - in terms of data formats - JSON
> has pretty good support, streams and compresses nicely, high
> performance parsers exist with liberal licensing as well as I think
> there's a binary version of it. This could be handy *if* your app is
> on Node B and you'd like the logs to be sent to Node A.
> 

Yeah, JSON is definitely one potential format. I've been meaning to use the Chrome profile viewer too. The problem has been the amount of data we're talking about here -- with 32-byte fixed-size records currently per entry/exit, we're already at 606MB for a fully instrumented clang compiling a simple hello-world program (compressed is 81MB). I haven't tried writing this out in JSON, but I suspect it would be several times the size of the fixed-size records. :D

We can probably optimise that further with the stack de-duping support in the JSON format, but that's still a lot of segments/events, even if it could be converted to JSON. :)
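
For a rough feel of the sizes involved, here's a small sketch that renders one entry/exit as a duration event in the Chrome trace-event style ("ph" of "B" or "E") and estimates the record count from the log size. The helper names and the choice of fields are mine for illustration, I'm assuming "606MB" means decimal megabytes, and the function name would have to come from symbolising the function id, which this doesn't do.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// One duration event in the Chrome trace-event style ("B" begins, "E" ends).
// Note: Name is not JSON-escaped here; a real writer would have to do that.
inline std::string toTraceEventJson(const std::string &Name, bool IsEntry,
                                    std::uint64_t TimestampUs,
                                    std::uint32_t Pid, std::uint32_t Tid) {
  return "{\"name\":\"" + Name + "\",\"ph\":\"" + (IsEntry ? "B" : "E") +
         "\",\"ts\":" + std::to_string(TimestampUs) +
         ",\"pid\":" + std::to_string(Pid) +
         ",\"tid\":" + std::to_string(Tid) + "}";
}

// How many fixed-size records fit in a log of the given size.
inline std::uint64_t estimateRecordCount(std::uint64_t LogBytes,
                                         std::uint64_t RecordBytes) {
  assert(RecordBytes > 0);
  return LogBytes / RecordBytes;
}
```

Even with a four-character name and single-digit values, one such event is already longer than a 32-byte record, and estimateRecordCount(606 * 1000 * 1000, 32) comes out at roughly 19 million records for that hello-world compile.
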

> Anywho - cool work..

Thanks! :)