[PATCH] D45757: [XRay][profiler] Part 2: XRay Function Call Trie

Dean Michael Berris via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon May 14 17:26:27 PDT 2018


dberris added inline comments.


================
Comment at: compiler-rt/lib/xray/tests/unit/function_call_trie_test.cc:140-142
+  EXPECT_EQ(F3.CumulativeLocalTime, 100);
+  EXPECT_EQ(F2.CumulativeLocalTime, 300);
+  EXPECT_EQ(F1.CumulativeLocalTime, 100);
----------------
kpw wrote:
> dberris wrote:
> > kpw wrote:
> > > Imho, this would be easier to interpret if the nodes captured CumulativeTreeTime instead of CumulativeLocalTime.
> > > 
> > > Then it would be
> > > F3 -> 100
> > > F2 -> 300
> > > F1 -> 400
> > I actually thought about counting both, but realised that there's a property here which allows us to count just one of them and derive the other from it.
> > 
> > I initially thought about it this way:
> > 
> > ```
> > CTT(N) = CLT(N) + sigma(c in callees(N)) CTT(c)
> > ```
> > 
> > is equivalent to:
> > 
> > ```
> > CLT(N) = CTT(N) - sigma(c in callees(N)) CTT(c)
> > ```
> > 
> > Where `CTT` is "Cumulative Tree Time" and `CLT` is "Cumulative Local Time". Can we prove that this property holds, and that measuring just one of them is sufficient to derive the other?
> > 
> > To do this properly I'll introduce a notation:
> > 
> > ```
> > (f -> f') @ t[n+0]
> > (f <- f') @ t[n+1]
> > ```
> > 
> > Where `f` and `f'` are function IDs, `->` represents a "calls" relationship, `<-` represents an "exits" relationship, and `t[n]` is a timestamp (the index `n` reflects the order in which the events appear). Given the following sequence of events:
> > 
> > ```
> > (f1 -> f2) @ t[0]
> > (f2 -> f3) @ t[1]
> > (f2 <- f3) @ t[2]
> > (f2 -> f3) @ t[3]
> > (f2 <- f3) @ t[4]
> > (f1 <- f2) @ t[5]
> > ```
> > 
> > Here, if we reason in terms of cumulative local times, we have:
> > 
> > ```
> > CTT(f1) = CTT(f2) + CLT(f1)
> > CTT(f2) = CTT(f3) + CLT(f2)
> > CTT(f3) = 0 + CLT(f3)
> > ```
> > 
> > As per the formula above.
> > 
> > If we expand this:
> > 
> > ```
> > CTT(f3) = 0 + (t[4] - t[3]) + (t[2] - t[1])
> > CTT(f2) = CTT(f3) + (t[5] - t[4]) + (t[3] - t[2]) + (t[1] - t[0])
> > CTT(f1) = CTT(f2) + 0
> > ```
> > 
> > Let's expand it further:
> > 
> > ```
> > CTT(f2) = 0 + (t[4] - t[3]) + (t[2] - t[1]) + (t[5] - t[4]) + (t[3] - t[2]) + (t[1] - t[0])
> > 
> > CTT(f2) = (t[5] - t[4]) + (t[4] - t[3]) + (t[3] - t[2]) + (t[2] - t[1]) + (t[1] - t[0]) + 0
> > 
> > CTT(f2) = t[5] - t[0]
> > ```
> > 
> > This is what we'd expect for computing `CTT` from `CLT`. Can we do the reverse though?
> > 
> > ```
> > CLT(f2) = CTT(f2) - CTT(f3)
> > CLT(f2) = CTT(f2) - (CLT(f3) + 0)
> > CLT(f2) = (t[5] - t[0]) - (0 + (t[4] - t[3]) + (t[2] - t[1]) + 0)
> > CLT(f2) = (t[5] - t[0]) - ((t[4] - t[3]) + (t[2] - t[1]))
> > CLT(f2) = (t[5] - t[4]) + (t[3] - t[2]) + (t[1] - t[0])
> > ```
> > 
> > QED
> > 
> > There's an argument for doing either, but we make the trade-off of tracking Cumulative Local Time at runtime instead, for the following reasons:
> > 
> > - Using CLT reduces the risk of us overflowing counters.
> > - We need CLT anyway for generating the histogram of latency for a particular function in the stack context.
> > - CLT allows us to better account for when we're unwinding the stack, in case we find an exit for a function that was entered "higher up".
> > 
> > Does that make sense?
> > 
> > Now I kind-of want to write up that formula somewhere more persistent, rather than just in a review thread. :)
> Thanks for the detailed response. I agree with all of your points except for the claim that CLT is the right choice for latency histograms.
> 
> Responding to your points in turn:
> 
> 1. Yes, each of CLT and CTT can be derived from the other.
> 2. Overflow is less likely with CLT, though I haven't done the math for a typical CPU to approximate the risk for a 32-bit uint.
> 3. For either CLT or CTT, the "higher up" exit is handled the same way: as if there were simultaneous exits for all the functions until the matching function id.
> 
> And finally, latency histograms of CTT can't be used to derive CLT and vice-versa. This is where we actually have to make a choice between the two or compute both.
> 
> Imagine a function "processRpc(const MyRequest&, MyResponse*)". When I want to know the 95th percentile of its latency, I am interested in CTT, not CLT. All the time spent in lower levels of the stack affects the observed latency. Callers care about how long the function blocks their execution. The person trying to optimize a function may care about CLT when choosing which code path to target.
> 
> Let's cross this bridge when we add histograms.
> And finally, latency histograms of CTT can't be used to derive CLT and vice-versa. This is where we actually have to make a choice between the two or compute both.

Yes!

> Let's cross this bridge when we add histograms.

Agreed.

Thanks!
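
For anyone who wants to sanity-check the algebra in the quoted derivation, here is a minimal standalone sketch. It is plain C++ and not part of the XRay sources; the timestamp values are made up (only their ordering matters), and the helper names are hypothetical. It accumulates CLT directly from the entry/exit pairs in the example sequence, derives CTT bottom-up via CTT(N) = CLT(N) + sigma(c in callees(N)) CTT(c), and checks both directions of the relationship.

```
// Standalone sketch (not XRay code): verify that per-call local times (CLT)
// plus the recurrence CTT(N) = CLT(N) + sum over callees of CTT(c) reproduce
// the cumulative tree times from the derivation above.
#include <cassert>
#include <cstdint>

int main() {
  // Timestamps t[0]..t[5] for the event sequence in the thread; the concrete
  // values are arbitrary, only their ordering matters.
  const uint64_t t[6] = {0, 10, 40, 70, 90, 100};

  // CLT accumulated directly from the entry/exit pairs:
  //   f3: (t[2] - t[1]) + (t[4] - t[3])
  //   f2: (t[1] - t[0]) + (t[3] - t[2]) + (t[5] - t[4])
  //   f1: no local execution is recorded in this sequence.
  const uint64_t CLT_f3 = (t[2] - t[1]) + (t[4] - t[3]);
  const uint64_t CLT_f2 = (t[1] - t[0]) + (t[3] - t[2]) + (t[5] - t[4]);
  const uint64_t CLT_f1 = 0;

  // Derive CTT bottom-up: CTT(N) = CLT(N) + sum of CTT over N's callees.
  const uint64_t CTT_f3 = CLT_f3;
  const uint64_t CTT_f2 = CLT_f2 + CTT_f3;
  const uint64_t CTT_f1 = CLT_f1 + CTT_f2;

  // The derived tree time for f2 telescopes to the full span t[5] - t[0].
  assert(CTT_f2 == t[5] - t[0]);
  assert(CTT_f1 == t[5] - t[0]);

  // The reverse direction recovers CLT from CTT, as in the derivation.
  assert(CLT_f2 == CTT_f2 - CTT_f3);
  return 0;
}
```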


================
Comment at: compiler-rt/lib/xray/xray_function_call_trie.h:380
+  // data from the original (current) trie, along with all its callees.
+  void mergeInto(FunctionCallTrie &O) const {
+    for (const auto Root : getRoots()) {
----------------
kpw wrote:
> dberris wrote:
> > kpw wrote:
> > > What's the thread safety of this and deepCopy? It seems like they shouldn't be called when functions are being intercepted. How can we make sure that invariant is preserved?
> > Good question. They are thread-compatible (will need external synchronisation).
> I'm curious how this will work for the data collector service, i.e. how it will ensure that function tracing won't interfere with flushing.
Yeah, Part 3 actually does external synchronisation to ensure that when posting, we're making a copy and merging it into a global trie. There's some interesting work happening there, but not as interesting as the stuff we're doing here. ;)
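
To make the "thread-compatible, needs external synchronisation" contract concrete, here is a minimal sketch under stated assumptions: `ToyTrie`, `GlobalTrie`, `GlobalMutex`, and `workerThread` are hypothetical names, and the real runtime does not use `std::mutex` or `std::map`. The only point it illustrates is that `mergeInto` does no locking of its own, so callers must serialise it against the handlers (as Part 3 does when posting to a global trie).

```
// Illustrative sketch only: a simplified stand-in for FunctionCallTrie to
// show the "thread-compatible + external synchronisation" pattern discussed
// above. Names and containers here are hypothetical, not the XRay runtime's.
#include <cstdint>
#include <map>
#include <mutex>
#include <thread>
#include <vector>

struct ToyTrie {
  // Function id -> cumulative local time, flattened for brevity.
  std::map<int32_t, uint64_t> CumulativeLocalTimes;

  // Thread-compatible: no internal locking, callers must serialise access.
  void mergeInto(ToyTrie &O) const {
    for (const auto &Entry : CumulativeLocalTimes)
      O.CumulativeLocalTimes[Entry.first] += Entry.second;
  }
};

ToyTrie GlobalTrie;
std::mutex GlobalMutex; // External synchronisation guarding GlobalTrie.

// Each thread accumulates into its own trie with no locking at all, then
// merges into the global trie under the lock when it posts/flushes.
void workerThread(int32_t FuncId) {
  ToyTrie Local;
  Local.CumulativeLocalTimes[FuncId] += 100;

  std::lock_guard<std::mutex> Guard(GlobalMutex);
  Local.mergeInto(GlobalTrie);
}

int main() {
  std::vector<std::thread> Threads;
  for (int32_t Id = 1; Id <= 4; ++Id)
    Threads.emplace_back(workerThread, Id);
  for (auto &T : Threads)
    T.join();
  return 0;
}
```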


https://reviews.llvm.org/D45757




