[PATCH] D45757: [XRay][profiler] Part 2: XRay Function Call Trie

Mon May 14 02:33:19 PDT 2018

dberris added inline comments.

================
Comment at: compiler-rt/lib/xray/tests/unit/function_call_trie_test.cc:83-91
+  const auto R0 = R[0];
+  ASSERT_NE(R0, nullptr);
+  EXPECT_EQ(R0->CallCount, 1u);
+  EXPECT_EQ(R0->CumulativeLocalTime, 1u);
+
+  const auto R1 = R[1];
+  ASSERT_NE(R1, nullptr);
----------------
kpw wrote:
> The way this test is written, the two roots look exactly identical. There is a possible error case where the trie returns the same root twice that is undetected.
> 
> You might have one function spend more time so that this is detected. Then you might want the test to assert ignoring order. Up to you whether it's worth it.
Good point -- I've instead asserted that R0 and R1 don't have the same function id.

================
Comment at: compiler-rt/lib/xray/tests/unit/function_call_trie_test.cc:140-142
+  EXPECT_EQ(F3.CumulativeLocalTime, 100);
+  EXPECT_EQ(F2.CumulativeLocalTime, 300);
+  EXPECT_EQ(F1.CumulativeLocalTime, 100);
----------------
kpw wrote:
> Imho, this would be easier to interpret if the nodes captured CumulativeTreeTime instead of CumulativeLocalTime.
> 
> Then it would be
> F3 -> 100
> F2 -> 300
> F1 -> 400
I actually thought about counting both, but realised that there's a property here which allows us to count just one of them to derive the other.

I initially thought about it this way:

```
CTT(N) = CLT(N) + sigma(i = 0->N) CTT(callee(N)[i])
```

is equivalent to:

```
CLT(N) = CTT(N) - sigma(i = 0->N) CTT(callee(N)[i])
```

Where `CTT` is "Cumulative Tree Time" and `CLT` is "CumulativeLocalTime". Can we prove that this property holds, and that measuring just `CTT` is sufficient to derive `CLT`?

To do this properly I'll introduce a notation:

```
(f -> f') @ t[n+0]
(f <- f') @ t[n+1]
```

Where `f` and `f'` are function IDs, and `->` represents a "calls" relationship, '<-' represents an "exits" relationship, and `t` is a timestamp (we denote `n` to be the order of timestamps by appearance). Given the following sequence of events:

```
(f1 -> f2) @ t[0]
(f2 -> f3) @ t[1]
(f2 <- f3) @ t[2]
(f2 -> f3) @ t[3]
(f2 <- f3) @ t[4]
(f1 <- f2) @ t[5]
```

Here, if we think about the cumulative local times, we might think that:

```
CTT(f1) = CTT(f2) + CLT(f1)
CTT(f2) = CTT(f3) + CLT(f2)
CTT(f3) = 0 + CLT(f3)
```

As per formula above.

If we expand this:

```
CTT(f3) = 0 + (t[4] - t[3]) + (t[2] - t[1])
CTT(f2) = CTT(f3) + (t[5] - t[4]) + (t[3] - t[2]) + (t[1] - t[0])
CTT(f1) = CTT(f2) + 0
```

Let's expand it further:

```
CTT(f2) = 0 + (t[4] - t[3]) + (t[2] - t[1]) + (t[5] - t[4]) + (t[3] - t[2]) + (t[1] - t[0])

CTT(f2) = (t[5] - t[4]) + (t[4] - t[3]) + (t[3] - t[2]) + (t[2] - t[1]) + (t[1] - t[0]) + 0

CTT(f2) = t[5] - t[0]
```

This is what we'd expect for computing `CTT` from `CLT`. Can we do the reverse though?

```
CLT(f2) = CTT(f2) - CLT(f3)
CLT(f2) = CTT(f2) - (CTT(f3) + 0)
CLT(f2) = (t[5] - t[0]) - (0 + (t[4] - t[3]) + (t[2] - t[1]) + 0)
CLT(f2) = (t[5] - t[0]) - ((t[4] - t[3]) + (t[2] - t[1]))
CLT(f2) = (t[5] - t[4]) + (t[3] - t[2]) + (t[1] - t[0])
```

QED

There's an argument for doing either, but we make the trade-off to covering Cumulative Local Time instead at runtime for the following reasons:

- Using CLT reduces the risk of us overflowing counters.
- We need CLT anyway for generating the histogram of latency for a particular function in the stack context.
- CLT allows us to better account for when we're un-winding the stack in case we find an exit for a function that was entered "higher up".

Does that make sense?

Now I kind-of want to write up that formula somewhere more persistent, rather than just in a review thread. :)

================
Comment at: compiler-rt/lib/xray/tests/unit/function_call_trie_test.cc:190-192
+  auto A = FunctionCallTrie::InitAllocators();
+  FunctionCallTrie T0(A);
+  FunctionCallTrie T1(A);
----------------
kpw wrote:
> Is sharing an initialized allocator expected to be OK? Why not use the same allocator for FunctionCallTrie Merged below then?
Yes. The reason we're not using the same allocator for the FunctionCallTrie merging test, is because we want to make sure that functionality works -- because we do that when we're transferring the FunctionCallTrie from a thread to the central service (in the next patch). We want to have thread-local allocators, then the central storage service will use a single FunctionCallTrie for the "merged" version of the FunctionCallTrie's for all the threads.

================
Comment at: compiler-rt/lib/xray/xray_function_call_trie.h:318
+      TopNode->CallCount++;
+      TopNode->CumulativeLocalTime += LocalTime - CumulativeTreeTime;
+      CumulativeTreeTime += LocalTime;
----------------
kpw wrote:
> To me, it is less intuitive to capture CumulativeLocalTime than CumulativeTreeTime in each Node. With one you can compute the other given the callees, but if I'm analyzing function latencies, I care about time spent in the callees just as much.
> 
> Why did you make this choice?
Explained in the comment above... with some math/proof. :)

================
Comment at: compiler-rt/lib/xray/xray_function_call_trie.h:336-337
+  //
+  // This operation will *not* destroy the state in `O`, and thus may cause some
+  // duplicate entries in `O` if it is not empty.
+  void deepCopyInto(FunctionCallTrie &O) const {
----------------
kpw wrote:
> Should this be called with non-empty destinations? Can we just CHECK it is not. Having duplicate roots stinks.
Yes, fixed with a DCHECK.

================
Comment at: compiler-rt/lib/xray/xray_function_call_trie.h:357
+      Stack DFSStack(StackAllocator);
+      DFSStack.Append(NodeAndParent{Root, NewRoot});
+      while (!DFSStack.empty()) {
----------------
kpw wrote:
> Do you have to handle the Allocator failing here with a null check? Are you just relying on those cases artificially pruning the traversal?
Good question. I'm definitely relying on the artificial pruning. Writing a comment as a TODO to figure out what to do in case of failure.

================
Comment at: compiler-rt/lib/xray/xray_function_call_trie.h:380
+  // data from the original (current) trie, along with all its callees.
+  void mergeInto(FunctionCallTrie &O) const {
+    for (const auto Root : getRoots()) {
----------------
kpw wrote:
> What's the thread safety of this and deepCopy? It seems like they  shouldn't be called when functions are being intercepted. How can we make sure that invariant is preserved?
Good question. They are thread-compatible (will need external synchronisation).

================
Comment at: compiler-rt/lib/xray/xray_profiler_flags.cc:35
+#define XRAY_FLAG(Type, Name, DefaultValue, Description)                       \
+  RegisterFlag(P, #Name, Description, &F->Name);
+#include "xray_profiler_flags.inc"
----------------
kpw wrote:
> Is the "#Name" intentional? I'm not very fluent in macros, but that popped out.
Yes, this turns the argument into a string.

https://reviews.llvm.org/D45757