[PATCH] D45758: [XRay][profiler] Part 3: Profile Collector Service

Dean Michael Berris via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Sun May 20 22:22:58 PDT 2018


dberris added inline comments.


================
Comment at: compiler-rt/lib/xray/tests/unit/profile_collector_test.cc:72
+  auto B = profileCollectorService::nextBuffer({nullptr, 0});
+  ValidateBlock(B);
+}
----------------
kpw wrote:
> Would be better (and more difficult) to validate the contents of the block in addition to the header. This doesn't verify that any of the profiling data is correct and consistent with the Trie.
Yeah, unfortunately that's too much coupling to be testing at this point -- my thought was to do that testing on an end-to-end basis, so we can update the format and only verify the essential parts here (i.e. the block size, block number, and thread id). For instance, once we have the profile saving mechanism (in Part 5) and a loader (forthcoming), we can check the details of what's actually in the profile more rigorously.


================
Comment at: compiler-rt/lib/xray/xray_profile_collector.cc:71
+    Item->Trie = reinterpret_cast<FunctionCallTrie *>(
+        InternalAlloc(sizeof(FunctionCallTrie)));
+    new (Item->Trie) FunctionCallTrie(*GlobalAllocators);
----------------
kpw wrote:
> Why should we be using InternalAlloc instead of GlobalAllocators for these? Seems counterintuitive but this is a lot of stuff to keep track of so I could be thinking about it wrong.
We're using a `Vector` here, instead of a segmented array for two reasons:

1. We cannot put non-trivially destructible things in the segmented array.
2. We leverage the contiguous nature of the `Vector` implementation here for when we iterate through the FunctionCallTrie elements.

We're using InternalAlloc(...) here as well so that we don't depend on the restrictions of the managed allocator we're using for the data structures.

Documenting some of these to keep the context.


================
Comment at: compiler-rt/lib/xray/xray_profile_collector.cc:85
+
+  // The Path in this record is in reverse order.
+  PathArray *Path = nullptr;
----------------
kpw wrote:
> Reverse order is not as clear as "ordered from the deepest callee's funcId to the root caller's funcId" or something like that.
> 
> Is the funcId for the Node also also in the array or is it elided?
The path is just the function ids of the functions in a stack trace. This allows us to identify each call stack uniquely, in leaf-to-root order.


================
Comment at: compiler-rt/lib/xray/xray_profile_collector.cc:144-148
+    // Add the sentinel here.
+    constexpr int32_t SentinelFId = 0;
+    NextPtr = static_cast<char *>(
+                  internal_memset(NextPtr, SentinelFId, sizeof(SentinelFId))) +
+              sizeof(SentinelFId);
----------------
kpw wrote:
> Does PathArray have a constant time size method? I've never regretted a length-prefix encoding compared to sentinel termination where possible.
> 
> Edit: Just checked xray_segmented_array.h and size is constant time. I think we can design a seekable format (seek by record) without space or time overhead and I'd encourage a seekable format as it's more flexible.
Yes, PathArray is an `__xray::Array<...>` which stores the size as a member, so O(1) check on the size.

I thought about always having size prefixes, but that makes it much harder to write -- i.e. we'd have to reserve space for the size of the data before writing it. The value of a seekable format isn't that high at this time; we just need a way to store this in a reasonably compact format, and assume it will be read in a streaming fashion.


================
Comment at: compiler-rt/lib/xray/xray_profile_collector.cc:160-164
+    // Add an end of record sentinel here.
+    constexpr uint64_t EndOfRecord = 0x0;
+    NextPtr = static_cast<char *>(
+                  internal_memset(NextPtr, EndOfRecord, sizeof(EndOfRecord))) +
+              sizeof(EndOfRecord);
----------------
kpw wrote:
> Does having a record delimiter actually get us anything meaningful? The format is not seekable by record as is. If we could make it seekable without adding overhead that would be best, otherwise, what do you think about removing the delimiter since it's implied by the position of the end of function ids sentinel?
> 
> Edit: Block header has total size. That might be sufficient and we can decide whether to remove record delimiter or keep it as sanity check framing.
Good point. Removing the end-of-record sentinel gets us more bytes.


================
Comment at: compiler-rt/lib/xray/xray_profile_collector.cc:258-259
+XRayBuffer nextBuffer(XRayBuffer B) {
+  // FIXME: Find a way to identify the buffers, somehow, from within this
+  // function.
+  SpinMutexLock Lock(&GlobalMutex);
----------------
kpw wrote:
> I don't understand what's meant here. :(
Oops -- yeah, that's fixed. We can now tell which block number the buffer represents. :)


================
Comment at: compiler-rt/lib/xray/xray_profile_collector.h:48
+/// process their own thread-local FunctionCallTrie instances, we're removing
+/// the need for synchronisation across threads while we're profiling.
+/// However, once we're done profiling, we can then collect copies of these
----------------
kpw wrote:
> This pushes the need to synchronize into the caller that controls when the state can change from FINALIZED to INITIALIZED.
> 
> It's too big of a change to introduce a new state, but we should either.
> 1. Point out that this synchronization is necessary by the caller in this documentation or
> 2. Have the __xray_init or whatever calls fail when this processing is occurring.
We already do this -- the "caller" in this case will be the XRay implementation that's using this service. In particular, the only user of this is the profiling mode implementation.

This function doesn't interfere with any of the higher-level states -- all the synchronisation it needs is internal. The documentation is saying that callers of this function only need to worry about synchronising the FunctionCallTrie they have access to, if they need to. The implementation we have in profiling mode uses thread-local FunctionCallTrie objects, which means the thread that owns a trie doesn't need to synchronise anything.

The next patch implements the finalization routine, which deals directly with this service's APIs "correctly".

Is there something else that will make this documentation more clear from that perspective?


================
Comment at: compiler-rt/lib/xray/xray_profile_collector.h:67
+///   - list of records:
+///     - function ids in reverse call order, from leaf to root, terminated by
+///       0 (32 bits per function id)
----------------
kpw wrote:
> "reverse call order" is unclear for a Trie. Do functions with multiple callees have duplicate entries? There are no parent pointers described here.
> 
> Edit: After reading through the CL, I realized I misunderstood the rest of the data (such as call count, etc.) to be associated with each entry in the PathArray, while in reality I think it's just the function ids. I think maybe this could be cleared up by using proto like language while keeping the indentation. E.g. 
> 
> repeated record:
>    - repeated function_id ordered from callee to root:
>    - call count
>    - cumulative ...
> 
> There's nothing wrong with the current comment, but I misunderstood it and looking to clarify.
This essentially represents the data associated per unique stack we've encountered. I think you got it, but it's unclear how to change this comment to make it clearer. A "path" is essentially the stack trace in leaf-to-root order, which is guaranteed unique in a prefix tree.


https://reviews.llvm.org/D45758




