[PATCH] D44620: [XRay][compiler-rt] XRay Profiling Mode

Sun Apr 15 20:13:01 PDT 2018

dberris added inline comments.

================
Comment at: compiler-rt/lib/xray/tests/unit/allocator_test.cc:36
+  auto B1 = A.Allocate();
+  (void)B1;
+  auto B2 = A.Allocate();
----------------
pelikan wrote:
> ASSERT_EQ(A.Counter, 1);
This is not a valid assertion, since `Counter` is not relevant to the allocator's observable properties.

================
Comment at: compiler-rt/lib/xray/tests/unit/allocator_test.cc:39
+  ASSERT_EQ(B2.Data, nullptr);
+}
+
----------------
pelikan wrote:
> ASSERT_EQ(A.Counter, 1);
Same.

================
Comment at: compiler-rt/lib/xray/tests/unit/function_call_trie_test.cc:39-40
+
+  Trie.enterFunction(1, 1);
+  Trie.exitFunction(1, 2);
+
----------------
pelikan wrote:
> For readability, can we have the TSCs to be 100, 200, 300 etc.?  Now the numbers look the same.  (and I see a test below does that already)
Can you explain better why using 100, 200, 300 as opposed to 1, 2, 3 is better?

================
Comment at: compiler-rt/lib/xray/tests/unit/function_call_trie_test.cc:91-92
+
+// While missing an intermediary entry may be impossible in practice, we still
+// enforce that we can handle the case where we've missed the entry event
+// somehow, in between call entry/exits. To illustrate, imagine the following
----------------
pelikan wrote:
> f0 → f1 → setjmp ↓(f1→f0) longjmp ↓(f1↓f0) longjmp ↓(f1↓f0) should generate lots of exits but only one entry.  Or when the profiling starts in a signal handler.
> 
> So it shouldn't be "impossible", just "infrequent" :-)
Right, changed to "rare".

For now, we've not made special support for setjmp/longjmp instrumentation. While there's a possibility we can do that in the future, we're not counting on being able to differentiate that for now.

The signal handler case is precisely the one we're looking to support here, but that's just one case.

================
Comment at: compiler-rt/lib/xray/tests/unit/function_call_trie_test.cc:145-150
+  Trie.enterFunction(1, 0);
+  Trie.enterFunction(2, 1);
+  Trie.exitFunction(2, 2);
+  Trie.enterFunction(3, 3);
+  Trie.exitFunction(3, 4);
+  Trie.exitFunction(1, 5);
----------------
pelikan wrote:
> Same, please use multiplies of 100 for TSCs.  I wonder whether we should test the TSC time series not being non-decreasing due to TSC mismatches when rescheduling among poorly synchronized CPU packages.
Good point. I'll leave a TODO for that. In particular we actually need to keep track of the CPU ID instead of just the TSC when we're building the shadow stack. That will let us track the migration of the thread(s).

It's still not clear to me why using multiples of 100 is important. These are just arbitrary numbers anyway, it shouldn't matter what order of magnitude they are.

================
Comment at: compiler-rt/lib/xray/tests/unit/profile_collector_test.cc:21
+TEST(ProfileCollectorServiceTest, PostSerializeCollect) {
+  // The most basic use-case (the one we actually only care about) is the one
+  // where we ensure that we can post FunctionCallTrie instances, which are then
----------------
pelikan wrote:
> did you mean: "the only one we actually care about"?
> 
> If we "only care" about it, what more can we do about it? :-)
I don't see the difference. English is hard.

"the only one we care about" == "the one we actually only care about"

There are other use-cases for this collection API, some of which we don't cover in this unit test (yet).

In particular, we could be collecting snapshots of the function call trie for a function every so often, and associating a timestamp to that, so we can show profiles over time instead of a single profile.

================
Comment at: compiler-rt/lib/xray/tests/unit/profile_collector_test.cc:47
+
+  ASSERT_NE(B.Data, nullptr);
+}
----------------
pelikan wrote:
> I would at least assert the buffer's size is within some reasonable bounds - has the trailing bit and a function list.  Maybe also the zero sentinels in places.
Some of these details aren't really relevant to the unit test. For example:

- The size of the block is dependent on how we've decided to serialise the data. Yes we can assert that the size is not zero (doing that now).
- The function list could be in any order. It's not a relevant feature of the API. What we care about is that we're able to get the data. The concern of parsing this data is not really at this level of the unit test (I'd rather we have an actual end-to-end test that would get this information).
- We're testing that we can get a buffer that's not the empty buffer, which tells us enough information to say that this API in particular is holding its promises based on the preconditions and postconditions.

================
Comment at: compiler-rt/lib/xray/tests/unit/segmented_array_test.cc:71
+
+TEST(SegmentedArrayTest, Iterators) {
+  Array<TestData> data;
----------------
pelikan wrote:
> Please also test what happens when you do
> 
> Array <TestData> data;
> auto it = data.begin();
> it--;
> 
> Because I think you'll find Offset will be SIZE_MAX.  Not sure we want that.
That's technically testing for undefined behaviour -- i.e. outside of the contract of the container. :)

================
Comment at: compiler-rt/lib/xray/xray_allocator.h:12-13
+//
+// Defines the allocator interface for an arena allocator, used primarily for
+// the profiling runtime.
+//
----------------
pelikan wrote:
> I would at least add a TODO for adding support for replacing this allocator with any of the security-checking ones.  (ubsan?  efence?  valgrind?  I keep forgetting the names of them.  compiler-rt IIRC has some "secure" allocator too.)
> 
> We don't want to become another OpenSSL.
We already do that, because we're relying on the underlying allocator for sanitizer_common. There's already a way to provide alternate implementations of those in that regard. All the backing store we have is gotten from sanitizer_common as opposed to using our own calls to mmap directly.

================
Comment at: compiler-rt/lib/xray/xray_function_call_trie.h:116-117
+    NodeRefArray Callees;
+    int64_t CallCount;
+    int64_t CumulativeLocalTime;
+    int32_t FId;
----------------
pelikan wrote:
> Why are these not unsigned?  There's no such thing as a negative count or negative time spent in a function.  ShadowStackEntry has the time as u64.
> 
> (I'd make the function ID unsigned too.)
There are some potentially subtle issues with using unsigned in these variables. Some of them are:

* Forcing a value to be unsigned causes the compiler to implement modular arithmetic, even if we don't ever expect that values will wrap-around.
* Doing zero-sign extension is not cheap. We want to make these values as cheap as possible to update.

Also, we cannot make the function ID unsigned, because the value we're getting from XRay is a signed number (int32_t). The conversion will not be faithful, and we've avoided unsigned in those cases for similar reasons.

If we decide in the future that we can actually get away with unsigned values for function ids (which I think we can) then we can change all the XRay implementations to take unsigned values for the function id, etc. -- most of which is not really worth the cost.

================
Comment at: compiler-rt/lib/xray/xray_function_call_trie.h:131
+  struct ShadowStackEntry {
+    int32_t FId;
+    uint64_t EntryTSC;
----------------
pelikan wrote:
> It's not clear to me why does this need FId when the Node pointer below has it as well.  Please add a brief comment if it's really necessary.
This is an optimisation, so that we don't actually need to reach into the pointer just to get the function id of the function at the top of the stack.

================
Comment at: compiler-rt/lib/xray/xray_function_call_trie.h:152
+    using ShadowStackAllocatorType = ShadowStackArray::AllocatorType;
+    using NodeRefAllocatorType = NodeRefAllocatorType;
+
----------------
pelikan wrote:
> Um, why is this line necessary? :-)  Line 230 should work without Allocators:: too.
This is necessary because users of this nested type need to access these exported types. In this case, because NodeRefAllocatorType is part of the FunctionCallTrie type, we're re-exporting this type through Allocators which is a public type.

================
Comment at: compiler-rt/lib/xray/xray_function_call_trie.h:255-256
+  void enterFunction(int32_t FId, uint64_t TSC) {
+    // This function primarily deals with ensuring that the ShadowStack is
+    // consistent and ready for when an exit event is encountered.
+    if (UNLIKELY(ShadowStack.empty())) {
----------------
pelikan wrote:
> Why is the comment related to the function not above the function's first line?  Same for exitFunction().
Because the comment is an implementation detail, it's explaining what it's doing rather than what users need to expect (i.e. it's not documentation of the contract, it's documentation of the implementation detail).

================
Comment at: compiler-rt/lib/xray/xray_function_call_trie.h:315-319
+  // The deepCopyInto operation will update the provided FunctionCallTrie by
+  // re-creating the contents of this particular FunctionCallTrie in the other
+  // FunctionCallTrie. It will do this using a Depth First Traversal from the
+  // roots, and while doing so recreating the traversal in the provided
+  // FunctionCallTrie.
----------------
pelikan wrote:
> It's not clear to me from this comment that this function, unlike mergeInto, may create duplicate entries.
Good point. Updated the comment and the implementation to make it clear that we're *not* destroying the state of the FunctionCallTrie in `O`.

================
Comment at: compiler-rt/lib/xray/xray_profile_collector.cc:31
+
+SpinMutex GlobalMutex;
+struct ThreadTrie {
----------------
pelikan wrote:
> I'm not sure a spinning lock is the best idea when deepCopying a huge tree, but have no data to prove anything.
Note, the intent here is to use the `GlobalMutex` to lock operations on the `ThreadTries` vector. We shouldn't be holding a lock on the `GlobalMutex` while in the process of copying the FunctionCallTrie.

================
Comment at: compiler-rt/lib/xray/xray_profile_collector.h:31-32
+/// data, in a form that allows traversal.
+class ProfileCollectorService {
+public:
+  /// Posts the FunctionCallTrie assocaited with a specific Thread ID. This
----------------
pelikan wrote:
> Why is this a class with public static methods, when it doesn't have any data or subclasses and therefore should be a namespace?
Good point. It started as a class that had member variables, until it evolved to a global implementation, which is better just as a namespace.

================
Comment at: compiler-rt/lib/xray/xray_segmented_array.h:42
+private:
+  struct Chunk {
+    typename AllocatorType::Block Block;
----------------
pelikan wrote:
> So, actually, I never liked linked lists where the prev/next pointers are in a separate region of memory, because that tends to worsen the cache miss rate when you walk through the list, and when the points at which these Chunk things are allocated are reasonably randomized along with the actual data allocations to confuse the CPU prefetcher.  Which is why I've always been using LIST/TAILQ versions from queue(3) as they embed these to the structures they're listing.  I'm not saying you should rewrite all of this now, but have you thought about putting the prev/next into the T somehow?  Is that even possible to do with C++ templates?
What you're talking about is intrusive lists. These work only if you're doing a linked list of elements, but in this case it's the chunks we're linking together. Each chunk will have a block, which is what we're managing. All the chunks come from the same region of memory.

================
Comment at: compiler-rt/lib/xray/xray_segmented_array.h:44
+    typename AllocatorType::Block Block;
+    static constexpr size_t Size = N;
+    Chunk *Prev = nullptr;
----------------
pelikan wrote:
> Why is this necessary, and we can't just use N?
Usability -- because you can do:

```
Chunk C;
assert(C.Size > 0);
```

================
Comment at: compiler-rt/test/xray/TestCases/Posix/profiling-single-threaded.cc:15-16
+
+#define XRAY_ALWAYS_INSTRUMENT [[clang::xray_always_instrument]]
+#define XRAY_NEVER_INSTRUMENT [[clang::xray_never_instrument]]
+
----------------
pelikan wrote:
> These local macros don't make it that much shorter or more readable.  Consider either removing "XRAY_" or dropping them.
Yeah, unfortunately without these clang-format gets confused. :D

https://reviews.llvm.org/D44620