[PATCH] D44620: [XRay][compiler-rt] XRay Profiling Mode

Martin Pelikán via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Apr 11 12:34:18 PDT 2018


pelikan added a comment.

Sorry for the late review, and for destroying your diff.

What I think would also make interesting test cases:

- when the call sequence is A → B → C → setjmp(3) ↓ B ↓ A → D → longjmp(3), where the exit from A happens only after the loop has run N times
- or (alternatively) strange situations involving C++ exceptions



================
Comment at: compiler-rt/lib/xray/tests/unit/allocator_test.cc:36
+  auto B1 = A.Allocate();
+  (void)B1;
+  auto B2 = A.Allocate();
----------------
ASSERT_EQ(A.Counter, 1);


================
Comment at: compiler-rt/lib/xray/tests/unit/allocator_test.cc:39
+  ASSERT_EQ(B2.Data, nullptr);
+}
+
----------------
ASSERT_EQ(A.Counter, 1);


================
Comment at: compiler-rt/lib/xray/tests/unit/function_call_trie_test.cc:39-40
+
+  Trie.enterFunction(1, 1);
+  Trie.exitFunction(1, 2);
+
----------------
For readability, can we have the TSCs be 100, 200, 300, etc.?  Right now the numbers all look the same.  (And I see a test below does that already.)


================
Comment at: compiler-rt/lib/xray/tests/unit/function_call_trie_test.cc:91-92
+
+// While missing an intermediary entry may be impossible in practice, we still
+// enforce that we can handle the case where we've missed the entry event
+// somehow, in between call entry/exits. To illustrate, imagine the following
----------------
f0 → f1 → setjmp ↓(f1→f0) longjmp ↓(f1↓f0) longjmp ↓(f1↓f0) should generate lots of exits but only one entry.  Or when the profiling starts in a signal handler.

So it shouldn't be "impossible", just "infrequent" :-)


================
Comment at: compiler-rt/lib/xray/tests/unit/function_call_trie_test.cc:106-110
+  Trie.enterFunction(1, 0);
+  Trie.enterFunction(2, 100);
+  Trie.enterFunction(3, 200);
+  Trie.exitFunction(3, 300);
+  Trie.exitFunction(1, 400);
----------------
Test name says "MissingIntermediaryEntry" but this is missing an intermediary *exit*.


================
Comment at: compiler-rt/lib/xray/tests/unit/function_call_trie_test.cc:145-150
+  Trie.enterFunction(1, 0);
+  Trie.enterFunction(2, 1);
+  Trie.exitFunction(2, 2);
+  Trie.enterFunction(3, 3);
+  Trie.exitFunction(3, 4);
+  Trie.exitFunction(1, 5);
----------------
Same here, please use multiples of 100 for the TSCs.  I also wonder whether we should test a TSC time series that isn't non-decreasing, due to TSC mismatches when rescheduling across poorly synchronized CPU packages.


================
Comment at: compiler-rt/lib/xray/tests/unit/profile_collector_test.cc:21
+TEST(ProfileCollectorServiceTest, PostSerializeCollect) {
+  // The most basic use-case (the one we actually only care about) is the one
+  // where we ensure that we can post FunctionCallTrie instances, which are then
----------------
did you mean: "the only one we actually care about"?

If we "only care" about it, what more can we do about it? :-)


================
Comment at: compiler-rt/lib/xray/tests/unit/profile_collector_test.cc:47
+
+  ASSERT_NE(B.Data, nullptr);
+}
----------------
I would at least assert that the buffer's size is within some reasonable bounds - that it has the trailing bit and a function list.  Maybe also check the zero sentinels in places.


================
Comment at: compiler-rt/lib/xray/tests/unit/profile_collector_test.cc:77
+  auto B = ProfileCollectorService::nextBuffer({nullptr, 0});
+  ASSERT_NE(B.Data, nullptr);
+}
----------------
Again, some assertions to make sure both threads are reflected in that buffer would be nice.  Doesn't have to be too strict.


================
Comment at: compiler-rt/lib/xray/tests/unit/segmented_array_test.cc:71
+
+TEST(SegmentedArrayTest, Iterators) {
+  Array<TestData> data;
----------------
Please also test what happens when you do

Array<TestData> data;
auto it = data.begin();
it--;

Because I think you'll find Offset will be SIZE_MAX.  Not sure we want that.


================
Comment at: compiler-rt/lib/xray/xray_allocator.h:12-13
+//
+// Defines the allocator interface for an arena allocator, used primarily for
+// the profiling runtime.
+//
----------------
I would at least add a TODO for adding support for replacing this allocator with any of the security-checking ones.  (ubsan?  efence?  valgrind?  I keep forgetting the names of them.  compiler-rt IIRC has some "secure" allocator too.)

We don't want to become another OpenSSL.


================
Comment at: compiler-rt/lib/xray/xray_allocator.h:66
+    // might store up to 16 pointers.
+    static constexpr auto Size = (kCacheLineSize / sizeof(Block)) - 3;
+
----------------
static_assert(Size <= 16); to make sure we don't run out of bits when a CPU manufacturer goes crazy?


================
Comment at: compiler-rt/lib/xray/xray_allocator.h:94
+    size_t Offset = 0;
+    for (auto &B : NewChain->Blocks) {
+      B.Data = BackingStore + Offset;
----------------
Consider NewChain == nullptr.  (Or BackingStore, for that matter.)


================
Comment at: compiler-rt/lib/xray/xray_function_call_trie.h:24-27
+/// A FunctionCallTrie represents the stack traces that we've encountered of
+/// XRay instrumented functions, by representing each function as a node in the
+/// trie.  The path from the root of the trie to any particular node represents
+/// a stack trace. Each node in the trie will contain some useful values,
----------------
I'd reword/reorder this slightly, to make it clearer what's actually being stored.

FunctionCallTrie represents stack traces of XRay instrumented functions that we've encountered, where a node corresponds to a function call and the path from the root to that node represents its stack trace.


================
Comment at: compiler-rt/lib/xray/xray_function_call_trie.h:84
+/// currently executing path, and on function exits quickly compute the amount
+/// of time elapsed from the entry then update the counters for the node already
+/// represented in the trie. This necessitates an efficient representation of
----------------
IIUC, a comma before "then", or a new sentence.


================
Comment at: compiler-rt/lib/xray/xray_function_call_trie.h:90
+/// minimum.
+///
+class FunctionCallTrie {
----------------
New line not necessary?


================
Comment at: compiler-rt/lib/xray/xray_function_call_trie.h:116-117
+    NodeRefArray Callees;
+    int64_t CallCount;
+    int64_t CumulativeLocalTime;
+    int32_t FId;
----------------
Why are these not unsigned?  There's no such thing as a negative count or negative time spent in a function.  ShadowStackEntry has the time as u64.

(I'd make the function ID unsigned too.)


================
Comment at: compiler-rt/lib/xray/xray_function_call_trie.h:117
+    int64_t CallCount;
+    int64_t CumulativeLocalTime;
+    int32_t FId;
----------------
Please put something like "// TSC ticks" at the end of the line, or introduce u64 typedefs for TSC ticks/deltas to avoid someone mistaking it for nanoseconds or the like.


================
Comment at: compiler-rt/lib/xray/xray_function_call_trie.h:131
+  struct ShadowStackEntry {
+    int32_t FId;
+    uint64_t EntryTSC;
----------------
It's not clear to me why this needs FId when the Node pointer below has it as well.  Please add a brief comment if it's really necessary.


================
Comment at: compiler-rt/lib/xray/xray_function_call_trie.h:152
+    using ShadowStackAllocatorType = ShadowStackArray::AllocatorType;
+    using NodeRefAllocatorType = NodeRefAllocatorType;
+
----------------
Um, why is this line necessary? :-)  Line 230 should work without Allocators:: too.


================
Comment at: compiler-rt/lib/xray/xray_function_call_trie.h:255-256
+  void enterFunction(int32_t FId, uint64_t TSC) {
+    // This function primarily deals with ensuring that the ShadowStack is
+    // consistent and ready for when an exit event is encountered.
+    if (UNLIKELY(ShadowStack.empty())) {
----------------
Why is the comment related to the function not above the function's first line?  Same for exitFunction().


================
Comment at: compiler-rt/lib/xray/xray_function_call_trie.h:315-319
+  // The deepCopyInto operation will update the provided FunctionCallTrie by
+  // re-creating the contents of this particular FunctionCallTrie in the other
+  // FunctionCallTrie. It will do this using a Depth First Traversal from the
+  // roots, and while doing so recreating the traversal in the provided
+  // FunctionCallTrie.
----------------
It's not clear to me from this comment that this function, unlike mergeInto, may create duplicate entries.


================
Comment at: compiler-rt/lib/xray/xray_profile_collector.cc:31
+
+SpinMutex GlobalMutex;
+struct ThreadTrie {
----------------
I'm not sure a spinning lock is the best idea when deepCopying a huge tree, but have no data to prove anything.


================
Comment at: compiler-rt/lib/xray/xray_profile_collector.cc:50
+
+FunctionCallTrie::Allocators *volatile GlobalAllocators = nullptr;
+
----------------
Why do we need the volatile?  It's a global, there's very little optimization the compiler can do anyway...  I'd like to see what I missed, thinking it'd be OK without it.


================
Comment at: compiler-rt/lib/xray/xray_profile_collector.cc:129
+    // List of IDs follow:
+    for (const int32_t FId : *Record.Path)
+      NextPtr =
----------------
Why not just "auto FId"?


================
Comment at: compiler-rt/lib/xray/xray_profile_collector.cc:134-138
+    // Add the sentinel here.
+    static constexpr int32_t SentinelFId = 0;
+    NextPtr = static_cast<char *>(
+                  internal_memcpy(NextPtr, &SentinelFId, sizeof(SentinelFId))) +
+              sizeof(SentinelFId);
----------------
Are you sure "static" is needed here?  With just a local zero variable, the compiler may inline the memcpy and turn it into memset, whereas you're telling it to load a thing far away in memory.

That said, I reckon "internal_memset(NextPtr, 0, 4); NextPtr += 4;" would be both easier to read and faster.


================
Comment at: compiler-rt/lib/xray/xray_profile_collector.cc:151-154
+    static constexpr uint64_t EndOfRecord = 0x0;
+    NextPtr = static_cast<char *>(
+                  internal_memcpy(NextPtr, &EndOfRecord, sizeof(EndOfRecord))) +
+              sizeof(EndOfRecord);
----------------
Same here, memset(NextPtr, 0, 8); NextPtr += 8;


================
Comment at: compiler-rt/lib/xray/xray_profile_collector.h:10
+//
+// This file is a part of XRay, a dynamic runtime instruementation system.
+//
----------------
instruementation has a typo in it -> instrumentation

Please fix other files as well.


================
Comment at: compiler-rt/lib/xray/xray_profile_collector.h:31-32
+/// data, in a form that allows traversal.
+class ProfileCollectorService {
+public:
+  /// Posts the FunctionCallTrie assocaited with a specific Thread ID. This
----------------
Why is this a class with public static methods, when it doesn't have any data or subclasses and therefore should be a namespace?


================
Comment at: compiler-rt/lib/xray/xray_segmented_array.h:42
+private:
+  struct Chunk {
+    typename AllocatorType::Block Block;
----------------
So, actually, I never liked linked lists where the prev/next pointers live in a separate region of memory from the data: it tends to worsen the cache miss rate when you walk the list, especially when the allocations of these Chunk objects end up interleaved with the actual data allocations in a way that confuses the CPU prefetcher.  That's why I've always used the LIST/TAILQ variants from queue(3), which embed the links in the structures being listed.  I'm not saying you should rewrite all of this now, but have you thought about putting the prev/next into T somehow?  Is that even possible with C++ templates?


================
Comment at: compiler-rt/lib/xray/xray_segmented_array.h:44
+    typename AllocatorType::Block Block;
+    static constexpr size_t Size = N;
+    Chunk *Prev = nullptr;
----------------
Why is this necessary, and we can't just use N?


================
Comment at: compiler-rt/lib/xray/xray_segmented_array.h:61
+    // TODO: Maybe use a separate managed allocator for Chunk instances?
+    auto C = reinterpret_cast<Chunk *>(InternalAlloc(sizeof(Chunk)));
+    C->Block = Block;
----------------
InternalAlloc can return nullptr.


================
Comment at: compiler-rt/lib/xray/xray_segmented_array.h:153
+    const T &operator*() const {
+      // DCHECK_NE(C, nullptr);
+      DCHECK_NE(C, &SentinelChunk);
----------------
I suppose these were your debugging statements which can go away (and below).


================
Comment at: compiler-rt/lib/xray/xray_segmented_array.h:242
+    DCHECK_NE(Tail, &SentinelChunk);
+    DCHECK_GE(Size, 0u);
+    auto Offset = (Size - 1) % N;
----------------
Tautology - Size is unsigned, so Size >= 0 always holds.


================
Comment at: compiler-rt/test/xray/TestCases/Posix/profiling-single-threaded.cc:15-16
+
+#define XRAY_ALWAYS_INSTRUMENT [[clang::xray_always_instrument]]
+#define XRAY_NEVER_INSTRUMENT [[clang::xray_never_instrument]]
+
----------------
These local macros don't make it that much shorter or more readable.  Consider either removing "XRAY_" or dropping them.


https://reviews.llvm.org/D44620




