[llvm] [memprof] Add CallStackRadixTreeBuilder (PR #93784)

Thu Jun 6 13:30:18 PDT 2024

================
@@ -410,6 +410,153 @@ CallStackId hashCallStack(ArrayRef<FrameId> CS) {
   return CSId;
 }
 
+// Encode a call stack into RadixArray.  Return the starting index within
+// RadixArray.  For each call stack we encode, we emit two or three components
+// into RadixArray.  If a given call stack doesn't have a common prefix relative
+// to the previous one, we emit:
+//
+// - the frames in the given call stack in the root-to-leaf order
+//
+// - the length of the given call stack
+//
+// If a given call stack has a non-empty common prefix relative to the previous
+// one, we emit:
+//
+// - the relative location of the common prefix, encoded as a negative number.
+//
+// - a portion of the given call stack that's beyond the common prefix
+//
+// - the length of the given call stack, including the length of the common
+//   prefix.
+//
+// The resulting RadixArray requires a somewhat unintuitive backward traversal
+// to reconstruct a call stack -- read the call stack length and scan backward
+// while collecting frames in the leaf to root order.  build, the caller of this
+// function, reverses RadixArray in place so that we can reconstruct a call
+// stack as if we were deserializing an array in a typical way -- the call stack
+// length followed by the frames in the leaf-to-root order except that we need
+// to handle pointers to parents along the way.
+//
+// To quickly determine the location of the common prefix within RadixArray,
+// Indexes caches the indexes of the previous call stack's frames within
+// RadixArray.
+LinearCallStackId CallStackRadixTreeBuilder::encodeCallStack(
+    const llvm::SmallVector<FrameId> *CallStack,
+    const llvm::SmallVector<FrameId> *Prev,
+    const llvm::DenseMap<FrameId, LinearFrameId> &MemProfFrameIndexes) {
+  // Compute the length of the common root prefix between Prev and CallStack.
+  uint32_t CommonLen = 0;
+  if (Prev) {
+    auto Pos = std::mismatch(Prev->rbegin(), Prev->rend(), CallStack->rbegin(),
+                             CallStack->rend());
+    CommonLen = std::distance(CallStack->rbegin(), Pos.second);
+  }
+
+  // Drop the portion beyond CommonLen.
+  assert(CommonLen <= Indexes.size());
+  Indexes.resize(CommonLen);
+
+  // Append a pointer to the parent.
+  if (CommonLen) {
+    uint32_t CurrentIndex = RadixArray.size();
+    uint32_t ParentIndex = Indexes.back();
+    // The offset to the parent must be negative because we are pointing to an
+    // element we've already added to RadixArray.
+    assert(ParentIndex < CurrentIndex);
+    RadixArray.push_back(ParentIndex - CurrentIndex);
+  }
+
+  // Copy the part of the call stack beyond the common prefix to RadixArray.
+  assert(CommonLen <= CallStack->size());
+  for (FrameId F : llvm::drop_begin(llvm::reverse(*CallStack), CommonLen)) {
+    // Remember the index of F in RadixArray.
+    Indexes.push_back(RadixArray.size());
+    RadixArray.push_back(MemProfFrameIndexes.find(F)->second);
+  }
+  assert(CallStack->size() == Indexes.size());
+
+  // End with the call stack length.
+  RadixArray.push_back(CallStack->size());
+
+  // Return the index within RadixArray where we can start reconstructing a
+  // given call stack from.
+  return RadixArray.size() - 1;
+}
+
+void CallStackRadixTreeBuilder::build(
+    llvm::MapVector<CallStackId, llvm::SmallVector<FrameId>>
+        &&MemProfCallStackData,
+    const llvm::DenseMap<FrameId, LinearFrameId> &MemProfFrameIndexes) {
+  // Take the vector portion of MemProfCallStackData.  The vector is exactly
+  // what we need to sort.  Also, we no longer need its lookup capability.
+  llvm::SmallVector<CSIdPair, 0> CallStacks = MemProfCallStackData.takeVector();
+
+  // Sort the list of call stacks in the dictionary order to maximize the length
+  // of the common prefix between two adjacent call stacks.
+  llvm::sort(CallStacks, [&](const CSIdPair &L, const CSIdPair &R) {
+    // Call stacks are stored from leaf to root.  Perform comparisons from the
+    // root.
+    return std::lexicographical_compare(
+        L.second.rbegin(), L.second.rend(), R.second.rbegin(), R.second.rend(),
+        [&](FrameId F1, FrameId F2) { return F1 < F2; });
+  });
+
+  // Reserve some reasonable amount of storage.
+  RadixArray.clear();
+  RadixArray.reserve(CallStacks.size() * 8);
+
+  // Indexes will grow as long as the longest call stack.
+  Indexes.clear();
+  Indexes.reserve(512);
+
+  // Compute the radix array.  We encode one call stack at a time, computing the
+  // longest prefix that's shared with the previous call stack we encode.  For
+  // each call stack we encode, we remember a mapping from CallStackId to its
+  // position within RadixArray.
+  //
+  // As an optimization, we encode from the last call stack in CallStacks to
+  // reduce the number of times we follow pointers to the parents.  Consider the
+  // list of call stacks that has been sorted in the dictionary order:
+  //
+  // Call Stack 1: F1
+  // Call Stack 2: F1 -> F2
+  // Call Stack 3: F1 -> F2 -> F3
+  //
+  // If we traversed CallStacks in the forward order, we would end up with a
+  // radix tree like:
+  //
+  // Call Stack 1:  F1
+  //                |
+  // Call Stack 2:  +---> F2
+  //                      |
+  // Call Stack 3:        +---> F3
+  //
+  // Notice that each call stack jumps to the previous one.  However, if we
+  // traverse CallStacks in the reverse order, then Call Stack 3 has the
+  // complete call stack encoded without any pointers.  Call Stack 1 and 2 point
+  // to appropriate prefixes of Call Stack 3.
+  const llvm::SmallVector<FrameId> *Prev = nullptr;
+  for (const auto &[CSId, CallStack] : llvm::reverse(CallStacks)) {
+    LinearCallStackId Pos =
+        encodeCallStack(&CallStack, Prev, MemProfFrameIndexes);
+    CallStackPos.insert({CSId, Pos});
+    Prev = &CallStack;
+  }
+
+  if (RadixArray.size() >= 2) {
----------------
teresajohnson wrote:

What happens if RadixArray.size() < 2? If it is 0, is that a case where we could/should be returning early from this function? If 1, afaict the loops below will end up as no-ops (first shouldn't execute, second I believe will just set 0 to -0 if I am following how the CallStackPos would be set up in that case).

https://github.com/llvm/llvm-project/pull/93784