[PATCH] D94110: [CSSPGO][llvm-profgen] Aggregate samples on call frame trie to speed up profile generation

Thu Jan 21 23:24:36 PST 2021

wenlei added a comment.

In D94110#2494254 <https://reviews.llvm.org/D94110#2494254>, @wlei wrote:

> In D94110#2489390 <https://reviews.llvm.org/D94110#2489390>, @wmi wrote:
>
>>> This change speeds up this by grouping all the call frame within one LBR sample into a trie and aggregating the result(sample counter) on it.
>>
>> 5x speedup shows it is a really impressive improvement. I am wondering whether there is callstack overlap between different LBR samples so you can have further grouping of call frames -- by reusing unwindState. You may also save some cost by reusing the frame trie. IIUC although samples have been aggregated based on callstack, each LBR sample may have multiple callstacks inferred from unwindCall/unwindReturn. If there are callstack overlap between different LBR samples, you may be able to further group them.

Yeah, the aggregation was based on stack + LBR sample, so events with same stack but different LBR won't be aggregated today. Using a global trie would help defer context generation for each of the aggregations if they lead to shared context..

I think we could experiment how helpful that is by checking how often we generate a context that's already in the context profile map when traversing the trie. If that happens very often, then it suggests global trie could save more..

> Yes, this is a very good idea, thanks. Further grouping of call frames might allow us not needing the aggregated samples map, i.e. we can do profile generation on the tree during DFS.

I think the aggregation is still worth keeping as it's probably still cheaper than trie with hashing. But we will know for sure through experiments.

> One of my concern on this is that if we maintain a whole tree of call frame, it might affect the performance of getOrCreateNode() during the unwinding, because it will have more children of tree node increasing the hash collision. but if the overlapping degree is high, it will surely improve the performance. I will try this and do more experiments on the heavy workloads in SPEC.

================
Comment at: llvm/tools/llvm-profgen/PerfReader.cpp:123
+      } else {
+        CallStack.push_back(reinterpret_cast<uint64_t>(CallProbe));
+      }
----------------
Would be nice to have a unified representation for frame stack and probe stack, but without replying on reinterpret_cast of probe pointer.. 

Using uint64_t then rely on Binary->usePseudoProbes() to decide how to interpret the value seem less than ideal.. Do this through template functions with the different part done through specialization helpers? 

================
Comment at: llvm/tools/llvm-profgen/PerfReader.h:204
+  // Call stack trie node
+  struct Frame {
+    const uint64_t Address = 0;
----------------
What about name it `ProfiledFrame`? (we have ProfileBinary)

================
Comment at: llvm/tools/llvm-profgen/PerfReader.h:206
+    const uint64_t Address = 0;
+    Frame *Parent = nullptr;
+    SampleVector RangeSamples;
----------------
Raw pointer is used for parent, but unique_ptr is used for child. Would be good to keep consistent.

================
Comment at: llvm/tools/llvm-profgen/PerfReader.h:211
+
+    Frame() {}
+    Frame(uint64_t Addr) : Address(Addr) {}
----------------
This ctor can be merged into the one below with default parameter.

================
Comment at: llvm/tools/llvm-profgen/PerfReader.h:281
+
+  void initCallStackTrie(const SmallVectorImpl<uint64_t> &CallStack) {
+    Frame *Cur = &DummyTrieRoot;
----------------
A few more naming nits: 

initFrameTrie? CallStackTrie prompts a trie whose nodes are call stack.. 

appendFrame and popStackLeaf are functionally symmetric, would be good for them to follow similar names. e.g. pushFrame, popFrame. and then swithToFrame..

================
Comment at: llvm/tools/llvm-profgen/PerfReader.h:419
   void unwindBranchWithinFrame(UnwindState &State);
+  void recordSampleWithinFrame(UnwindState::Frame *Cur,
+                               SmallVectorImpl<uint64_t> &CallStack);
----------------
We "record" samples on trie (have recordRangeCount and recordBranchCount), then "collect" samples from trie into context sample maps via a DFS traversal.

So we could have a more intuitive convention for readability:

- ProfiledFrame::recordRangeCount
- ProfiledFrame::recordBranchCount
- VirtualUnwinder::collectSamplesFromFrame
- VirtualUnwinder::collectSamplesFromFrameTrie

================
Comment at: llvm/tools/llvm-profgen/PerfReader.h:441-442
   // ProfileGenerator
-  SampleCounter &getOrCreateCounterForProbe(const ProfiledBinary *Binary,
-                                            std::list<uint64_t> &CallStack);
+  SampleCounter &
+  getOrCreateCounterForProbe(SmallVectorImpl<uint64_t> &CallStack);

----------------
We can make this a private member since it's only a helper for getOrCreateCounter. (a few other functions can be private too). 

================
Comment at: llvm/tools/llvm-profgen/PerfReader.h:227
+  Frame FrameTrieRoot;
+  Frame *StackLeaf;
   // Used to fall through the LBR stack
----------------
wlei wrote:
> hoy wrote:
> > Nit: name it `FrameCurrentLeaf`?
> Good point
nit: name them `CurrentLeafFrame` and `DummyRootFrame`? 

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D94110/new/

https://reviews.llvm.org/D94110