[PATCH] D94110: [CSSPGO][llvm-profgen] Aggregate samples on call frame trie to speed up profile generation

Tue Jan 26 23:25:18 PST 2021

wenlei added a comment.

In D94110#2523532 <https://reviews.llvm.org/D94110#2523532>, @wlei wrote:

>> I think the aggregation is still worth keeping as it's probably still cheaper than trie with hashing. But we will know for sure through experiments.
>
>
>
> In D94110#2514693 <https://reviews.llvm.org/D94110#2514693>, @wenlei wrote:
>
>> With the latest, do you see similar speed up for probe profile and dwarf profile?
>
> have implemented the global trie for the virtual unwinder part and done the evaluation over the LBR trie against baseline and global trie on probe profile. See the chart below, here only use `gobmk` and `sjeng` because other benchmarks's run time of virtual unwinder are very small(less than 1s).
>
> The virtual unwinding time(s):
>
> |       | No-Trie | LBR-Trie | Global-Trie |
> | ----- | ------- | -------- | ----------- |
> | gobmk | 8.31    | 3.6      | 4.11        |
> | sjeng | 46.82   | 19.15    | 24.91       |
> |
>
> Speed up:
>
> |       | LBR-Trie vs No-Trie | Global-Trie vs No-Trie | Global-Trie vs LBR-Trie |
> | ----- | ------------------- | ---------------------- | ----------------------- |
> | gobmk | 2.3                 | 2.02                   | 0.88                    |
> | sjeng | 2.44                | 1.87                   | 0.77                    |
> |
>
> Sum-up:
>
> - LBR-Trie and Global-Trie can have about 2x speed-up over the baseline
> - Global-Trie have slight regression(about 10%) against LBR-Trie as we discussed this might be caused by hash overhead.
> - Didn't see the similar speed-up as our internal prototype, I guess it's because of the refinement we did in the previous patches like removing the redundant string concatenation and reversal, switching to use probe based key.
>
> Beside the time, LBR-trie seems good for the readability but Global-Trie can be easy to detect shared context. Haven't tried to defer the context generation(should have more speed-up for Global-Trie) and detect shared context, will try it.
>
> CC: @wmi  @hoy

Thanks for the quick experiment! Given that we don't see immediate speed up from global trie, I'm inclined to just use what you have in this patch, and defer further improvement for the future. What do you think?

================
Comment at: llvm/tools/llvm-profgen/PerfReader.h:229
+  ProfiledFrame DummyTrieRoot;
+  ProfiledFrame *FrameCurrentLeaf;
   // Used to fall through the LBR stack
----------------
FrameCurrentLeaf -> CurrentLeafFrame?

================
Comment at: llvm/tools/llvm-profgen/PerfReader.h:484
                          uint64_t Repeat);
-  SampleCounter &getOrCreateCounter(const ProfiledBinary *Binary,
-                                    std::list<uint64_t> &CallStack);
-  // Use pseudo probe based context key to get the sample counter
-  // A context stands for a call path from 'main' to an uninlined
-  // callee with all inline frames recovered on that path. The probes
-  // belonging to that call path is the probes either originated from
-  // the callee or from any functions inlined into the callee. Since
-  // pseudo probes are organized in a tri-tree style after decoded,
-  // the tree path from the tri-tree root (which is the uninlined
-  // callee) to the probe node forms an inline context.
-  // Here we use a list of probe(pointer) as the context key to speed up
-  // aggregation and the final context string will be generate in
-  // ProfileGenerator
-  SampleCounter &getOrCreateCounterForProbe(const ProfiledBinary *Binary,
-                                            std::list<uint64_t> &CallStack);
-
 private:
   ContextSampleCounterMap *CtxCounterMap;
----------------
this private can be removed since there's one on line 456.

================
Comment at: llvm/tools/llvm-profgen/PerfReader.h:206
+    const uint64_t Address = 0;
+    Frame *Parent = nullptr;
+    SampleVector RangeSamples;
----------------
wlei wrote:
> wenlei wrote:
> > Raw pointer is used for parent, but unique_ptr is used for child. Would be good to keep consistent.
> Oh, the child is a new allocated instance so we use smart pointer for it, but the parent pointer can always point to a pre-allocated instance or null(for root). Also seems it would cause a recursive deconstruction of unique_ptr error if making parent unique_ptr.
Definitely not unique_ptr for parent and multiple children would keep pointer to same parent. I meant to say consistency between smart pointer vs raw pointer, and in the case of parent, a shared_ptr?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D94110/new/

https://reviews.llvm.org/D94110