[PATCH] D139367: [CSSPGO][llvm-profgen] A tail call tracker to infer missing tail call frames.

Wed Dec 7 13:18:16 PST 2022

hoy marked 2 inline comments as done.
hoy added inline comments.

================
Comment at: llvm/tools/llvm-profgen/ProfiledBinary.h:252-256
+  // A map of call instructions to their target addresses.
+  std::unordered_multimap<uint64_t, uint64_t> CallTargets;
+
+  // A map of tail call instructions to their target addresses.
+  std::unordered_multimap<uint64_t, uint64_t> TailCallTargets;
----------------
wenlei wrote:
> hoy wrote:
> > wenlei wrote:
> > > 1. Did you choose `unordered_multimap<uint64_t, uint64_t>` over `unordered_map<uint64_t, unordered_set<uint64_t>>` to optimize for mostly 1 target per call? multimap has multiple copies of the keys too, but map doesn't. The later feels a bit cleaner..
> > > 
> > > 2. The name is bit confusing because this contains both call source and target. 
> > > 
> > > 3. I haven't think through this carefully, but these data structures seems quite ad-hoc in that it somewhat overlaps with both CallAddressSet above and the data in the new tracker. I know that each serves slightly different purpose, but it just feels a bit unorganized, and bolted on the way it is now.  
> > > Did you choose unordered_multimap<uint64_t, uint64_t> over unordered_map<uint64_t, unordered_set<uint64_t>> to optimize for mostly 1 target per call? multimap has multiple copies of the keys too, but map doesn't. The later feels a bit cleaner..
> > 
> > Yeah, that's the point. Single targets take 95% and with multipmap it wouldn't bother construct a container for each of those targets. Duplications are handled in `refineTailCallTargets` .
> > 
> > > The name is bit confusing because this contains both call source and target.
> > 
> > How about CallPairs or just CallEdges?
> > 
> > 
> > > I haven't think through this carefully, but these data structures seems quite ad-hoc in that it somewhat overlaps with both CallAddressSet above and the data in the new tracker. I know that each serves slightly different purpose, but it just feels a bit unorganized, and bolted on the way it is now.
> > 
> > Yeah, they are separated mostly because of efficiency. A `unordered_map<uint64_t, unordered_set<uint64_t>>` would unify them but at the expense of efficiency. 
> There's also duplication between ProfiledBinary members and MissingFrameInferrer members. 
The MissingFrameInferrer members are mostly `BinaryFunction` based while the ProfiledBinary members are address based. The separation is to reduce the disassembling time by avoiding the address to BinaryFunction lookup there.

================
Comment at: llvm/tools/llvm-profgen/TailCallTracker.cpp:50
+  // Deduplicate
+  if (!FuncTailCalls.empty()) {
+    auto I = FuncTailCalls.begin();
----------------
wenlei wrote:
> hoy wrote:
> > wenlei wrote:
> > > Why use `unordered_multimap` to allow duplication in the first place? With `unordered_map<key_t, unordered_set<value_t>>` for both `TailCallTargets` and `FuncTailCalls`, there won't be duplicates. 
> > It's about efficiency. 95% of tail calls have a single target, using `unordered_set` for them sounds like unnecessary. WDYT?
> How much performance does it help? The dedup also has cost. I think the use multimap has readability cost in quite a few places (makes things more complicated). one example is the sliding window you used in order to count tail calls. If the perf cost isn't too much, I'd favor simplicity and readability. 
It's hard to measure without implementing the non-multimap version. But I can think that besides the extra constructor call and the extra memory that the container takes, there's also an extra cost in accessing a single element in the container. 

> one example is the sliding window you used in order to count tail calls. If the perf cost isn't too much,

This is debug code. I can rewrite it by converting the multimap to non-multimap, if it helps readability. 

================
Comment at: llvm/tools/llvm-profgen/TailCallTracker.cpp:115
+
+uint64_t TailCallContextTracker::computeUniqueTailCallPath(
+    BinaryFunction *From, BinaryFunction *To, SmallVectorImpl<uint64_t> &Path) {
----------------
wenlei wrote:
> hoy wrote:
> > wenlei wrote:
> > > wenlei wrote:
> > > > The return value can be simplified to use boolean instead of int, because we only care whether there's unique path? 
> > > > 
> > > > We may need to check for cycle at caller side, so we avoid the need to return 0. 
> > > Do we want to limit the level of recursive search? Practically, it's probably very rare to have 3+ back to back tail calls, and three consecutive missing frames due to that. I'm not sure how much this can help though.  
> > > 
> > > If we open this up to general missing frame inference beyond tail call, such pruning might be useful. 
> > > The return value can be simplified to use boolean instead of int, because we only care whether there's unique path?
> > 
> > Hmm, it currently returns three kind of values, 0, 1 and multiple. I'm not seeing how a boolean could make the logic simpler.
> > 
> > > Do we want to limit the level of recursive search? Practically, it's probably very rare to have 3+ back to back tail calls, and three consecutive missing frames due to that. I'm not sure how much this can help though.
> > 
> > Having a limit of the search level makes sense, but I'm not sure what is a good value there. It' not uncommon to have a tail call chain with more than 3 frames. For example, a chain consists of one-liner getter helpers. I can get some numbers and see.
> > 
> Ok, I thought that returning 0 was only for cycles. But now I see that 0 is also possible for unreachable cases. Yeah, we need a tri-state here. Maybe use a proper tri-state return value? The thing is the returned value also isn't really NumPaths because of early bail on NumPaths > 1. 
> 
> I'm fine with what it is now too. 
> Having a limit of the search level makes sense, but I'm not sure what is a good value there. It' not uncommon to have a tail call chain with more than 3 frames. For example, a chain consists of one-liner getter helpers. I can get some numbers and see.

An update on this. The longest tail call path for a medium benchmark is 16. There are 5% paths whose length is larger than 3.  

================
Comment at: llvm/tools/llvm-profgen/TailCallTracker.cpp:154
+    NumPaths += computeUniqueTailCallPath(TailCall->second, To, Path);
+    // Reachable via multiple paths is considered unreachable. Stop analyzing
+    // the remaining if we are already seeing more than one reachable paths.
----------------
wenlei wrote:
> hoy wrote:
> > wenlei wrote:
> > > > Reachable via multiple paths is considered unreachable 
> > > 
> > > This is just non-unique path situation. Why do we need to introduce a notion of "unreachable" while it is actually reachable? This is probably pure terminology issue, but I don't think I understand the rational behind "unreachable".
> > The comment is confusing. `unreachable` has its literal meaning in the algorithm. That reachable via multiple paths has a different meaning. I just simplified the comment to "Stop analyzing the remaining if we are already seeing more than one reachable paths."
> The updated comment looks good to me. And I think I understand what you're doing here, but I'm curious - what is the literal meaning of unreachable in the algorithm you mentioned? I understand that we want to stop processing upon multiple paths, but stop process is different from unreachable, even though unreachable will lead to top processing for sure. 
By literal meaning of unreachable, I mean the there isn't a path available on the dynamic call graph, probably because the call edges are not LBR-sampled. 

================
Comment at: llvm/tools/llvm-profgen/TailCallTracker.cpp:186
+
+uint64_t TailCallContextTracker::inferMissingFrames(
+    uint64_t From, uint64_t To, SmallVectorImpl<uint64_t> &UniquePath) {
----------------
wenlei wrote:
> hoy wrote:
> > wenlei wrote:
> > > It doesn't seem like we need to return num of path here. We bail early when `NumPaths > 1`, so the number isn't accurate anyways. At most, a boolean indicated whether we found a unique path should be enough. 
> > Not sure a boolean is enough. E.g, it wouldn't differentiate the following two cases when inferring missing frames between A -> F 
> > 
> > - Case 1 with actually paths:
> > 
> > ```
> > A -> B -> C -> F
> > 
> > A -> B -> D -> F
> > 
> > A -> E -> F
> > ```
> > 
> > The integer return value can tell that we should stop analyzing the path `A -> E -> F` because there are already multiple available paths. Returning 0 or false would not tell it apart from the case below.
> > 
> > - Case 2 with actually paths:
> > 
> > ```
> > A -> B -> C 
> > 
> > A -> E -> F
> > ```
> > 
> > 
> I see the need for tri-state, but this one is a bit different from the return value of `computeUniqueTailCallPath`. Here, what I really meant is that you didn't use the return value of `inferMissingFrames` anywhere, or did miss anything? I think that having a boolean to indicate whether we successfully inferred something could be useful even if we aren't using it now.
Oh yeah, the return value of `inferMissingFrames` is not used anywhere. Changed it to boolean type.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D139367/new/

https://reviews.llvm.org/D139367