[PATCH] D139367: [CSSPGO][llvm-profgen] A tail call tracker to infer missing tail call frames.

Wed Dec 7 12:44:29 PST 2022

wenlei added a comment.

Thanks for the changes and addressing the feedbacks. Can you also rename all the instance of tail call tracker accordingly, for change description as well as title? And to be consistent, use simple terms instead of "materialized" in change description too?

================
Comment at: llvm/tools/llvm-profgen/ProfileGenerator.cpp:827
+        assert(CallTargets.count(From) == 1 &&
+               "A callsite should only appear once with either a known a zero "
+               "target value at this point");
----------------
hoy wrote:
> wenlei wrote:
> > > either a known a zero
> > typo? I also don't understand this message. 
> should be "a known or a zero"
maybe "a known or a zero (unknown)"?

================
Comment at: llvm/tools/llvm-profgen/ProfiledBinary.h:252-256
+  // A map of call instructions to their target addresses.
+  std::unordered_multimap<uint64_t, uint64_t> CallTargets;
+
+  // A map of tail call instructions to their target addresses.
+  std::unordered_multimap<uint64_t, uint64_t> TailCallTargets;
----------------
hoy wrote:
> wenlei wrote:
> > 1. Did you choose `unordered_multimap<uint64_t, uint64_t>` over `unordered_map<uint64_t, unordered_set<uint64_t>>` to optimize for mostly 1 target per call? multimap has multiple copies of the keys too, but map doesn't. The later feels a bit cleaner..
> > 
> > 2. The name is bit confusing because this contains both call source and target. 
> > 
> > 3. I haven't think through this carefully, but these data structures seems quite ad-hoc in that it somewhat overlaps with both CallAddressSet above and the data in the new tracker. I know that each serves slightly different purpose, but it just feels a bit unorganized, and bolted on the way it is now.  
> > Did you choose unordered_multimap<uint64_t, uint64_t> over unordered_map<uint64_t, unordered_set<uint64_t>> to optimize for mostly 1 target per call? multimap has multiple copies of the keys too, but map doesn't. The later feels a bit cleaner..
> 
> Yeah, that's the point. Single targets take 95% and with multipmap it wouldn't bother construct a container for each of those targets. Duplications are handled in `refineTailCallTargets` .
> 
> > The name is bit confusing because this contains both call source and target.
> 
> How about CallPairs or just CallEdges?
> 
> 
> > I haven't think through this carefully, but these data structures seems quite ad-hoc in that it somewhat overlaps with both CallAddressSet above and the data in the new tracker. I know that each serves slightly different purpose, but it just feels a bit unorganized, and bolted on the way it is now.
> 
> Yeah, they are separated mostly because of efficiency. A `unordered_map<uint64_t, unordered_set<uint64_t>>` would unify them but at the expense of efficiency. 
There's also duplication between ProfiledBinary members and MissingFrameInferrer members. 

================
Comment at: llvm/tools/llvm-profgen/TailCallTracker.cpp:50
+  // Deduplicate
+  if (!FuncTailCalls.empty()) {
+    auto I = FuncTailCalls.begin();
----------------
hoy wrote:
> wenlei wrote:
> > Why use `unordered_multimap` to allow duplication in the first place? With `unordered_map<key_t, unordered_set<value_t>>` for both `TailCallTargets` and `FuncTailCalls`, there won't be duplicates. 
> It's about efficiency. 95% of tail calls have a single target, using `unordered_set` for them sounds like unnecessary. WDYT?
How much performance does it help? The dedup also has cost. I think the use multimap has readability cost in quite a few places (makes things more complicated). one example is the sliding window you used in order to count tail calls. If the perf cost isn't too much, I'd favor simplicity and readability. 

================
Comment at: llvm/tools/llvm-profgen/TailCallTracker.cpp:60
+#if LLVM_ENABLE_STATS
+  // Counting number of function with single/mutiple tail callsites.
+  if (FuncTailCalls.size() == 1) {
----------------
typo: mutiple -> multiple

================
Comment at: llvm/tools/llvm-profgen/TailCallTracker.cpp:61-63
+  if (FuncTailCalls.size() == 1) {
+    TailCallFuncSingleTailCalls++;
+  } else if (FuncTailCalls.size() > 1) {
----------------
hoy wrote:
> wenlei wrote:
> > `FuncTailCalls` contains all functions and their associated tail calls, and it's not a per-function map. Why size 1 means a single tail call function?
> Size 1 means there is only one tail call in the program. It in turn means there's only one function having single tail call.
Ok, I think the confusion comes from multimap, where one element means not only one function, but also one tail call in the program. And this special case is needed for the sliding window below to work.

How much does it help for performance by using multimap? I think that readability will improve if you use a map of a set instead. 

================
Comment at: llvm/tools/llvm-profgen/TailCallTracker.cpp:115
+
+uint64_t TailCallContextTracker::computeUniqueTailCallPath(
+    BinaryFunction *From, BinaryFunction *To, SmallVectorImpl<uint64_t> &Path) {
----------------
hoy wrote:
> wenlei wrote:
> > wenlei wrote:
> > > The return value can be simplified to use boolean instead of int, because we only care whether there's unique path? 
> > > 
> > > We may need to check for cycle at caller side, so we avoid the need to return 0. 
> > Do we want to limit the level of recursive search? Practically, it's probably very rare to have 3+ back to back tail calls, and three consecutive missing frames due to that. I'm not sure how much this can help though.  
> > 
> > If we open this up to general missing frame inference beyond tail call, such pruning might be useful. 
> > The return value can be simplified to use boolean instead of int, because we only care whether there's unique path?
> 
> Hmm, it currently returns three kind of values, 0, 1 and multiple. I'm not seeing how a boolean could make the logic simpler.
> 
> > Do we want to limit the level of recursive search? Practically, it's probably very rare to have 3+ back to back tail calls, and three consecutive missing frames due to that. I'm not sure how much this can help though.
> 
> Having a limit of the search level makes sense, but I'm not sure what is a good value there. It' not uncommon to have a tail call chain with more than 3 frames. For example, a chain consists of one-liner getter helpers. I can get some numbers and see.
> 
Ok, I thought that returning 0 was only for cycles. But now I see that 0 is also possible for unreachable cases. Yeah, we need a tri-state here. Maybe use a proper tri-state return value? The thing is the returned value also isn't really NumPaths because of early bail on NumPaths > 1. 

I'm fine with what it is now too. 

================
Comment at: llvm/tools/llvm-profgen/TailCallTracker.cpp:154
+    NumPaths += computeUniqueTailCallPath(TailCall->second, To, Path);
+    // Reachable via multiple paths is considered unreachable. Stop analyzing
+    // the remaining if we are already seeing more than one reachable paths.
----------------
hoy wrote:
> wenlei wrote:
> > > Reachable via multiple paths is considered unreachable 
> > 
> > This is just non-unique path situation. Why do we need to introduce a notion of "unreachable" while it is actually reachable? This is probably pure terminology issue, but I don't think I understand the rational behind "unreachable".
> The comment is confusing. `unreachable` has its literal meaning in the algorithm. That reachable via multiple paths has a different meaning. I just simplified the comment to "Stop analyzing the remaining if we are already seeing more than one reachable paths."
The updated comment looks good to me. And I think I understand what you're doing here, but I'm curious - what is the literal meaning of unreachable in the algorithm you mentioned? I understand that we want to stop processing upon multiple paths, but stop process is different from unreachable, even though unreachable will lead to top processing for sure. 

================
Comment at: llvm/tools/llvm-profgen/TailCallTracker.cpp:186
+
+uint64_t TailCallContextTracker::inferMissingFrames(
+    uint64_t From, uint64_t To, SmallVectorImpl<uint64_t> &UniquePath) {
----------------
hoy wrote:
> wenlei wrote:
> > It doesn't seem like we need to return num of path here. We bail early when `NumPaths > 1`, so the number isn't accurate anyways. At most, a boolean indicated whether we found a unique path should be enough. 
> Not sure a boolean is enough. E.g, it wouldn't differentiate the following two cases when inferring missing frames between A -> F 
> 
> - Case 1 with actually paths:
> 
> ```
> A -> B -> C -> F
> 
> A -> B -> D -> F
> 
> A -> E -> F
> ```
> 
> The integer return value can tell that we should stop analyzing the path `A -> E -> F` because there are already multiple available paths. Returning 0 or false would not tell it apart from the case below.
> 
> - Case 2 with actually paths:
> 
> ```
> A -> B -> C 
> 
> A -> E -> F
> ```
> 
> 
I see the need for tri-state, but this one is a bit different from the return value of `computeUniqueTailCallPath`. Here, what I really meant is that you didn't use the return value of `inferMissingFrames` anywhere, or did miss anything? I think that having a boolean to indicate whether we successfully inferred something could be useful even if we aren't using it now.

================
Comment at: llvm/tools/llvm-profgen/TailCallTracker.h:23
+
+class TailCallContextTracker {
+public:
----------------
hoy wrote:
> wenlei wrote:
> > The core improvement is to find unique path between a caller frame and callee frame, so we can infer missing frames. You identified cases of missing frames caused by tail call, so you centered the implementation around tail call. But actually recovering missing frame for FPO is not that different either. 
> > 
> > So what I'm thinking about is, have a way to not limit this to just handling tail call, instead of `TailCallContextTracker`, make it a generic `MissingFrameInferrer`. The trade off is that the search is going to be more expensive and more likely to have multiple path. But implementation wise it should be fairly similar and could be turned on/off via switch. 
> > 
> > We don't have to cover everything now, but I think `MissingFrameInferrer` could be a better name than `TailCallContextTracker`, even if we decided to limit to tail call for now. This really is an inferrer instead of a tracker. WDYT? 
> > 
> > Correspondingly `computeUniqueTailCallPath` would be `computeUniqueCallPath` even if it only deals with tail call now. 
> Yeah, renaming `TailCallContextTracker` as `MissingFrameInferrer` sounds good. 
> 
> > Correspondingly computeUniqueTailCallPath would be computeUniqueCallPath even if it only deals with tail call now.
> 
> I still like `computeUniqueTailCallPath` since what it real does is to compute a unique tail call path. Moving forward we could extend `inferMissingFrames` to make an extra call to another say `computeUniqueFPOPath` or something?
If we extend it in the future, I think it won't be a separate function. Because the tweak is going to be about what is being considered a viable edge -- currently it's tail call only, but later it can be more. I don't think we need a completely separate DFS to accommodate other edges in the future. That's why I think having one general function for DFS is reasonable. And the name can reflect that. But this isn't critical.

================
Comment at: llvm/tools/llvm-profgen/TailCallTracker.h:83-85
+  std::unordered_map<FrameTransition, std::vector<uint64_t>, PairHash>
+      UniquePaths;
+  std::unordered_map<FrameTransition, uint64_t, PairHash> NonUniquePaths;
----------------
hoy wrote:
> wenlei wrote:
> > 1. Add comment for the two data structures - what is the vector and the uint64_t value.
> > 
> > 2. I'd avoid introducing terms when more intuitive and plain expression is available. How about just use `CallerCalleePair` instead of `FrameTransition`? This also goes better with `PairHash`
> > 
> > 3. For `NonUniquePaths`, do we really care how many paths are there? Would an `unordered_set` be good enough?
> > 
> > 4. Can we merge `UniquePaths` and `NonUniquePaths`? so we don't need two separate look up. we could use empty vector to represent non-unique path?
> `CallerCalleePair` sounds good.
> 
> 
> > For NonUniquePaths, do we really care how many paths are there? Would an unordered_set be good enough?
> 
> We don't really care. `unordered_set` should be good enough.
> 
> > Can we merge UniquePaths and NonUniquePaths? so we don't need two separate look up. we could use empty vector to represent non-unique path?
> 
> Constructing an empty vector also takes extra memory and time, but is probably better than an extra hash lookup? 
> Constructing an empty vector also takes extra memory and time, but is probably better than an extra hash lookup? 

I'm not sure if this is perf critical. But I was thinking more about better structure, consolidation and readability. 

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D139367/new/

https://reviews.llvm.org/D139367