[PATCH] D139367: [CSSPGO][llvm-profgen] Missing frame inference.

Hongtao Yu via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Dec 15 18:26:53 PST 2022


hoy added inline comments.


================
Comment at: llvm/tools/llvm-profgen/ProfiledBinary.h:252-256
+  // A map of call instructions to their target addresses.
+  std::unordered_multimap<uint64_t, uint64_t> CallTargets;
+
+  // A map of tail call instructions to their target addresses.
+  std::unordered_multimap<uint64_t, uint64_t> TailCallTargets;
----------------
wenlei wrote:
> hoy wrote:
> > wenlei wrote:
> > > hoy wrote:
> > > > wenlei wrote:
> > > > > 1. Did you choose `unordered_multimap<uint64_t, uint64_t>` over `unordered_map<uint64_t, unordered_set<uint64_t>>` to optimize for mostly 1 target per call? multimap has multiple copies of the keys too, but map doesn't. The latter feels a bit cleaner.
> > > > > 
> > > > > 2. The name is a bit confusing because this contains both the call source and target.
> > > > > 
> > > > > 3. I haven't thought through this carefully, but these data structures seem quite ad hoc in that they somewhat overlap with both CallAddressSet above and the data in the new tracker. I know that each serves a slightly different purpose, but it just feels a bit unorganized, and bolted on the way it is now.
> > > > > Did you choose unordered_multimap<uint64_t, uint64_t> over unordered_map<uint64_t, unordered_set<uint64_t>> to optimize for mostly 1 target per call? multimap has multiple copies of the keys too, but map doesn't. The latter feels a bit cleaner.
> > > > 
> > > > Yeah, that's the point. Single targets account for about 95% of cases, and with a multimap we don't bother constructing a container for each of those targets. Duplicates are handled in `refineTailCallTargets`.
> > > > 
> > > > > The name is a bit confusing because this contains both the call source and target.
> > > > 
> > > > How about CallPairs or just CallEdges?
> > > > 
> > > > 
> > > > > I haven't thought through this carefully, but these data structures seem quite ad hoc in that they somewhat overlap with both CallAddressSet above and the data in the new tracker. I know that each serves a slightly different purpose, but it just feels a bit unorganized, and bolted on the way it is now.
> > > > 
> > > > Yeah, they are separated mostly for efficiency. An `unordered_map<uint64_t, unordered_set<uint64_t>>` would unify them, but at the expense of efficiency.
> > > There's also duplication between ProfiledBinary members and MissingFrameInferrer members. 
> > The MissingFrameInferrer members are mostly `BinaryFunction`-based while the ProfiledBinary members are address-based. The separation is to reduce the disassembly time by avoiding the address-to-BinaryFunction lookup there.
> > Yeah, that's the point. Single targets account for about 95% of cases, and with a multimap we don't bother constructing a container for each of those targets. Duplicates are handled in refineTailCallTargets.
> 
> You mentioned efficiency a few times here. I agree that unifying may have a (slight) cost. But I feel that the actual cost of inferring missing frames comes from the "query" or "infer" time, not the "setup" time here.
> 
> These things are hard to estimate exactly without measuring, but I would guess that changing the data structure to favor readability, simplicity, and structure would not visibly impact performance as long as it doesn't hurt the query path. For that reason, I feel that the optimization here could be premature at the expense of readability, and I'm still not convinced it is needed.
> 
> > How about CallPairs or just CallEdges? 
> 
> CallEdges sounds good. 
> 
> > The MissingFrameInferrer members are mostly BinaryFunction-based while the ProfiledBinary members are address-based. The separation is to reduce the disassembly time by avoiding the address-to-BinaryFunction lookup there.
> 
> The lookup should be quick, right? And shifting the lookup cost earlier (as opposed to later, in the inferrer) doesn't mean we're doing more work?
> 
> One other thing to consider is moving the data structures only needed by tail calls into the inferrer, as these are technically owned by the inferrer. Ideally, we want to reduce duplication, but if that's not possible, we can move the address-based map into the inferrer as well and have it populated at disassembly time. This way this seemingly similar data is at least centralized together.
OK, switched to using single maps, which appears to be about 3% slower than using multimaps. This pushes the existing 8% overhead up to 11%. Do we think it's a worthwhile trade-off for readability?
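
To make the trade-off discussed above concrete, here is a minimal, self-contained sketch. It is not the code in this patch; the type aliases, sample addresses, and the driver are made up purely for illustration. It contrasts the flat multimap layout with the map-of-sets layout for recording call edges (call-site address -> callee address).

  // Sketch only, not the code in D139367. Names and sample addresses are
  // made up for illustration.
  #include <cstdint>
  #include <iostream>
  #include <iterator>
  #include <unordered_map>
  #include <unordered_set>
  #include <utility>

  // Layout A: flat multimap. Cheap to populate during disassembly since no
  // per-key container is constructed, but it can hold duplicate
  // (site, target) pairs, which a later pass (refineTailCallTargets in the
  // patch) has to clean up.
  using FlatCallEdges = std::unordered_multimap<uint64_t, uint64_t>;

  // Layout B: map of sets. Slightly more setup work per call site, but
  // targets are deduplicated and grouped on insertion, so no refinement
  // pass is needed and a lookup returns a ready-made set.
  using GroupedCallEdges =
      std::unordered_map<uint64_t, std::unordered_set<uint64_t>>;

  int main() {
    const std::pair<uint64_t, uint64_t> Edges[] = {
        {0x1000, 0x2000}, {0x1010, 0x3000}, {0x1010, 0x3000}, {0x1020, 0x4000}};

    FlatCallEdges Flat;
    GroupedCallEdges Grouped;
    for (const auto &[Site, Target] : Edges) {
      Flat.emplace(Site, Target);   // duplicates are kept as-is
      Grouped[Site].insert(Target); // duplicates collapse on insert
    }

    // Querying the targets of a single call site under each layout.
    uint64_t Site = 0x1010;
    auto Range = Flat.equal_range(Site);
    std::cout << "flat: " << std::distance(Range.first, Range.second)
              << " entries\n";                                      // 2 (dup kept)
    std::cout << "grouped: " << Grouped[Site].size() << " entries\n"; // 1
    return 0;
  }

The multimap keeps the disassembly-time setup cheap when ~95% of sites have a single target, at the cost of a later dedup pass; the map of sets pays a small per-site cost up front, which would account for the ~3% difference mentioned above.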





Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D139367/new/

https://reviews.llvm.org/D139367


