[PATCH] D123271: [llvm-profgen] Filter out invalid LBR ranges.

Thu Apr 7 16:47:59 PDT 2022

wenlei added inline comments.

================
Comment at: llvm/tools/llvm-profgen/PerfReader.cpp:304

-    if (isCallState(State)) {
+    if (isInvalidState(State)) {
+      // Skip unwinding the rest of LBR trace. Reset the stack and update the
----------------
nit: put this towards the end of the if-else sequence as it's supposed to be a corner case. 

```
} else if (isValidState(State)) {
    // Unwind branches
    ...
   unwindBranch(State);
}
else {
    // Invalid state
}

```

================
Comment at: llvm/tools/llvm-profgen/PerfReader.cpp:305-308
+      // Skip unwinding the rest of LBR trace. Reset the stack and update the
+      // state so that the rest of the trace can still be processed in a
+      // context-insensitive way, i.e, all ranges will be counted towards
+      // the root context.
----------------
> all ranges will be counted towards the root context.

When we skip unwinding, we still have the context carried with probe/dwarf, so it's not always going to be base context, right? 

================
Comment at: llvm/tools/llvm-profgen/PerfReader.cpp:286-287

+  // Skip unwinding the rest of LBR trace when a bogus range is seen.
+  bool SkipUnwinding = false;
+
----------------
hoy wrote:
> wenlei wrote:
> > Is it critical to skip unwinding instead of ignore the entire record for invalid range? If not, perhaps it's better to keep it simple and just ignore such record. I imagine such invalid ranges are rare. 
> Sort of. With one of internal large services we have about 5% of such invalid ranges. Ignoring the whole trace where such ranges are in may not be trivial. 
> 
>    warning: 5.13%(77654546/1512342539) of profiled ranges is bogus.
Is this happening for just one workload or more often across different workloads? 

================
Comment at: llvm/tools/llvm-profgen/PerfReader.cpp:1202
         !Binary->offsetIsTransfer(EndOffset)) {
-      InstNotBoundary++;
+      InstNotBoundary += I.second;
       WarnInvalidRange(StartOffset, EndOffset, EndNotBoundaryMsg);
----------------
hoy wrote:
> wenlei wrote:
> > I think the choice to use unique ranges was intentional. What motivated this change?  
> Can you remind me of why it was intentional?
> 
> I think it's useful to give a number of how many samples are invalid, though some of them could be the same. This reflects the portion of samples will be dropped.
How many unique instance of one issue is an indicator of how wide spread such issue is. The way we report these stats are also aligned with how it was implemented - one "profiled range" is one unique range no matter that count it has. 

In this case, there's 5% by your counting, what's the % with original counting?

================
Comment at: llvm/tools/llvm-profgen/PerfReader.cpp:1218-1221
+    if (StartOffset > EndOffset) {
+      BogusRange += I.second;
+      WarnInvalidRange(StartOffset, EndOffset, BogusRangeMsg);
+    }
----------------
hoy wrote:
> wenlei wrote:
> > Such invalid ranges doesn't seem to be recorded anyways, how do we get those ranges? 
> Right, they'll be abandoned. The warning here is to tell how many such samples would be dropped so we kinda getting a sense how important the issue is.
My question was basically this - if such ranges are dropped earlier, how do we even see ranges whose start > end here?

Looked into this, I guess the answer is we warn on those before they are dropped?  

================
Comment at: llvm/tools/llvm-profgen/PerfReader.h:297
   void advanceLBR() { LBRIndex++; }
-  ProfiledFrame *getParentFrame() { return CurrentLeafFrame->Parent; }
+  ProfiledFrame *getParentFrame() {
+    return CurrentLeafFrame == &DummyTrieRoot ? CurrentLeafFrame
----------------
hoy wrote:
> wenlei wrote:
> > What motivated this change? Conceptually parent frame of dummy root should indeed be nullptr.
> It is to handle the diversion of probe samples and dwarf samples. With probe, LBR ranges are reported towards its parent frame (calling context), while w/o probe, ranges are reported towards the leaf frame. See `VirtualUnwinder::unwindLinear`. When unwinding is at a bad state, the call stack is reset to `DummyTrieRoot`, thus leading to a null parent technically for the probe case which will cause AV here and there.
How many call sites need to guard against this? As an API, I think `getParentFrame` better strictly do what it says if possible. But if that means sprinkling a check in a dozen places, maybe it's ok to leave it in this state.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D123271/new/

https://reviews.llvm.org/D123271