[llvm] [CSSPGO] Error out if the checksum mismatch is high (PR #84097)

via llvm-commits llvm-commits at lists.llvm.org
Tue Mar 5 16:10:27 PST 2024


llvmbot wrote:


<!--LLVM PR SUMMARY COMMENT-->

@llvm/pr-subscribers-llvm-transforms

Author: Lei Wang (wlei-llvm)

<details>
<summary>Changes</summary>

Error out the build if the checksum mismatch is extremely high, it's better to drop the profile rather than apply the bad profile. 
Note that the check is on a module level, the user could make big changes to functions in one single module but those changes might not be performance significant to the whole binary, so we want to be conservative, only expect to catch big perf regression. To do this, we select a set of the "hot" functions for the check. We use two parameter(`checksum-mismatch-func-hot-block-skip` and `checksum-mismatch-num-func-skip`) to control the function selection to make sure the selected are hot enough and the num of function is not small. 
Tuned the parameters on our internal services, it works to catch big perf regression due to the high mismatch .


---
Full diff: https://github.com/llvm/llvm-project/pull/84097.diff


2 Files Affected:

- (modified) llvm/lib/Transforms/IPO/SampleProfile.cpp (+81) 
- (added) llvm/test/Transforms/SampleProfile/pseudo-probe-profile-mismatch-error.ll (+5) 


``````````diff
diff --git a/llvm/lib/Transforms/IPO/SampleProfile.cpp b/llvm/lib/Transforms/IPO/SampleProfile.cpp
index ffc26f1a0a972d..9ec224bb32a551 100644
--- a/llvm/lib/Transforms/IPO/SampleProfile.cpp
+++ b/llvm/lib/Transforms/IPO/SampleProfile.cpp
@@ -234,6 +234,24 @@ static cl::opt<unsigned> ProfileICPRelativeHotnessSkip(
     cl::desc(
         "Skip relative hotness check for ICP up to given number of targets."));
 
+static cl::opt<unsigned> ChecksumMismatchFuncHotBlockSkip(
+    "checksum-mismatch-func-hot-block-skip", cl::Hidden, cl::init(100),
+    cl::desc("For checksum-mismatch error check, skip checking the function "
+             "whose num of hot(on average) blocks is smaller than the "
+             "given number."));
+
+static cl::opt<unsigned> ChecksumMismatchNumFuncSkip(
+    "checksum-mismatch-num-func-skip", cl::Hidden, cl::init(50),
+    cl::desc("For checksum-mismatch error check, skip the check if the total "
+             "number of selected function is smaller than the given number."));
+
+static cl::opt<unsigned> ChecksumMismatchErrorThreshold(
+    "checksum-mismatch-error-threshold", cl::Hidden, cl::init(80),
+    cl::desc(
+        "For checksum-mismatch error check, error out if the percentage of "
+        "function mismatched-checksum is higher than the given percentage "
+        "threshold"));
+
 static cl::opt<bool> CallsitePrioritizedInline(
     "sample-profile-prioritized-inline", cl::Hidden,
 
@@ -630,6 +648,8 @@ class SampleProfileLoader final : public SampleProfileLoaderBaseImpl<Function> {
   std::vector<Function *> buildFunctionOrder(Module &M, LazyCallGraph &CG);
   std::unique_ptr<ProfiledCallGraph> buildProfiledCallGraph(Module &M);
   void generateMDProfMetadata(Function &F);
+  bool errorIfHighChecksumMismatch(Module &M, ProfileSummaryInfo *PSI,
+                                   const SampleProfileMap &Profiles);
 
   /// Map from function name to Function *. Used to find the function from
   /// the function name. If the function name contains suffix, additional
@@ -2184,6 +2204,61 @@ bool SampleProfileLoader::doInitialization(Module &M,
   return true;
 }
 
+// Note that this is a module-level check. Even if one module is errored out,
+// the entire build will be errored out. However, the user could make big
+// changes to functions in single module but those changes might not be
+// performance significant to the whole binary. Therefore, we use a conservative
+// approach to make sure we only error out if it globally impacts the binary
+// performance. To achieve this, we use heuristics to select a reasonable
+// big set of functions that are supposed to be globally performance
+// significant, only compute and check the mismatch within those functions. The
+// function selection is based on two criteria: 1) The function is "hot" enough,
+// which is tuned by a hotness-based flag(ChecksumMismatchFuncHotBlockSkip). 2)
+// The num of function is large enough which is tuned by the
+// ChecksumMismatchNumFuncSkip flag.
+bool SampleProfileLoader::errorIfHighChecksumMismatch(
+    Module &M, ProfileSummaryInfo *PSI, const SampleProfileMap &Profiles) {
+  assert(FunctionSamples::ProfileIsProbeBased &&
+         "Only support for probe-based profile");
+  uint64_t TotalSelectedFunc = 0;
+  uint64_t NumMismatchedFunc = 0;
+  for (const auto &I : Profiles) {
+    const auto &FS = I.second;
+    const auto *FuncDesc = ProbeManager->getDesc(FS.getGUID());
+    if (!FuncDesc)
+      continue;
+
+    // We want to select a set of functions that are globally performance
+    // significant, in other words, if those functions profiles are
+    // checksum-mismatched and dropped, the whole binary will likely be
+    // impacted, so here we use a hotness-based threshold to control the
+    // selection.
+    if (FS.getTotalSamples() <
+        ChecksumMismatchFuncHotBlockSkip * PSI->getOrCompHotCountThreshold())
+      continue;
+
+    TotalSelectedFunc++;
+    if (ProbeManager->profileIsHashMismatched(*FuncDesc, FS))
+      NumMismatchedFunc++;
+  }
+  // Make sure that the num of selected function is not too small to distinguish
+  // from the user's benign changes.
+  if (TotalSelectedFunc < ChecksumMismatchNumFuncSkip)
+    return false;
+
+  // Finally check the mismatch percentage against the threshold.
+  if (NumMismatchedFunc * 100 >=
+      TotalSelectedFunc * ChecksumMismatchErrorThreshold) {
+    auto &Ctx = M.getContext();
+    const char *Msg =
+        "The FDO profile is too old and will cause big performance regression, "
+        "please drop the profile and collect a new one.";
+    Ctx.diagnose(DiagnosticInfoSampleProfile(M.getModuleIdentifier(), Msg));
+    return true;
+  }
+  return false;
+}
+
 void SampleProfileMatcher::findIRAnchors(
     const Function &F, std::map<LineLocation, StringRef> &IRAnchors) {
   // For inlined code, recover the original callsite and callee by finding the
@@ -2715,6 +2790,12 @@ bool SampleProfileLoader::runOnModule(Module &M, ModuleAnalysisManager *AM,
                         ProfileSummary::PSK_Sample);
     PSI->refresh();
   }
+
+  // Error out if the profile checksum mismatch is too high.
+  if (FunctionSamples::ProfileIsProbeBased &&
+      errorIfHighChecksumMismatch(M, PSI, Reader->getProfiles()))
+    return false;
+
   // Compute the total number of samples collected in this profile.
   for (const auto &I : Reader->getProfiles())
     TotalCollectedSamples += I.second.getTotalSamples();
diff --git a/llvm/test/Transforms/SampleProfile/pseudo-probe-profile-mismatch-error.ll b/llvm/test/Transforms/SampleProfile/pseudo-probe-profile-mismatch-error.ll
new file mode 100644
index 00000000000000..84a2a0ff2eb6ad
--- /dev/null
+++ b/llvm/test/Transforms/SampleProfile/pseudo-probe-profile-mismatch-error.ll
@@ -0,0 +1,5 @@
+; REQUIRES: x86_64-linux
+; RUN: opt < %S/pseudo-probe-profile-mismatch.ll -passes=sample-profile -sample-profile-file=%S/Inputs/pseudo-probe-profile-mismatch.prof -checksum-mismatch-func-hot-block-skip=0 -checksum-mismatch-num-func-skip=1 -checksum-mismatch-error-threshold=1 -S 2>%t -o %t.ll
+; RUN: FileCheck %s --input-file %t
+
+; CHECK: error: [[*]]: The FDO profile is too old and will cause big performance regression, please drop the profile and collect a new one.

``````````

</details>


https://github.com/llvm/llvm-project/pull/84097


More information about the llvm-commits mailing list