[llvm] 4357824 - [llvm-profdata] ProfileReader cleanup - preparation for MD5 refactoring

William Huang via llvm-commits llvm-commits at lists.llvm.org
Fri May 5 17:21:17 PDT 2023


Author: William Huang
Date: 2023-05-06T00:21:03Z
New Revision: 4357824c63e8bbdb7748255389ea0bc104b4a799

URL: https://github.com/llvm/llvm-project/commit/4357824c63e8bbdb7748255389ea0bc104b4a799
DIFF: https://github.com/llvm/llvm-project/commit/4357824c63e8bbdb7748255389ea0bc104b4a799.diff

LOG: [llvm-profdata] ProfileReader cleanup - preparation for MD5 refactoring

Cleanup profile reader classes to prepare for complex refactoring as propsed in D147740 (Use MD5 as key for profile map). Change is too complicated so I am cleaning up the reader implementation first with these goals.
-  Reduce duplicated/similar logic
-  Reduce virtual functions, changing them to non-virtual
-  Reduce unnecessry checks, indirections, and dead writes.

This is patch 1/n. This patch refactors NameTable

Explaining several decisions here

1. useMD5() means whether  names of the profiles (the ProfileMap) are represented as MD5. It is NOT whether the input profile format is MD5. This function is an interface for IPO passes to decide whether to match function names or function MD5. There are two motives here:
(a) Eventually we want to use MD5 to represent all function contexts because it is much faster to use it as a key for lookup tables (prototype implementation D147740), so in compilation mode we call setProfileUseMD5() to force use MD5. While in tools mode (llvm-profdata) we want to keep the function name info if it's in the input profile.
(b) We also propose to allow multiple name tables and profile sections in ExtBinary format, and it could consist of name tables with or without using MD5, in this case MD5 prevails and other name tables are converted to MD5.

2. MD5 handling logic is pushed up to BinaryReader base class, because this trades a non-devirtualized virtual function call with a predictable branch. ReadStringFromTable() accounts for >5% time when loading a full 1 GB profile, it should not be virtual.

Reviewed By: davidxl

Differential Revision: https://reviews.llvm.org/D148868

Added: 
    llvm/test/tools/llvm-profdata/Inputs/sample-multiple-nametables.profdata
    llvm/test/tools/llvm-profdata/Inputs/sample-nametable-after-samples.profdata
    llvm/test/tools/llvm-profdata/Inputs/sample-nametable-empty-string.profdata
    llvm/test/tools/llvm-profdata/sample-nametable.test

Modified: 
    llvm/include/llvm/ProfileData/SampleProfReader.h
    llvm/lib/ProfileData/SampleProfReader.cpp

Removed: 
    


################################################################################
diff  --git a/llvm/include/llvm/ProfileData/SampleProfReader.h b/llvm/include/llvm/ProfileData/SampleProfReader.h
index 3b657b3ab4621..1849417f31ee2 100644
--- a/llvm/include/llvm/ProfileData/SampleProfReader.h
+++ b/llvm/include/llvm/ProfileData/SampleProfReader.h
@@ -493,7 +493,11 @@ class SampleProfileReader {
   virtual bool dumpSectionInfo(raw_ostream &OS = dbgs()) { return false; };
 
   /// Return whether names in the profile are all MD5 numbers.
-  virtual bool useMD5() { return false; }
+  bool useMD5() const { return ProfileIsMD5; }
+
+  /// Force the profile to use MD5 in Sample contexts, even if function names
+  /// are present.
+  virtual void setProfileUseMD5() { ProfileIsMD5 = true; }
 
   /// Don't read profile without context if the flag is set. This is only meaningful
   /// for ExtBinary format.
@@ -563,6 +567,10 @@ class SampleProfileReader {
   /// Zero out the discriminator bits higher than bit MaskedBitFrom (0 based).
   /// The default is to keep all the bits.
   uint32_t MaskedBitFrom = 31;
+
+  /// Whether the profile uses MD5 for Sample Contexts and function names. This
+  /// can be one-way overriden by the user to force use MD5.
+  bool ProfileIsMD5 = false;
 };
 
 class SampleProfileReaderText : public SampleProfileReader {
@@ -579,6 +587,9 @@ class SampleProfileReaderText : public SampleProfileReader {
   /// Return true if \p Buffer is in the format supported by this class.
   static bool hasFormat(const MemoryBuffer &Buffer);
 
+  /// Text format sample profile does not support MD5 for now.
+  void setProfileUseMD5() override {}
+
 private:
   /// CSNameTable is used to save full context vectors. This serves as an
   /// underlying immutable buffer for all clients.
@@ -641,7 +652,10 @@ class SampleProfileReaderBinary : public SampleProfileReader {
   std::error_code readSummary();
 
   /// Read the whole name table.
-  virtual std::error_code readNameTable();
+  std::error_code readNameTable();
+
+  /// Read a string indirectly via the name table.
+  ErrorOr<StringRef> readStringFromTable();
 
   /// Points to the current location in the buffer.
   const uint8_t *Data = nullptr;
@@ -652,8 +666,18 @@ class SampleProfileReaderBinary : public SampleProfileReader {
   /// Function name table.
   std::vector<StringRef> NameTable;
 
-  /// Read a string indirectly via the name table.
-  virtual ErrorOr<StringRef> readStringFromTable();
+  /// If MD5 is used in NameTable section, the section saves uint64_t data.
+  /// The uint64_t data has to be converted to a string and then the string
+  /// will be used to initialize StringRef in NameTable.
+  /// Note NameTable contains StringRef so it needs another buffer to own
+  /// the string data. MD5StringBuf serves as the string buffer that is
+  /// referenced by NameTable (vector of StringRef). We make sure
+  /// the lifetime of MD5StringBuf is not shorter than that of NameTable.
+  std::vector<std::string> MD5StringBuf;
+
+  /// The starting address of NameTable containing fixed length MD5.
+  const uint8_t *MD5NameMemStart = nullptr;
+
   virtual ErrorOr<SampleContext> readSampleContextFromTable();
 
 private:
@@ -712,8 +736,7 @@ class SampleProfileReaderExtBinaryBase : public SampleProfileReaderBinary {
                                    FunctionSamples *FProfile);
   std::error_code readFuncOffsetTable();
   std::error_code readFuncProfiles();
-  std::error_code readMD5NameTable();
-  std::error_code readNameTableSec(bool IsMD5);
+  std::error_code readNameTableSec(bool IsMD5, bool FixedLengthMD5);
   std::error_code readCSNameTableSec();
   std::error_code readProfileSymbolList();
 
@@ -723,7 +746,6 @@ class SampleProfileReaderExtBinaryBase : public SampleProfileReaderBinary {
                                          const SecHdrTableEntry &Entry);
   // placeholder for subclasses to dispatch their own section readers.
   virtual std::error_code readCustomSection(const SecHdrTableEntry &Entry) = 0;
-  ErrorOr<StringRef> readStringFromTable() override;
   ErrorOr<SampleContext> readSampleContextFromTable() override;
   ErrorOr<SampleContextFrames> readContextFromTable();
 
@@ -740,21 +762,6 @@ class SampleProfileReaderExtBinaryBase : public SampleProfileReaderBinary {
   /// The set containing the functions to use when compiling a module.
   DenseSet<StringRef> FuncsToUse;
 
-  /// Use fixed length MD5 instead of ULEB128 encoding so NameTable doesn't
-  /// need to be read in up front and can be directly accessed using index.
-  bool FixedLengthMD5 = false;
-  /// The starting address of NameTable containing fixed length MD5.
-  const uint8_t *MD5NameMemStart = nullptr;
-
-  /// If MD5 is used in NameTable section, the section saves uint64_t data.
-  /// The uint64_t data has to be converted to a string and then the string
-  /// will be used to initialize StringRef in NameTable.
-  /// Note NameTable contains StringRef so it needs another buffer to own
-  /// the string data. MD5StringBuf serves as the string buffer that is
-  /// referenced by NameTable (vector of StringRef). We make sure
-  /// the lifetime of MD5StringBuf is not shorter than that of NameTable.
-  std::unique_ptr<std::vector<std::string>> MD5StringBuf;
-
   /// CSNameTable is used to save full context vectors. This serves as an
   /// underlying immutable buffer for all clients.
   std::unique_ptr<const std::vector<SampleContextFrameVector>> CSNameTable;
@@ -783,9 +790,6 @@ class SampleProfileReaderExtBinaryBase : public SampleProfileReaderBinary {
   /// the reader has been given a module.
   bool collectFuncsFromModule() override;
 
-  /// Return whether names in the profile are all MD5 numbers.
-  bool useMD5() override { return MD5StringBuf.get(); }
-
   std::unique_ptr<ProfileSymbolList> getProfileSymbolList() override {
     return std::move(ProfSymList);
   };

diff  --git a/llvm/lib/ProfileData/SampleProfReader.cpp b/llvm/lib/ProfileData/SampleProfReader.cpp
index 6f5dedb26b1aa..977ea90039b00 100644
--- a/llvm/lib/ProfileData/SampleProfReader.cpp
+++ b/llvm/lib/ProfileData/SampleProfReader.cpp
@@ -530,7 +530,20 @@ ErrorOr<StringRef> SampleProfileReaderBinary::readStringFromTable() {
   if (std::error_code EC = Idx.getError())
     return EC;
 
-  return NameTable[*Idx];
+  // Lazy loading, if the string has not been materialized from memory storing
+  // MD5 values, then it is default initialized with the null pointer. This can
+  // only happen when using fixed length MD5, that bounds check is performed
+  // while parsing the name table to ensure MD5NameMemStart points to an array
+  // with enough MD5 entries.
+  StringRef &SR = NameTable[*Idx];
+  if (!SR.data()) {
+    assert(MD5NameMemStart);
+    using namespace support;
+    uint64_t FID = endian::read<uint64_t, little, unaligned>(
+       MD5NameMemStart + (*Idx) * sizeof(uint64_t));
+    SR = MD5StringBuf.emplace_back(std::to_string(FID));
+  }
+  return SR;
 }
 
 ErrorOr<SampleContext> SampleProfileReaderBinary::readSampleContextFromTable() {
@@ -540,34 +553,6 @@ ErrorOr<SampleContext> SampleProfileReaderBinary::readSampleContextFromTable() {
   return SampleContext(*FName);
 }
 
-ErrorOr<StringRef> SampleProfileReaderExtBinaryBase::readStringFromTable() {
-  if (!FixedLengthMD5)
-    return SampleProfileReaderBinary::readStringFromTable();
-
-  // read NameTable index.
-  auto Idx = readStringIndex(NameTable);
-  if (std::error_code EC = Idx.getError())
-    return EC;
-
-  // Check whether the name to be accessed has been accessed before,
-  // if not, read it from memory directly.
-  StringRef &SR = NameTable[*Idx];
-  if (SR.empty()) {
-    const uint8_t *SavedData = Data;
-    Data = MD5NameMemStart + ((*Idx) * sizeof(uint64_t));
-    auto FID = readUnencodedNumber<uint64_t>();
-    if (std::error_code EC = FID.getError())
-      return EC;
-    // Save the string converted from uint64_t in MD5StringBuf. All the
-    // references to the name are all StringRefs refering to the string
-    // in MD5StringBuf.
-    MD5StringBuf->push_back(std::to_string(*FID));
-    SR = MD5StringBuf->back();
-    Data = SavedData;
-  }
-  return SR;
-}
-
 std::error_code
 SampleProfileReaderBinary::readProfile(FunctionSamples &FProfile) {
   auto NumSamples = readNumber<uint64_t>();
@@ -729,14 +714,15 @@ std::error_code SampleProfileReaderExtBinaryBase::readOneSection(
       FunctionSamples::ProfileIsFS = ProfileIsFS = true;
     break;
   case SecNameTable: {
-    FixedLengthMD5 =
+    bool FixedLengthMD5 =
         hasSecFlag(Entry, SecNameTableFlags::SecFlagFixedLengthMD5);
     bool UseMD5 = hasSecFlag(Entry, SecNameTableFlags::SecFlagMD5Name);
-    assert((!FixedLengthMD5 || UseMD5) &&
-           "If FixedLengthMD5 is true, UseMD5 has to be true");
+    // UseMD5 means if THIS section uses MD5, ProfileIsMD5 means if the entire
+    // profile uses MD5 for function name matching in IPO passes.
+    ProfileIsMD5 = ProfileIsMD5 || UseMD5;
     FunctionSamples::HasUniqSuffix =
         hasSecFlag(Entry, SecNameTableFlags::SecFlagUniqSuffix);
-    if (std::error_code EC = readNameTableSec(UseMD5))
+    if (std::error_code EC = readNameTableSec(UseMD5, FixedLengthMD5))
       return EC;
     break;
   }
@@ -1020,50 +1006,77 @@ std::error_code SampleProfileReaderBinary::readNameTable() {
   auto Size = readNumber<size_t>();
   if (std::error_code EC = Size.getError())
     return EC;
-  NameTable.reserve(*Size + NameTable.size());
+
+  // Normally if useMD5 is true, the name table should have MD5 values, not
+  // strings, however in the case that ExtBinary profile has multiple name
+  // tables mixing string and MD5, all of them have to be normalized to use MD5,
+  // because optimization passes can only handle either type.
+  bool UseMD5 = useMD5();
+  if (UseMD5)
+    MD5StringBuf.reserve(MD5StringBuf.size() + *Size);
+
+  NameTable.clear();
+  NameTable.reserve(*Size);
   for (size_t I = 0; I < *Size; ++I) {
     auto Name(readString());
     if (std::error_code EC = Name.getError())
       return EC;
-    NameTable.push_back(*Name);
+    if (UseMD5) {
+      uint64_t FID = MD5Hash(*Name);
+      NameTable.emplace_back(MD5StringBuf.emplace_back(std::to_string(FID)));
+    } else
+      NameTable.push_back(*Name);
   }
 
   return sampleprof_error::success;
 }
 
-std::error_code SampleProfileReaderExtBinaryBase::readMD5NameTable() {
-  auto Size = readNumber<uint64_t>();
-  if (std::error_code EC = Size.getError())
-    return EC;
-  MD5StringBuf = std::make_unique<std::vector<std::string>>();
-  MD5StringBuf->reserve(*Size);
+std::error_code
+SampleProfileReaderExtBinaryBase::readNameTableSec(bool IsMD5,
+                                                   bool FixedLengthMD5) {
   if (FixedLengthMD5) {
+    if (IsMD5)
+      errs() << "If FixedLengthMD5 is true, UseMD5 has to be true";
+    auto Size = readNumber<size_t>();
+    if (std::error_code EC = Size.getError())
+      return EC;
+
+    assert(Data + (*Size) * sizeof(uint64_t) == End &&
+           "Fixed length MD5 name table does not contain specified number of "
+           "entries");
+    if (Data + (*Size) * sizeof(uint64_t) > End)
+      return sampleprof_error::truncated;
+
     // Preallocate and initialize NameTable so we can check whether a name
     // index has been read before by checking whether the element in the
     // NameTable is empty, meanwhile readStringIndex can do the boundary
     // check using the size of NameTable.
-    NameTable.resize(*Size + NameTable.size());
-
+    MD5StringBuf.reserve(MD5StringBuf.size() + *Size);
+    NameTable.clear();
+    NameTable.resize(*Size);
     MD5NameMemStart = Data;
     Data = Data + (*Size) * sizeof(uint64_t);
     return sampleprof_error::success;
   }
-  NameTable.reserve(*Size);
-  for (uint64_t I = 0; I < *Size; ++I) {
-    auto FID = readNumber<uint64_t>();
-    if (std::error_code EC = FID.getError())
+
+  if (IsMD5) {
+    assert(!FixedLengthMD5 && "FixedLengthMD5 should be unreachable here");
+    auto Size = readNumber<size_t>();
+    if (std::error_code EC = Size.getError())
       return EC;
-    MD5StringBuf->push_back(std::to_string(*FID));
-    // NameTable is a vector of StringRef. Here it is pushing back a
-    // StringRef initialized with the last string in MD5stringBuf.
-    NameTable.push_back(MD5StringBuf->back());
+
+    MD5StringBuf.reserve(MD5StringBuf.size() + *Size);
+    NameTable.clear();
+    NameTable.reserve(*Size);
+    for (size_t I = 0; I < *Size; ++I) {
+      auto FID = readNumber<uint64_t>();
+      if (std::error_code EC = FID.getError())
+        return EC;
+      NameTable.emplace_back(MD5StringBuf.emplace_back(std::to_string(*FID)));
+    }
+    return sampleprof_error::success;
   }
-  return sampleprof_error::success;
-}
 
-std::error_code SampleProfileReaderExtBinaryBase::readNameTableSec(bool IsMD5) {
-  if (IsMD5)
-    return readMD5NameTable();
   return SampleProfileReaderBinary::readNameTable();
 }
 

diff  --git a/llvm/test/tools/llvm-profdata/Inputs/sample-multiple-nametables.profdata b/llvm/test/tools/llvm-profdata/Inputs/sample-multiple-nametables.profdata
new file mode 100644
index 0000000000000..0b31d386e8ce2
Binary files /dev/null and b/llvm/test/tools/llvm-profdata/Inputs/sample-multiple-nametables.profdata 
diff er

diff  --git a/llvm/test/tools/llvm-profdata/Inputs/sample-nametable-after-samples.profdata b/llvm/test/tools/llvm-profdata/Inputs/sample-nametable-after-samples.profdata
new file mode 100644
index 0000000000000..699972870dae4
Binary files /dev/null and b/llvm/test/tools/llvm-profdata/Inputs/sample-nametable-after-samples.profdata 
diff er

diff  --git a/llvm/test/tools/llvm-profdata/Inputs/sample-nametable-empty-string.profdata b/llvm/test/tools/llvm-profdata/Inputs/sample-nametable-empty-string.profdata
new file mode 100644
index 0000000000000..16f9b31f95eaa
Binary files /dev/null and b/llvm/test/tools/llvm-profdata/Inputs/sample-nametable-empty-string.profdata 
diff er

diff  --git a/llvm/test/tools/llvm-profdata/sample-nametable.test b/llvm/test/tools/llvm-profdata/sample-nametable.test
new file mode 100644
index 0000000000000..1d59b9fa43725
--- /dev/null
+++ b/llvm/test/tools/llvm-profdata/sample-nametable.test
@@ -0,0 +1,12 @@
+Test several edge cases with unusual name table data in ExtBinary format.
+
+1- Multiple fixed-length MD5 name tables. Reading a new table should clear the content from old table, and a valid name index for the old name table should become invalid if the new name table has fewer entries.
+RUN: not llvm-profdata show --sample %p/Inputs/sample-multiple-nametables.profdata
+
+2- Multiple name tables, the first one has an empty string, the second one tricks the reader into expecting fixed-length MD5 values. Reader should not attempt "lazy loading" of the MD5 string in this case.
+RUN: not llvm-profdata show --sample %p/Inputs/sample-nametable-empty-string.profdata
+
+3- The data of the name table is placed after the data of the profiles. The reader should handle it correctly.
+RUN: llvm-profdata merge --sample --text %p/Inputs/sample-nametable-after-samples.profdata | FileCheck %s
+CHECK: 18446744073709551615:2:9
+


        


More information about the llvm-commits mailing list