[llvm] [llvm-profdata] Remove MD5 collision check in D147740 (PR #66544)

William Junda Huang via llvm-commits llvm-commits at lists.llvm.org
Fri Sep 15 13:25:16 PDT 2023


https://github.com/huangjd created https://github.com/llvm/llvm-project/pull/66544

This is the patch at https://reviews.llvm.org/D153692, migrating to Github

After testing D147740 with multiple industrial projects with ~10 million FunctionSamples, no MD5 collision has been found. In perfect hashing, the probability of collision for N symbols over K possible hash value is 1 - K!/((K-N)! * K^N). When N is 1 million and K is 2^64, the probability is 3*10^-8, when N is 10 million the probability is 3*10^-6, so we are probably not going to find an actual case in real world application. (However if K is 2^32, the probability of collision is almost 1, this is indeed a problem, if anyone still use a large profile on 32-bit machine, as hash_code is tied to size_t). Furthermore, when a collision happens we can't do anything to recover it, unless using a multi-map, but that is significantly slower, which contradicts the purpose of optimizing the profile reader. One more thing, since we have been using profiles with MD5 names, and they have to be coming from non-MD5 sources, so if hash collision is to happen, it already happened when we convert a non-MD5 profile to a MD5 one, so there's no point to check for that in the reader, and this feature can be removed.

>From b6bfc6127925a66029227b1d83a2b0ae890c1b08 Mon Sep 17 00:00:00 2001
From: William Huang <williamjhuang at google.com>
Date: Fri, 15 Sep 2023 20:20:16 +0000
Subject: [PATCH] [llvm-profdata] Remove MD5 collision check in D147740

This is the patch at https://reviews.llvm.org/D153692, migrating to
Github

After testing D147740 with multiple industrial projects with ~10 million FunctionSamples, no MD5 collision has been found.
In perfect hashing, the probability of collision for N symbols over K possible hash value is 1 - K!/((K-N)! * K^N). When N is 1 million and K is 2^64, the probability is 3*10^-8, when N is 10 million the probability is 3*10^-6, so we are probably not going to find an actual case in real world application. (However if K is 2^32, the probability of collision is almost 1, this is indeed a problem, if anyone still use a large profile on 32-bit machine, as hash_code is tied to size_t).
Furthermore, when a collision happens we can't do anything to recover it, unless using a multi-map, but that is significantly slower, which contradicts the purpose of optimizing the profile reader.
One more thing, since we have been using profiles with MD5 names, and they have to be coming from non-MD5 sources, so if hash collision is to happen, it already happened when we convert a non-MD5 profile to a MD5 one, so there's no point to check for that in the reader, and this feature can be removed.
---
 llvm/include/llvm/ProfileData/SampleProf.h    |  68 ++-----
 .../tools/llvm-profdata/CMakeLists.txt        |   1 -
 .../tools/llvm-profdata/MD5CollisionTest.cpp  | 166 ------------------
 3 files changed, 10 insertions(+), 225 deletions(-)
 delete mode 100644 llvm/unittests/tools/llvm-profdata/MD5CollisionTest.cpp

diff --git a/llvm/include/llvm/ProfileData/SampleProf.h b/llvm/include/llvm/ProfileData/SampleProf.h
index 3cbfa94896e3a24..181ddac0daf91af 100644
--- a/llvm/include/llvm/ProfileData/SampleProf.h
+++ b/llvm/include/llvm/ProfileData/SampleProf.h
@@ -1299,19 +1299,13 @@ raw_ostream &operator<<(raw_ostream &OS, const FunctionSamples &FS);
 /// performance of insert and query operations especially when hash values of
 /// keys are available a priori, and reduces memory usage if KeyT has a large
 /// size.
-/// When performing any action, if an existing entry with a given key is found,
-/// and the interface "KeyT ValueT::getKey<KeyT>() const" to retrieve a value's
-/// original key exists, this class checks if the given key actually matches
-/// the existing entry's original key. If they do not match, this class behaves
-/// as if the entry did not exist (for insertion, this means the new value will
-/// replace the existing entry's value, as if it is newly inserted). If
-/// ValueT::getKey<KeyT>() is not available, all keys with the same hash value
-/// are considered equivalent (i.e. hash collision is silently ignored). Given
-/// such feature this class should only be used where it does not affect
-/// compilation correctness, for example, when loading a sample profile.
+/// All keys with the same hash value are considered equivalent (i.e. hash
+/// collision is silently ignored). Given such feature this class should only be
+/// used where it does not affect compilation correctness, for example, when
+/// loading a sample profile.
 /// Assuming the hashing algorithm is uniform, the probability of hash collision
 /// with 1,000,000 entries is
-/// (2^64)!/((2^64-1000000)!*(2^64)^1000000) ~= 3*10^-8.
+/// 1 - (2^64)!/((2^64-1000000)!*(2^64)^1000000) ~= 3*10^-8.
 template <template <typename, typename, typename...> typename MapT,
           typename KeyT, typename ValueT, typename... MapTArgs>
 class HashKeyMap : public MapT<hash_code, ValueT, MapTArgs...> {
@@ -1325,42 +1319,12 @@ class HashKeyMap : public MapT<hash_code, ValueT, MapTArgs...> {
   using iterator = typename base_type::iterator;
   using const_iterator = typename base_type::const_iterator;
 
-private:
-  // If the value type has getKey(), retrieve its original key for comparison.
-  template <typename U = mapped_type,
-            typename = decltype(U().template getKey<original_key_type>())>
-  static bool
-  CheckKeyMatch(const original_key_type &Key, const mapped_type &ExistingValue,
-                original_key_type *ExistingKeyIfDifferent = nullptr) {
-    const original_key_type &ExistingKey =
-        ExistingValue.template getKey<original_key_type>();
-    bool Result = (Key == ExistingKey);
-    if (!Result && ExistingKeyIfDifferent)
-      *ExistingKeyIfDifferent = ExistingKey;
-    return Result;
-  }
-
-  // If getKey() does not exist, this overload is selected, which assumes all
-  // keys with the same hash are equivalent.
-  static bool CheckKeyMatch(...) { return true; }
-
-public:
   template <typename... Ts>
   std::pair<iterator, bool> try_emplace(const key_type &Hash,
                                         const original_key_type &Key,
                                         Ts &&...Args) {
     assert(Hash == hash_value(Key));
-    auto Ret = base_type::try_emplace(Hash, std::forward<Ts>(Args)...);
-    if (!Ret.second) {
-      original_key_type ExistingKey;
-      if (LLVM_UNLIKELY(!CheckKeyMatch(Key, Ret.first->second, &ExistingKey))) {
-        dbgs() << "MD5 collision detected: " << Key << " and " << ExistingKey
-               << " has same hash value " << Hash << "\n";
-        Ret.second = true;
-        Ret.first->second = mapped_type(std::forward<Ts>(Args)...);
-      }
-    }
-    return Ret;
+    return base_type::try_emplace(Hash, std::forward<Ts>(Args)...);
   }
 
   template <typename... Ts>
@@ -1382,8 +1346,7 @@ class HashKeyMap : public MapT<hash_code, ValueT, MapTArgs...> {
     key_type Hash = hash_value(Key);
     auto It = base_type::find(Hash);
     if (It != base_type::end())
-      if (LLVM_LIKELY(CheckKeyMatch(Key, It->second)))
-        return It;
+      return It;
     return base_type::end();
   }
 
@@ -1391,8 +1354,7 @@ class HashKeyMap : public MapT<hash_code, ValueT, MapTArgs...> {
     key_type Hash = hash_value(Key);
     auto It = base_type::find(Hash);
     if (It != base_type::end())
-      if (LLVM_LIKELY(CheckKeyMatch(Key, It->second)))
-        return It;
+      return It;
     return base_type::end();
   }
 
@@ -1433,19 +1395,9 @@ class SampleProfileMap
         Ctx);
   }
 
-  // Overloaded find() to lookup a function by name. This is called by IPO
-  // passes with an actual function name, and it is possible that the profile
-  // reader converted function names in the profile to MD5 strings, so we need
-  // to check if either representation matches.
+  // Overloaded find() to lookup a function by name.
   iterator find(StringRef Fname) {
-    uint64_t Hash = hashFuncName(Fname);
-    auto It = base_type::find(hash_code(Hash));
-    if (It != end()) {
-      StringRef CtxName = It->second.getContext().getName();
-      if (LLVM_LIKELY(CtxName == Fname || CtxName == std::to_string(Hash)))
-        return It;
-    }
-    return end();
+    return base_type::find(hashFuncName(Fname));
   }
 
   size_t erase(const SampleContext &Ctx) {
diff --git a/llvm/unittests/tools/llvm-profdata/CMakeLists.txt b/llvm/unittests/tools/llvm-profdata/CMakeLists.txt
index c53baf69bbc0ef8..ad91ce36bcb5bef 100644
--- a/llvm/unittests/tools/llvm-profdata/CMakeLists.txt
+++ b/llvm/unittests/tools/llvm-profdata/CMakeLists.txt
@@ -6,7 +6,6 @@ set(LLVM_LINK_COMPONENTS
 
 add_llvm_unittest(LLVMProfdataTests
     OutputSizeLimitTest.cpp
-    MD5CollisionTest.cpp
   )
 
 target_link_libraries(LLVMProfdataTests PRIVATE LLVMTestingSupport)
diff --git a/llvm/unittests/tools/llvm-profdata/MD5CollisionTest.cpp b/llvm/unittests/tools/llvm-profdata/MD5CollisionTest.cpp
deleted file mode 100644
index 8af74249188dd14..000000000000000
--- a/llvm/unittests/tools/llvm-profdata/MD5CollisionTest.cpp
+++ /dev/null
@@ -1,166 +0,0 @@
-//===- llvm/unittests/tools/llvm-profdata/MD5CollisionTest.cpp ------------===//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-
-/// Test whether the MD5-key SampleProfileMap can handle collision correctly.
-/// Probability of collision is rare but not negligible since we only use the
-/// lower 64 bits of the MD5 value. A unit test is required because the function
-/// names are not printable ASCII characters.
-
-#include "llvm/ProfileData/SampleProfReader.h"
-#include "llvm/Support/VirtualFileSystem.h"
-#include "llvm/Testing/Support/Error.h"
-#include "gtest/gtest.h"
-
-/// According to https://en.wikipedia.org/wiki/MD5#Preimage_vulnerability, the
-/// MD5 of the two strings are 79054025255fb1a26e4bc422aef54eb4.
-
-// First 8 bytes of the MD5.
-const uint64_t ExpectedHash = 0xa2b15f2525400579;
-
-// clang-format off
-const uint8_t ProfileData[] = {
-    0x84, 0xe4, 0xd0, 0xb1, 0xf4, 0xc9, 0x94, 0xa8,
-    0x53, 0x67, 0x03, 0x00, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x7D, 0x00, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x03, 0x01, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x80, 0x01, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x05, 0x00, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x20, 0x00, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x90, 0x01, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x20, 0x00, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x00, 0x00, 0x00,
-
-    /// Name Table
-    0x02,
-    /// String1
-    0xd1, 0x31, 0xdd, 0x02, 0xc5, 0xe6, 0xee, 0xc4,
-    0x69, 0x3d, 0x9a, 0x06, 0x98, 0xaf, 0xf9, 0x5c,
-    0x2f, 0xca, 0xb5, 0x87, 0x12, 0x46, 0x7e, 0xab,
-    0x40, 0x04, 0x58, 0x3e, 0xb8, 0xfb, 0x7f, 0x89,
-    0x55, 0xad, 0x34, 0x06, 0x09, 0xf4, 0xb3, 0x02,
-    0x83, 0xe4, 0x88, 0x83, 0x25, 0x71, 0x41, 0x5a,
-    0x08, 0x51, 0x25, 0xe8, 0xf7, 0xcd, 0xc9, 0x9f,
-    0xd9, 0x1d, 0xbd, 0xf2, 0x80, 0x37, 0x3c, 0x5b,
-    0xd8, 0x82, 0x3e, 0x31, 0x56, 0x34, 0x8f, 0x5b,
-    0xae, 0x6d, 0xac, 0xd4, 0x36, 0xc9, 0x19, 0xc6,
-    0xdd, 0x53, 0xe2, 0xb4, 0x87, 0xda, 0x03, 0xfd,
-    0x02, 0x39, 0x63, 0x06, 0xd2, 0x48, 0xcd, 0xa0,
-    0xe9, 0x9f, 0x33, 0x42, 0x0f, 0x57, 0x7e, 0xe8,
-    0xce, 0x54, 0xb6, 0x70, 0x80, 0xa8, 0x0d, 0x1e,
-    0xc6, 0x98, 0x21, 0xbc, 0xb6, 0xa8, 0x83, 0x93,
-    0x96, 0xf9, 0x65, 0x2b, 0x6f, 0xf7, 0x2a, 0x70, 0x00,
-    /// String2
-    0xd1, 0x31, 0xdd, 0x02, 0xc5, 0xe6, 0xee, 0xc4,
-    0x69, 0x3d, 0x9a, 0x06, 0x98, 0xaf, 0xf9, 0x5c,
-    0x2f, 0xca, 0xb5, 0x07, 0x12, 0x46, 0x7e, 0xab,
-    0x40, 0x04, 0x58, 0x3e, 0xb8, 0xfb, 0x7f, 0x89,
-    0x55, 0xad, 0x34, 0x06, 0x09, 0xf4, 0xb3, 0x02,
-    0x83, 0xe4, 0x88, 0x83, 0x25, 0xf1, 0x41, 0x5a,
-    0x08, 0x51, 0x25, 0xe8, 0xf7, 0xcd, 0xc9, 0x9f,
-    0xd9, 0x1d, 0xbd, 0x72, 0x80, 0x37, 0x3c, 0x5b,
-    0xd8, 0x82, 0x3e, 0x31, 0x56, 0x34, 0x8f, 0x5b,
-    0xae, 0x6d, 0xac, 0xd4, 0x36, 0xc9, 0x19, 0xc6,
-    0xdd, 0x53, 0xe2, 0x34, 0x87, 0xda, 0x03, 0xfd,
-    0x02, 0x39, 0x63, 0x06, 0xd2, 0x48, 0xcd, 0xa0,
-    0xe9, 0x9f, 0x33, 0x42, 0x0f, 0x57, 0x7e, 0xe8,
-    0xce, 0x54, 0xb6, 0x70, 0x80, 0x28, 0x0d, 0x1e,
-    0xc6, 0x98, 0x21, 0xbc, 0xb6, 0xa8, 0x83, 0x93,
-    0x96, 0xf9, 0x65, 0xab, 0x6f, 0xf7, 0x2a, 0x70, 0x00,
-
-    /// FuncOffsetTable
-    0x02, 0x00, 0x00, 0x01, 0x17, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
-
-    /// Samples
-    /// String1:10:1
-    ///  1: 5
-    ///  2.3: 6
-    ///  4: String2:100
-    ///   1: 100
-    /// String2:7:3
-    ///  9: 0
-    0x01, 0x00, 0x0a, 0x02, 0x01, 0x00, 0x05, 0x00,
-    0x02, 0x03, 0x06, 0x00, 0x01, 0x04, 0x00, 0x01,
-    0x64, 0x01, 0x01, 0x00, 0x64, 0x00, 0x00,
-
-    0x03, 0x01, 0x07, 0x01, 0x09, 0x00, 0x00, 0x00,
-    0x00};
-// clang-format on
-
-using namespace llvm;
-using namespace llvm::sampleprof;
-
-TEST(MD5CollisionTest, TestCollision) {
-  auto InputBuffer = MemoryBuffer::getMemBuffer(
-      StringRef(reinterpret_cast<const char *>(ProfileData),
-                sizeof(ProfileData)),
-      "", false);
-  LLVMContext Context;
-  auto FileSystem = vfs::getRealFileSystem();
-  auto Result = SampleProfileReader::create(InputBuffer, Context, *FileSystem);
-  ASSERT_TRUE(Result);
-  SampleProfileReader *Reader = Result->get();
-  ASSERT_FALSE(Reader->read());
-
-  std::vector<StringRef> &NameTable = *Reader->getNameTable();
-  ASSERT_EQ(NameTable.size(), 2U);
-  StringRef S1 = NameTable[0];
-  StringRef S2 = NameTable[1];
-  ASSERT_NE(S1, S2);
-  ASSERT_EQ(MD5Hash(S1), ExpectedHash);
-  ASSERT_EQ(MD5Hash(S2), ExpectedHash);
-
-  // S2's MD5 value collides with S1, S1 is expected to be dropped when S2 is
-  // inserted, as if S1 never existed.
-
-  FunctionSamples ExpectedFS;
-  ExpectedFS.setName(S2);
-  ExpectedFS.setHeadSamples(3);
-  ExpectedFS.setTotalSamples(7);
-  ExpectedFS.addBodySamples(9, 0, 0);
-
-  SampleProfileMap &Profiles = Reader->getProfiles();
-  EXPECT_EQ(Profiles.size(), 1U);
-  if (Profiles.size()) {
-    auto &[Hash, FS] = *Profiles.begin();
-    EXPECT_EQ(Hash, hash_code(ExpectedHash));
-    EXPECT_EQ(FS, ExpectedFS);
-  }
-
-  // Inserting S2 again should fail, returning the existing sample unchanged.
-  auto [It1, Inserted1] = Profiles.try_emplace(S2, FunctionSamples());
-  EXPECT_FALSE(Inserted1);
-  EXPECT_EQ(Profiles.size(), 1U);
-  if (Profiles.size()) {
-    auto &[Hash, FS] = *It1;
-    EXPECT_EQ(Hash, hash_code(ExpectedHash));
-    EXPECT_EQ(FS, ExpectedFS);
-  }
-
-  // Inserting S1 should success as if S2 never existed, and S2 is erased.
-  FunctionSamples FS1;
-  FS1.setName(S1);
-  FS1.setHeadSamples(5);
-  FS1.setTotalSamples(10);
-  FS1.addBodySamples(1, 2, 5);
-
-  auto [It2, Inserted2] = Profiles.try_emplace(S1, FS1);
-  EXPECT_TRUE(Inserted2);
-  EXPECT_EQ(Profiles.size(), 1U);
-  if (Profiles.size()) {
-    auto &[Hash, FS] = *It2;
-    EXPECT_EQ(Hash, hash_code(ExpectedHash));
-    EXPECT_EQ(FS, FS1);
-  }
-}



More information about the llvm-commits mailing list