[llvm] f4f85e0 - [llvm-profdata] Remove MD5 collision check in D147740 (#66544)

via llvm-commits llvm-commits at lists.llvm.org
Fri Sep 15 15:30:56 PDT 2023


Author: William Junda Huang
Date: 2023-09-15T22:30:51Z
New Revision: f4f85e0ab405c89e1b843401a055538bd26a0187

URL: https://github.com/llvm/llvm-project/commit/f4f85e0ab405c89e1b843401a055538bd26a0187
DIFF: https://github.com/llvm/llvm-project/commit/f4f85e0ab405c89e1b843401a055538bd26a0187.diff

LOG: [llvm-profdata] Remove MD5 collision check in D147740 (#66544)

This is the patch at https://reviews.llvm.org/D153692, migrating to
Github

After testing D147740 with multiple industrial projects with ~10 million
FunctionSamples, no MD5 collision has been found. In perfect hashing,
the probability of collision for N symbols over K possible hash value is
1 - K!/((K-N)! * K^N). When N is 1 million and K is 2^64, the
probability is 3*10^-8, when N is 10 million the probability is 3*10^-6,
so we are probably not going to find an actual case in real world
application. (However if K is 2^32, the probability of collision is
almost 1, this is indeed a problem, if anyone still use a large profile
on 32-bit machine, as hash_code is tied to size_t). Furthermore, when a
collision happens we can't do anything to recover it, unless using a
multi-map, but that is significantly slower, which contradicts the
purpose of optimizing the profile reader. One more thing, since we have
been using profiles with MD5 names, and they have to be coming from
non-MD5 sources, so if hash collision is to happen, it already happened
when we convert a non-MD5 profile to a MD5 one, so there's no point to
check for that in the reader, and this feature can be removed.

Added: 
    

Modified: 
    llvm/include/llvm/ProfileData/SampleProf.h
    llvm/unittests/tools/llvm-profdata/CMakeLists.txt

Removed: 
    llvm/unittests/tools/llvm-profdata/MD5CollisionTest.cpp


################################################################################
diff  --git a/llvm/include/llvm/ProfileData/SampleProf.h b/llvm/include/llvm/ProfileData/SampleProf.h
index 3cbfa94896e3a24..9b2416e39b6cc9a 100644
--- a/llvm/include/llvm/ProfileData/SampleProf.h
+++ b/llvm/include/llvm/ProfileData/SampleProf.h
@@ -1299,19 +1299,16 @@ raw_ostream &operator<<(raw_ostream &OS, const FunctionSamples &FS);
 /// performance of insert and query operations especially when hash values of
 /// keys are available a priori, and reduces memory usage if KeyT has a large
 /// size.
-/// When performing any action, if an existing entry with a given key is found,
-/// and the interface "KeyT ValueT::getKey<KeyT>() const" to retrieve a value's
-/// original key exists, this class checks if the given key actually matches
-/// the existing entry's original key. If they do not match, this class behaves
-/// as if the entry did not exist (for insertion, this means the new value will
-/// replace the existing entry's value, as if it is newly inserted). If
-/// ValueT::getKey<KeyT>() is not available, all keys with the same hash value
-/// are considered equivalent (i.e. hash collision is silently ignored). Given
-/// such feature this class should only be used where it does not affect
-/// compilation correctness, for example, when loading a sample profile.
-/// Assuming the hashing algorithm is uniform, the probability of hash collision
-/// with 1,000,000 entries is
-/// (2^64)!/((2^64-1000000)!*(2^64)^1000000) ~= 3*10^-8.
+/// All keys with the same hash value are considered equivalent (i.e. hash
+/// collision is silently ignored). Given such feature this class should only be
+/// used where it does not affect compilation correctness, for example, when
+/// loading a sample profile.
+/// Assuming the hashing algorithm is uniform, we use the formula
+/// 1 - Permute(n, k) / n ^ k where n is the universe size and k is number of
+/// elements chosen at random to calculate the probability of collision. With
+/// 1,000,000 entries the probability is negligible:
+/// 1 - (2^64)!/((2^64-1000000)!*(2^64)^1000000) ~= 3*10^-8.
+/// Source: https://en.wikipedia.org/wiki/Birthday_problem
 template <template <typename, typename, typename...> typename MapT,
           typename KeyT, typename ValueT, typename... MapTArgs>
 class HashKeyMap : public MapT<hash_code, ValueT, MapTArgs...> {
@@ -1325,42 +1322,12 @@ class HashKeyMap : public MapT<hash_code, ValueT, MapTArgs...> {
   using iterator = typename base_type::iterator;
   using const_iterator = typename base_type::const_iterator;
 
-private:
-  // If the value type has getKey(), retrieve its original key for comparison.
-  template <typename U = mapped_type,
-            typename = decltype(U().template getKey<original_key_type>())>
-  static bool
-  CheckKeyMatch(const original_key_type &Key, const mapped_type &ExistingValue,
-                original_key_type *ExistingKeyIfDifferent = nullptr) {
-    const original_key_type &ExistingKey =
-        ExistingValue.template getKey<original_key_type>();
-    bool Result = (Key == ExistingKey);
-    if (!Result && ExistingKeyIfDifferent)
-      *ExistingKeyIfDifferent = ExistingKey;
-    return Result;
-  }
-
-  // If getKey() does not exist, this overload is selected, which assumes all
-  // keys with the same hash are equivalent.
-  static bool CheckKeyMatch(...) { return true; }
-
-public:
   template <typename... Ts>
   std::pair<iterator, bool> try_emplace(const key_type &Hash,
                                         const original_key_type &Key,
                                         Ts &&...Args) {
     assert(Hash == hash_value(Key));
-    auto Ret = base_type::try_emplace(Hash, std::forward<Ts>(Args)...);
-    if (!Ret.second) {
-      original_key_type ExistingKey;
-      if (LLVM_UNLIKELY(!CheckKeyMatch(Key, Ret.first->second, &ExistingKey))) {
-        dbgs() << "MD5 collision detected: " << Key << " and " << ExistingKey
-               << " has same hash value " << Hash << "\n";
-        Ret.second = true;
-        Ret.first->second = mapped_type(std::forward<Ts>(Args)...);
-      }
-    }
-    return Ret;
+    return base_type::try_emplace(Hash, std::forward<Ts>(Args)...);
   }
 
   template <typename... Ts>
@@ -1382,8 +1349,7 @@ class HashKeyMap : public MapT<hash_code, ValueT, MapTArgs...> {
     key_type Hash = hash_value(Key);
     auto It = base_type::find(Hash);
     if (It != base_type::end())
-      if (LLVM_LIKELY(CheckKeyMatch(Key, It->second)))
-        return It;
+      return It;
     return base_type::end();
   }
 
@@ -1391,8 +1357,7 @@ class HashKeyMap : public MapT<hash_code, ValueT, MapTArgs...> {
     key_type Hash = hash_value(Key);
     auto It = base_type::find(Hash);
     if (It != base_type::end())
-      if (LLVM_LIKELY(CheckKeyMatch(Key, It->second)))
-        return It;
+      return It;
     return base_type::end();
   }
 
@@ -1433,19 +1398,9 @@ class SampleProfileMap
         Ctx);
   }
 
-  // Overloaded find() to lookup a function by name. This is called by IPO
-  // passes with an actual function name, and it is possible that the profile
-  // reader converted function names in the profile to MD5 strings, so we need
-  // to check if either representation matches.
+  // Overloaded find() to lookup a function by name.
   iterator find(StringRef Fname) {
-    uint64_t Hash = hashFuncName(Fname);
-    auto It = base_type::find(hash_code(Hash));
-    if (It != end()) {
-      StringRef CtxName = It->second.getContext().getName();
-      if (LLVM_LIKELY(CtxName == Fname || CtxName == std::to_string(Hash)))
-        return It;
-    }
-    return end();
+    return base_type::find(hashFuncName(Fname));
   }
 
   size_t erase(const SampleContext &Ctx) {

diff  --git a/llvm/unittests/tools/llvm-profdata/CMakeLists.txt b/llvm/unittests/tools/llvm-profdata/CMakeLists.txt
index c53baf69bbc0ef8..ad91ce36bcb5bef 100644
--- a/llvm/unittests/tools/llvm-profdata/CMakeLists.txt
+++ b/llvm/unittests/tools/llvm-profdata/CMakeLists.txt
@@ -6,7 +6,6 @@ set(LLVM_LINK_COMPONENTS
 
 add_llvm_unittest(LLVMProfdataTests
     OutputSizeLimitTest.cpp
-    MD5CollisionTest.cpp
   )
 
 target_link_libraries(LLVMProfdataTests PRIVATE LLVMTestingSupport)

diff  --git a/llvm/unittests/tools/llvm-profdata/MD5CollisionTest.cpp b/llvm/unittests/tools/llvm-profdata/MD5CollisionTest.cpp
deleted file mode 100644
index 8af74249188dd14..000000000000000
--- a/llvm/unittests/tools/llvm-profdata/MD5CollisionTest.cpp
+++ /dev/null
@@ -1,166 +0,0 @@
-//===- llvm/unittests/tools/llvm-profdata/MD5CollisionTest.cpp ------------===//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-
-/// Test whether the MD5-key SampleProfileMap can handle collision correctly.
-/// Probability of collision is rare but not negligible since we only use the
-/// lower 64 bits of the MD5 value. A unit test is required because the function
-/// names are not printable ASCII characters.
-
-#include "llvm/ProfileData/SampleProfReader.h"
-#include "llvm/Support/VirtualFileSystem.h"
-#include "llvm/Testing/Support/Error.h"
-#include "gtest/gtest.h"
-
-/// According to https://en.wikipedia.org/wiki/MD5#Preimage_vulnerability, the
-/// MD5 of the two strings are 79054025255fb1a26e4bc422aef54eb4.
-
-// First 8 bytes of the MD5.
-const uint64_t ExpectedHash = 0xa2b15f2525400579;
-
-// clang-format off
-const uint8_t ProfileData[] = {
-    0x84, 0xe4, 0xd0, 0xb1, 0xf4, 0xc9, 0x94, 0xa8,
-    0x53, 0x67, 0x03, 0x00, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x7D, 0x00, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x03, 0x01, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x80, 0x01, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x05, 0x00, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x20, 0x00, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x90, 0x01, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x20, 0x00, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x00, 0x00, 0x00,
-
-    /// Name Table
-    0x02,
-    /// String1
-    0xd1, 0x31, 0xdd, 0x02, 0xc5, 0xe6, 0xee, 0xc4,
-    0x69, 0x3d, 0x9a, 0x06, 0x98, 0xaf, 0xf9, 0x5c,
-    0x2f, 0xca, 0xb5, 0x87, 0x12, 0x46, 0x7e, 0xab,
-    0x40, 0x04, 0x58, 0x3e, 0xb8, 0xfb, 0x7f, 0x89,
-    0x55, 0xad, 0x34, 0x06, 0x09, 0xf4, 0xb3, 0x02,
-    0x83, 0xe4, 0x88, 0x83, 0x25, 0x71, 0x41, 0x5a,
-    0x08, 0x51, 0x25, 0xe8, 0xf7, 0xcd, 0xc9, 0x9f,
-    0xd9, 0x1d, 0xbd, 0xf2, 0x80, 0x37, 0x3c, 0x5b,
-    0xd8, 0x82, 0x3e, 0x31, 0x56, 0x34, 0x8f, 0x5b,
-    0xae, 0x6d, 0xac, 0xd4, 0x36, 0xc9, 0x19, 0xc6,
-    0xdd, 0x53, 0xe2, 0xb4, 0x87, 0xda, 0x03, 0xfd,
-    0x02, 0x39, 0x63, 0x06, 0xd2, 0x48, 0xcd, 0xa0,
-    0xe9, 0x9f, 0x33, 0x42, 0x0f, 0x57, 0x7e, 0xe8,
-    0xce, 0x54, 0xb6, 0x70, 0x80, 0xa8, 0x0d, 0x1e,
-    0xc6, 0x98, 0x21, 0xbc, 0xb6, 0xa8, 0x83, 0x93,
-    0x96, 0xf9, 0x65, 0x2b, 0x6f, 0xf7, 0x2a, 0x70, 0x00,
-    /// String2
-    0xd1, 0x31, 0xdd, 0x02, 0xc5, 0xe6, 0xee, 0xc4,
-    0x69, 0x3d, 0x9a, 0x06, 0x98, 0xaf, 0xf9, 0x5c,
-    0x2f, 0xca, 0xb5, 0x07, 0x12, 0x46, 0x7e, 0xab,
-    0x40, 0x04, 0x58, 0x3e, 0xb8, 0xfb, 0x7f, 0x89,
-    0x55, 0xad, 0x34, 0x06, 0x09, 0xf4, 0xb3, 0x02,
-    0x83, 0xe4, 0x88, 0x83, 0x25, 0xf1, 0x41, 0x5a,
-    0x08, 0x51, 0x25, 0xe8, 0xf7, 0xcd, 0xc9, 0x9f,
-    0xd9, 0x1d, 0xbd, 0x72, 0x80, 0x37, 0x3c, 0x5b,
-    0xd8, 0x82, 0x3e, 0x31, 0x56, 0x34, 0x8f, 0x5b,
-    0xae, 0x6d, 0xac, 0xd4, 0x36, 0xc9, 0x19, 0xc6,
-    0xdd, 0x53, 0xe2, 0x34, 0x87, 0xda, 0x03, 0xfd,
-    0x02, 0x39, 0x63, 0x06, 0xd2, 0x48, 0xcd, 0xa0,
-    0xe9, 0x9f, 0x33, 0x42, 0x0f, 0x57, 0x7e, 0xe8,
-    0xce, 0x54, 0xb6, 0x70, 0x80, 0x28, 0x0d, 0x1e,
-    0xc6, 0x98, 0x21, 0xbc, 0xb6, 0xa8, 0x83, 0x93,
-    0x96, 0xf9, 0x65, 0xab, 0x6f, 0xf7, 0x2a, 0x70, 0x00,
-
-    /// FuncOffsetTable
-    0x02, 0x00, 0x00, 0x01, 0x17, 0x00, 0x00, 0x00,
-    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
-
-    /// Samples
-    /// String1:10:1
-    ///  1: 5
-    ///  2.3: 6
-    ///  4: String2:100
-    ///   1: 100
-    /// String2:7:3
-    ///  9: 0
-    0x01, 0x00, 0x0a, 0x02, 0x01, 0x00, 0x05, 0x00,
-    0x02, 0x03, 0x06, 0x00, 0x01, 0x04, 0x00, 0x01,
-    0x64, 0x01, 0x01, 0x00, 0x64, 0x00, 0x00,
-
-    0x03, 0x01, 0x07, 0x01, 0x09, 0x00, 0x00, 0x00,
-    0x00};
-// clang-format on
-
-using namespace llvm;
-using namespace llvm::sampleprof;
-
-TEST(MD5CollisionTest, TestCollision) {
-  auto InputBuffer = MemoryBuffer::getMemBuffer(
-      StringRef(reinterpret_cast<const char *>(ProfileData),
-                sizeof(ProfileData)),
-      "", false);
-  LLVMContext Context;
-  auto FileSystem = vfs::getRealFileSystem();
-  auto Result = SampleProfileReader::create(InputBuffer, Context, *FileSystem);
-  ASSERT_TRUE(Result);
-  SampleProfileReader *Reader = Result->get();
-  ASSERT_FALSE(Reader->read());
-
-  std::vector<StringRef> &NameTable = *Reader->getNameTable();
-  ASSERT_EQ(NameTable.size(), 2U);
-  StringRef S1 = NameTable[0];
-  StringRef S2 = NameTable[1];
-  ASSERT_NE(S1, S2);
-  ASSERT_EQ(MD5Hash(S1), ExpectedHash);
-  ASSERT_EQ(MD5Hash(S2), ExpectedHash);
-
-  // S2's MD5 value collides with S1, S1 is expected to be dropped when S2 is
-  // inserted, as if S1 never existed.
-
-  FunctionSamples ExpectedFS;
-  ExpectedFS.setName(S2);
-  ExpectedFS.setHeadSamples(3);
-  ExpectedFS.setTotalSamples(7);
-  ExpectedFS.addBodySamples(9, 0, 0);
-
-  SampleProfileMap &Profiles = Reader->getProfiles();
-  EXPECT_EQ(Profiles.size(), 1U);
-  if (Profiles.size()) {
-    auto &[Hash, FS] = *Profiles.begin();
-    EXPECT_EQ(Hash, hash_code(ExpectedHash));
-    EXPECT_EQ(FS, ExpectedFS);
-  }
-
-  // Inserting S2 again should fail, returning the existing sample unchanged.
-  auto [It1, Inserted1] = Profiles.try_emplace(S2, FunctionSamples());
-  EXPECT_FALSE(Inserted1);
-  EXPECT_EQ(Profiles.size(), 1U);
-  if (Profiles.size()) {
-    auto &[Hash, FS] = *It1;
-    EXPECT_EQ(Hash, hash_code(ExpectedHash));
-    EXPECT_EQ(FS, ExpectedFS);
-  }
-
-  // Inserting S1 should success as if S2 never existed, and S2 is erased.
-  FunctionSamples FS1;
-  FS1.setName(S1);
-  FS1.setHeadSamples(5);
-  FS1.setTotalSamples(10);
-  FS1.addBodySamples(1, 2, 5);
-
-  auto [It2, Inserted2] = Profiles.try_emplace(S1, FS1);
-  EXPECT_TRUE(Inserted2);
-  EXPECT_EQ(Profiles.size(), 1U);
-  if (Profiles.size()) {
-    auto &[Hash, FS] = *It2;
-    EXPECT_EQ(Hash, hash_code(ExpectedHash));
-    EXPECT_EQ(FS, FS1);
-  }
-}


        


More information about the llvm-commits mailing list