[all-commits] [llvm/llvm-project] f4f85e: [llvm-profdata] Remove MD5 collision check in D147...

William Junda Huang via All-commits all-commits at lists.llvm.org
Fri Sep 15 15:31:05 PDT 2023


  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: f4f85e0ab405c89e1b843401a055538bd26a0187
      https://github.com/llvm/llvm-project/commit/f4f85e0ab405c89e1b843401a055538bd26a0187
  Author: William Junda Huang <williamjhuang at google.com>
  Date:   2023-09-15 (Fri, 15 Sep 2023)

  Changed paths:
    M llvm/include/llvm/ProfileData/SampleProf.h
    M llvm/unittests/tools/llvm-profdata/CMakeLists.txt
    R llvm/unittests/tools/llvm-profdata/MD5CollisionTest.cpp

  Log Message:
  -----------
  [llvm-profdata] Remove MD5 collision check in D147740 (#66544)

This is the patch at https://reviews.llvm.org/D153692, migrated to
GitHub.

After testing D147740 on multiple industrial projects with ~10 million
FunctionSamples, no MD5 collision has been found. Assuming a uniformly
distributed hash, the probability of at least one collision among N
symbols over K possible hash values is 1 - K!/((K-N)! * K^N), which is
approximately N^2/(2K) when N is much smaller than K. When N is
1 million and K is 2^64, the probability is about 3*10^-8; when N is
10 million it is about 3*10^-6, so we are unlikely to encounter a
collision in real-world applications. (However, if K is 2^32, the
probability of collision is almost 1. This is indeed a problem if anyone
still uses a large profile on a 32-bit machine, as hash_code is tied to
size_t.) Furthermore, when a collision happens we cannot recover from it
without using a multi-map, which is significantly slower and defeats the
purpose of optimizing the profile reader. Finally, the profiles with MD5
names that we have been using must have come from non-MD5 sources, so if
a hash collision were to happen, it would already have happened when the
non-MD5 profile was converted to an MD5 one. There is therefore no point
in checking for collisions in the reader, and this feature can be
removed.
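
The figures quoted above can be reproduced with a short calculation. The
following is a minimal sketch, not part of the patch: it assumes hash
values are uniformly distributed and uses the standard birthday
approximation 1 - exp(-N(N-1)/(2K)) in place of the exact factorial
expression. The function name collision_probability is hypothetical.

```python
import math


def collision_probability(n: int, k: int) -> float:
    """Approximate chance of at least one collision among n uniform hashes over k values.

    Uses the birthday approximation 1 - exp(-n(n-1)/(2k)), which is accurate
    when n is much smaller than k. This is an illustrative helper, not LLVM code.
    """
    # expm1 keeps precision when the result is tiny (e.g. around 1e-8).
    return -math.expm1(-n * (n - 1) / (2 * k))


if __name__ == "__main__":
    # The three (N, K) pairs discussed in the log message.
    for n, k in [(10**6, 2**64), (10**7, 2**64), (10**6, 2**32)]:
        p = collision_probability(n, k)
        print(f"N={n:.0e}, K=2^{k.bit_length() - 1}: {p:.2e}")
```

With a 64-bit hash the result stays far below one in a hundred thousand
even at 10 million symbols, while with a 32-bit hash 1 million symbols
already make a collision practically certain.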



