[PATCH] D40736: [CodeView] Add support for type record content hashing

Tue Dec 5 13:12:27 PST 2017

On Tue, Dec 5, 2017 at 1:05 PM, Zachary Turner <zturner at google.com> wrote:

> There are two reasons.
>
> 1. If you simply replace our current hash function with SHA1, it is a 2x
> slowdown.  SHA1, by itself, is much slower to actually compute.  The
> advantages of SHA1 only kick in when combined with the "global" hash
> algorithm.  i.e. replacing TypeIndices (which are local to a particular
> object file) with previously computed hashes of the records they refer to.
> However, making this sweeping change across the board is invasive, and
> leads into my next point, which is.
>

So, it is a tree hash. Somewhat orthogonal, but it might be worth noting
that a tree hash is not limited to cryptographically secure hash
functions. You can use any hash function to compute a hash value of other
hash values, though you need to be prepared for collisions if it is not a
cryptographic hash function.
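
To illustrate the point, here is a minimal sketch of a tree hash built on a
non-cryptographic function (64-bit FNV-1a). The Record struct, treeHash and
fnv1a are hypothetical names made up for this example, not anything in the
patch, and the sketch assumes every reference points to an earlier record.
Because each reference is replaced by the hash of the record it refers to,
the result does not depend on the local index numbering.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record: payload bytes plus indices of the records it refers to.
struct Record {
  std::vector<uint8_t> Bytes;
  std::vector<uint32_t> Refs; // assumed to point to earlier records only
};

// 64-bit FNV-1a, continued from an existing state H.
static uint64_t fnv1a(uint64_t H, const uint8_t *P, size_t N) {
  for (size_t I = 0; I < N; ++I) {
    H ^= P[I];
    H *= 1099511628211ULL; // FNV prime
  }
  return H;
}

// Hash each record over its own bytes followed by the hashes (not the
// indices) of the records it references.
std::vector<uint64_t> treeHash(const std::vector<Record> &Recs) {
  std::vector<uint64_t> Hashes;
  Hashes.reserve(Recs.size());
  for (const Record &R : Recs) {
    // Start from the FNV offset basis.
    uint64_t H = fnv1a(14695981039346656037ULL, R.Bytes.data(), R.Bytes.size());
    for (uint32_t Ref : R.Refs) {
      assert(Ref < Hashes.size() && "forward references are not handled");
      uint64_t Child = Hashes[Ref];
      // Feeds the child hash in host byte order; fine for a sketch.
      H = fnv1a(H, reinterpret_cast<const uint8_t *>(&Child), sizeof(Child));
    }
    Hashes.push_back(H);
  }
  return Hashes;
}
```

With a 64-bit non-cryptographic hash like this, equal hashes would still need
a content comparison before two records are treated as the same.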

> 2. I don't want to change *anything* about the current algorithm.  We will
> need to be able to iterate on, tune, and benchmark this method against the
> current method.  The current method is already very fast, so I don't want
> to do anything that could affect the performance adversely.  In particular,
> since SHA1 by itself is actually slower to compute, we would probably get an
> unacceptable performance regression.  I'd like to be able to later have a
> flag to LLD that is hidden and experimental, and when it is not used, it
> behaves exactly as it does today, since we already know that is very fast.
> And only after we determine that this new approach is sound and iron out
> all the details, then we can delete the old codepath.
>

Makes sense. I'd guess that eventually you'll make a change to compute the
same hash values as the "global" hash, so that you can handle a set of input
files that is a mix of files with and without the hash values. Is my
understanding correct?

>
> On Tue, Dec 5, 2017 at 12:57 PM Rui Ueyama <ruiu at google.com> wrote:
>
>> What is the reason to use different hash functions for these two cases? I
>> mean, if using SHA1 is faster than a non-cryptographic hash function with
>> content comparison, why don't you always use SHA1?
>>
>> On Tue, Dec 5, 2017 at 7:03 AM, Zachary Turner <zturner at google.com>
>> wrote:
>>
>>> In the case of a LocallyHashedType, collision doesn’t matter at all
>>> because we fall back to a full record comparison when there is a collision.
>>> This is the method that is used today.
>>>
>>> In the PDB, we actually store CRC32s as hashes, which is even worse, but
>>> again it doesn’t matter because it’s just to get the bucket, probing will
>>> do a full equality check. So collision is not even a theoretical problem
>>> for a LocallyHashedType.
>>>
>>> For a GloballyHashedType, the hash is intended to be “as good as” the
>>> record, so instead of a full equality comparison we only compare the full
>>> 20 bytes of SHA1 hash.  In this case, collision is a theoretical problem,
>>> but with probability O(10^-18) because a type stream can’t have more than
>>> 2^32 elements anyway.
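
The collision figures in this thread can be checked with the usual birthday
approximation, p ~= n^2 / 2^(bits+1) for small p. The following is only a
sketch of that arithmetic; the function names are made up for illustration
and it assumes hash values are uniformly distributed.

```cpp
#include <cassert>
#include <cmath>

// Approximate probability of at least one collision among N uniformly
// random values of Bits bits: p ~= N^2 / 2^(Bits+1). Only valid for small p.
double collisionProbability(double N, int Bits) {
  assert(N >= 0 && Bits > 0);
  return N * N / std::ldexp(1.0, Bits + 1);
}

// Approximate number of values that keeps the collision probability at P,
// obtained by inverting the formula above: N ~= sqrt(P * 2^(Bits+1)).
double maxRecords(double P, int Bits) {
  assert(P > 0 && Bits > 0);
  return std::sqrt(std::ldexp(P, Bits + 1));
}
```

Under this approximation, maxRecords(1e-9, 64) comes out at roughly 1.9 * 10^5,
in line with the 190,000 figure in the review comment quoted below, and
collisionProbability(4294967296.0, 160) is 2^-97, comfortably below the
10^-18 order of magnitude mentioned above.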
>>>
>>>
>>> On Mon, Dec 4, 2017 at 11:12 PM Rui Ueyama via Phabricator <
>>> reviews at reviews.llvm.org> wrote:
>>>
>>>> ruiu added inline comments.
>>>>
>>>>
>>>> ================
>>>> Comment at: llvm/include/llvm/DebugInfo/CodeView/TypeHashing.h:29
>>>> +struct LocallyHashedType {
>>>> +  hash_code Hash;
>>>> +  ArrayRef<uint8_t> RecordData;
>>>> ----------------
>>>> Is this used when an object file doesn't have type record hash values?
>>>>
>>>> If you use 64-bit values as unique keys and want to maintain a
>>>> probability of collision lower than 10^-9, for example, the maximum number
>>>> of type records you can have is 190,000, according to [1]. Is this enough?
>>>>
>>>> https://en.wikipedia.org/wiki/Birthday_problem#Probability_table
>>>>
>>>>
>>>> ================
>>>> Comment at: llvm/include/llvm/DebugInfo/CodeView/TypeHashing.h:41
>>>> +/// global hashes of the types that B refers to), a global hash can uniquely
>>>> +/// identify identify that A occurs in another stream that has a completely
>>>> +/// different graph structure.  Although the hash itself is slower to compute,
>>>> ----------------
>>>> identify
>>>>
>>>>
>>>> https://reviews.llvm.org/D40736
>>>>
>>>>
>>>>
>>>>
>>