[PATCH] D40736: [CodeView] Add support for type record content hashing

Tue Dec 5 13:05:01 PST 2017

There are two reasons.

1. If you simply replace our current hash function with SHA1, it is a 2x
slowdown.  SHA1, by itself, is much slower to actually compute.  The
advantages of SHA1 only kick in when combined with the "global" hash
algorithm.  i.e. replacing TypeIndices (which are local to a particular
object file) with previously computed hashes of the records they refer to.
However, making this sweeping change across the board is invasive, and
leads into my next point, which is.

2. I don't want to change *anything* about the current algorithm.  We will
need to be able to iterate on, tune, and benchmark this method against the
current method.  The current method is already very fast, so I don't want
to do anything that could affect the performance adversely.  In particular,
since SHA1 by itself is actually slower to compute, we would probably get a
unacceptable performance regression.  I'd like to be able to later have a
flag to LLD that is hidden and experimental, and when it is not used, it
behaves exactly as it does today, since we already know that is very fast.
And only after we determine that this new approach is sound and iron out
all the details, then we can delete the old codepath.

On Tue, Dec 5, 2017 at 12:57 PM Rui Ueyama <ruiu at google.com> wrote:

> What is the reason to use different hash functions for these two cases? I
> mean if using SHA1 is faster than a noncryptic hash function with content
> comparison, why don't you always use SHA1?
>
> On Tue, Dec 5, 2017 at 7:03 AM, Zachary Turner <zturner at google.com> wrote:
>
>> In the case of a LocallyHashedType, collision doesn’t matter at all
>> because we fall back to a full record comparison when there is a collision.
>> This is the method that is used today.
>>
>> In the PDB, we actually store CRC32s as hashes, which is even worse, but
>> again it doesn’t matter because it’s just to get the bucket, probing will
>> do a full equality check. So collision is not even a theoretical problem
>> for a LocallyHashedType.
>>
>> For a GloballyHashedType, the hash is intended to be “as good as” the
>> record, so instead of a full equality comparison we only compare the full
>> 20 bytes of SHA1 hash.  In this case, collision is a theoretical problem ,
>> but with probability O(10^-18) because a type stream can’t have more than
>> 2^32 elements anyway
>>
>>
>> On Mon, Dec 4, 2017 at 11:12 PM Rui Ueyama via Phabricator <
>> reviews at reviews.llvm.org> wrote:
>>
>>> ruiu added inline comments.
>>>
>>>
>>> ================
>>> Comment at: llvm/include/llvm/DebugInfo/CodeView/TypeHashing.h:29
>>> +struct LocallyHashedType {
>>> +  hash_code Hash;
>>> +  ArrayRef<uint8_t> RecordData;
>>> ----------------
>>> Is this used when an object file doesn't have type record hash values?
>>>
>>> If you use 64-bit values as unique keys and want to maintain a
>>> probability of collision lower than 10^-9, for example, the maximum number
>>> of type records you can have is 190,000, according to [1]. Is this enough?
>>>
>>> https://en.wikipedia.org/wiki/Birthday_problem#Probability_table
>>>
>>>
>>> ================
>>> Comment at: llvm/include/llvm/DebugInfo/CodeView/TypeHashing.h:41
>>> +/// global hashes of the types that B refers to), a global hash can
>>> uniquely
>>> +/// identify identify that A occurs in another stream that has a
>>> completely
>>> +/// different graph structure.  Although the hash itself is slower to
>>> compute,
>>> ----------------
>>> identify
>>>
>>>
>>> https://reviews.llvm.org/D40736
>>>
>>>
>>>
>>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20171205/7fa61c25/attachment.html>