[PATCH] D40736: [CodeView] Add support for type record content hashing

Zachary Turner via llvm-commits llvm-commits at lists.llvm.org
Tue Dec 5 13:22:28 PST 2017


We have a couple of ideas to explore with regard to mixed input files.

The most straightforward approach is: for any record that doesn't have a
hash in the object file, let the merging algorithm compute one on the fly.
This would be easy to implement, and it would be faster most of the time,
but slower in some cases (depending on the ratio of missing to present
hashes).

Another approach is to examine each object file up front and compute all
missing hashes in parallel.  This approach seems strictly better than #1,
but I mention #1 anyway because it's the "dumbest" approach.
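
A minimal sketch of what #2 could look like, using placeholder types and a
stand-in hash function rather than the real LLVM/LLD structures (plain
std::async stands in for whatever parallelism primitive we would actually
use):

  #include <cstdint>
  #include <future>
  #include <optional>
  #include <vector>

  struct TypeRecord {
    std::vector<uint8_t> Data;
    std::optional<uint64_t> Hash;   // present only if the producer emitted one
  };

  struct ObjectFile {
    std::vector<TypeRecord> Types;
  };

  // Placeholder hash (FNV-1a); stands in for whatever the merge actually uses.
  static uint64_t computeHash(const std::vector<uint8_t> &Data) {
    uint64_t H = 0xcbf29ce484222325ULL;
    for (uint8_t B : Data)
      H = (H ^ B) * 0x100000001b3ULL;
    return H;
  }

  // Approach #2: one pass over the inputs, filling in missing hashes in
  // parallel, before the single-threaded merge starts.
  void hashMissingRecords(std::vector<ObjectFile> &Inputs) {
    std::vector<std::future<void>> Jobs;
    for (ObjectFile &Obj : Inputs)
      Jobs.push_back(std::async(std::launch::async, [&Obj] {
        for (TypeRecord &R : Obj.Types)
          if (!R.Hash)                    // only compute what the file lacks
            R.Hash = computeHash(R.Data);
      }));
    for (auto &J : Jobs)
      J.wait();                           // all inputs hashed; merge can proceed
  }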

We could also have a separate tool that pre-processes object files and
library archives, adding hashes to any object files that lack them and
rewriting the archive.  (Even if we do this, we still need to be fast in
case people don't pre-process their inputs, so #2 is still useful.)

On Tue, Dec 5, 2017 at 1:12 PM Rui Ueyama <ruiu at google.com> wrote:

> On Tue, Dec 5, 2017 at 1:05 PM, Zachary Turner <zturner at google.com> wrote:
>
>> There are two reasons.
>>
>> 1. If you simply replace our current hash function with SHA1, it is a 2x
>> slowdown.  SHA1, by itself, is much slower to compute.  The advantages of
>> SHA1 only kick in when combined with the "global" hash algorithm, i.e.
>> replacing TypeIndices (which are local to a particular object file) with
>> previously computed hashes of the records they refer to.  However, making
>> this sweeping change across the board is invasive, which leads into my
>> next point.
>>
>
> So, it is a tree hash. Somewhat orthogonal, but it might be worth noting
> that the tree hash is not limited to a cryptographically-safe hash
> function. You can use any hash function to compute a hash value of other
> hash values, though you need to prepare for collisions if it is not
> a cryptographic hash function.
>
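
For concreteness, here is a rough sketch of the tree-hash structure being
discussed, with illustrative types and a placeholder non-cryptographic hash
rather than the patch's actual SHA1-based code (as noted above, any hash
function fits here, with the collision caveat):

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  using HashValue = uint64_t;  // stand-in; the patch uses 20-byte SHA1 values

  struct Record {
    std::vector<uint8_t> Content;      // record bytes with type indices blanked
    std::vector<uint32_t> Referenced;  // positions of records this one refers to
  };

  // Placeholder FNV-1a; SHA1 (or anything else) slots in here instead.
  static HashValue hashBytes(const uint8_t *Data, size_t Len,
                             HashValue H = 0xcbf29ce484222325ULL) {
    for (size_t I = 0; I < Len; ++I)
      H = (H ^ Data[I]) * 0x100000001b3ULL;
    return H;
  }

  // Records must be visited with referenced records first; CodeView type
  // streams essentially give us that ordering, since a record can only
  // refer to earlier indices.
  std::vector<HashValue> hashStream(const std::vector<Record> &Stream) {
    std::vector<HashValue> Hashes(Stream.size());
    for (size_t I = 0; I < Stream.size(); ++I) {
      HashValue H = hashBytes(Stream[I].Content.data(), Stream[I].Content.size());
      for (uint32_t Ref : Stream[I].Referenced)
        H = hashBytes(reinterpret_cast<const uint8_t *>(&Hashes[Ref]),
                      sizeof(HashValue), H);  // fold in hashes, not indices
      Hashes[I] = H;
    }
    return Hashes;
  }
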
>> 2. I don't want to change *anything* about the current algorithm.  We
>> will need to be able to iterate on, tune, and benchmark this method
>> against the current method.  The current method is already very fast, so
>> I don't want to do anything that could affect its performance adversely.
>> In particular, since SHA1 by itself is actually slower to compute, we
>> would probably get an unacceptable performance regression.  I'd like to
>> later add a hidden, experimental flag to LLD; when it is not used, LLD
>> behaves exactly as it does today, which we already know is very fast.
>> Only after we determine that this new approach is sound and iron out all
>> the details can we delete the old codepath.
>>
>
> Makes sense. I'd guess that eventually you'll make a change to compute the
> same hash values as the "global" hash, so that you can handle a set of
> input files that is a mix of files with and without the hash values. Is my
> understanding correct?
>
>
>>
>> On Tue, Dec 5, 2017 at 12:57 PM Rui Ueyama <ruiu at google.com> wrote:
>>
>>> What is the reason to use different hash functions for these two cases?
>>> I mean, if using SHA1 is faster than a non-cryptographic hash function
>>> with content comparison, why don't you always use SHA1?
>>>
>>> On Tue, Dec 5, 2017 at 7:03 AM, Zachary Turner <zturner at google.com>
>>> wrote:
>>>
>>>> In the case of a LocallyHashedType, collision doesn’t matter at all
>>>> because we fall back to a full record comparison when there is a collision.
>>>> This is the method that is used today.
>>>>
>>>> In the PDB, we actually store CRC32s as hashes, which is even worse,
>>>> but again it doesn't matter because the hash is only used to pick the
>>>> bucket; probing will do a full equality check. So collision is not even
>>>> a theoretical problem for a LocallyHashedType.
>>>>
>>>> For a GloballyHashedType, the hash is intended to be "as good as" the
>>>> record, so instead of a full equality comparison we compare only the
>>>> 20 bytes of the SHA1 hash.  In this case, collision is a theoretical
>>>> problem, but with probability on the order of 10^-18, because a type
>>>> stream can't have more than 2^32 elements anyway.
>>>>
>>>>
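
To make the two cases concrete, here is a rough sketch of the difference
(simplified stand-in types, not the exact definitions in the patch):

  #include <array>
  #include <cstdint>
  #include <cstring>
  #include <vector>

  // Local flavor: the hash only picks a bucket; identity is the record bytes.
  struct LocalType {
    uint32_t Hash;                  // cheap hash, CRC32 or similar
    std::vector<uint8_t> Record;
  };

  bool sameType(const LocalType &A, const LocalType &B) {
    // A collision just costs one extra byte-wise comparison.
    return A.Record.size() == B.Record.size() &&
           std::memcmp(A.Record.data(), B.Record.data(), A.Record.size()) == 0;
  }

  // Global flavor: the 20-byte content hash *is* the identity.
  struct GlobalType {
    std::array<uint8_t, 20> Hash;   // SHA1 over the index-rewritten record
  };

  bool sameType(const GlobalType &A, const GlobalType &B) {
    return A.Hash == B.Hash;        // no record bytes consulted at merge time
  }
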
>>>> On Mon, Dec 4, 2017 at 11:12 PM Rui Ueyama via Phabricator <
>>>> reviews at reviews.llvm.org> wrote:
>>>>
>>>>> ruiu added inline comments.
>>>>>
>>>>>
>>>>> ================
>>>>> Comment at: llvm/include/llvm/DebugInfo/CodeView/TypeHashing.h:29
>>>>> +struct LocallyHashedType {
>>>>> +  hash_code Hash;
>>>>> +  ArrayRef<uint8_t> RecordData;
>>>>> ----------------
>>>>> Is this used when an object file doesn't have type record hash values?
>>>>>
>>>>> If you use 64-bit values as unique keys and want to maintain a
>>>>> probability of collision lower than 10^-9, for example, the maximum number
>>>>> of type records you can have is 190,000, according to [1]. Is this enough?
>>>>>
>>>>> https://en.wikipedia.org/wiki/Birthday_problem#Probability_table
>>>>>
>>>>>
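
(For reference, the rough arithmetic behind that 190,000 figure: by the
birthday approximation, with 64-bit hashes P(collision) is roughly
n^2 / (2 * 2^64), so requiring P < 10^-9 gives
n < sqrt(2 * 2^64 * 10^-9), i.e. about 1.9 * 10^5 records.)
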
>>>>> ================
>>>>> Comment at: llvm/include/llvm/DebugInfo/CodeView/TypeHashing.h:41
>>>>> +/// global hashes of the types that B refers to), a global hash can uniquely
>>>>> +/// identify identify that A occurs in another stream that has a completely
>>>>> +/// different graph structure.  Although the hash itself is slower to compute,
>>>>> ----------------
>>>>> identify
>>>>>
>>>>>
>>>>> https://reviews.llvm.org/D40736
>>>>>
>>>>>
>>>>>
>>>>>
>>>

