<div dir="ltr">There are two reasons.<div><br></div><div>1. If you simply replace our current hash function with SHA1, it is a 2x slowdown.  SHA1, by itself, is much slower to actually compute.  The advantages of SHA1 only kick in when combined with the "global" hash algorithm.  i.e. replacing TypeIndices (which are local to a particular object file) with previously computed hashes of the records they refer to.  However, making this sweeping change across the board is invasive, and leads into my next point, which is.</div><div><br></div><div>2. I don't want to change *anything* about the current algorithm.  We will need to be able to iterate on, tune, and benchmark this method against the current method.  The current method is already very fast, so I don't want to do anything that could affect the performance adversely.  In particular, since SHA1 by itself is actually slower to compute, we would probably get a unacceptable performance regression.  I'd like to be able to later have a flag to LLD that is hidden and experimental, and when it is not used, it behaves exactly as it does today, since we already know that is very fast.  And only after we determine that this new approach is sound and iron out all the details, then we can delete the old codepath.</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr">On Tue, Dec 5, 2017 at 12:57 PM Rui Ueyama <<a href="mailto:ruiu@google.com">ruiu@google.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">What is the reason to use different hash functions for these two cases? I mean if using SHA1 is faster than a noncryptic hash function with content comparison, why don't you always use SHA1?</div></div></div><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><br></div><div class="gmail_quote">On Tue, Dec 5, 2017 at 7:03 AM, Zachary Turner <span dir="ltr"><<a href="mailto:zturner@google.com" target="_blank">zturner@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>In the case of a LocallyHashedType, collision doesn’t matter at all because we fall back to a full record comparison when there is a collision.   This is the method that is used today.<br><br>In the PDB, we actually store CRC32s as hashes, which is even worse, but again it doesn’t matter because it’s just to get the bucket, probing will do a full equality check.  So collision is not even a theoretical problem for a LocallyHashedType.<br><br><div dir="auto">For a GloballyHashedType, the hash is intended to be “as good as” the record, so instead of a full equality comparison we only compare the full 20 bytes of SHA1 hash.  In this case, collision is a theoretical problem , but with probability O(10^-18) because a type stream can’t have more than 2^32 elements anyway </div></div><div class="m_2396220923965516071HOEnZb"><div class="m_2396220923965516071h5"><div><br><br><div class="gmail_quote"><div>On Mon, Dec 4, 2017 at 11:12 PM Rui Ueyama via Phabricator <<a href="mailto:reviews@reviews.llvm.org" target="_blank">reviews@reviews.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">ruiu added inline comments.<br>

<br>

<br>

================<br>

Comment at: llvm/include/llvm/DebugInfo/CodeView/TypeHashing.h:29<br>

+struct LocallyHashedType {<br>

+  hash_code Hash;<br>

+  ArrayRef<uint8_t> RecordData;<br>

----------------<br>

Is this used when an object file doesn't have type record hash values?<br>

<br>

If you use 64-bit values as unique keys and want to maintain a probability of collision lower than 10^-9, for example, the maximum number of type records you can have is 190,000, according to [1]. Is this enough?<br>

<br>

<a href="https://en.wikipedia.org/wiki/Birthday_problem#Probability_table" rel="noreferrer" target="_blank">https://en.wikipedia.org/wiki/Birthday_problem#Probability_table</a><br>

<br>

<br>

================<br>

Comment at: llvm/include/llvm/DebugInfo/CodeView/TypeHashing.h:41<br>

+/// global hashes of the types that B refers to), a global hash can uniquely<br>

+/// identify identify that A occurs in another stream that has a completely<br>

+/// different graph structure.  Although the hash itself is slower to compute,<br>

----------------<br>

identify<br>

<br>

<br>

<a href="https://reviews.llvm.org/D40736" rel="noreferrer" target="_blank">https://reviews.llvm.org/D40736</a><br>

<br>

<br>

<br>

</blockquote></div></div>

</div></div></blockquote></div><br></div></div></blockquote></div>