[PATCH] D22512: Added hash_stream class for producing hash codes from data streams.

Chandler Carruth via llvm-commits llvm-commits at lists.llvm.org
Thu Aug 18 00:40:10 PDT 2016


chandlerc added a comment.

In https://reviews.llvm.org/D22512#518875, @teemperor wrote:

> The immediate motivation is that in https://reviews.llvm.org/D22515 we need to generate a hash code for data that isn't in a container but implicitly stored in the properties of some AST nodes.


Ok, thanks. This really helps.

The current code in Hashing.h is very strongly engineered toward container usage, though. I'm not sure it is a reasonable approach for many other uses.

As one example, it is designed to be statistically resilient to collisions in the space in which containers are likely to exist, and unbiased if high bits are masked off. The use case you suggest doesn't necessarily seem to fit either of those.

I can imagine clone detection actually not wanting *any* collisions -- it essentially might want a *fingerprint* or *signature* rather than merely a hash code. If that is the case, I think an API for doing online updates of MD5 (or better yet Blake2, but that isn't in-tree) would be a much better choice.
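
To make "online updates" concrete, here is a minimal sketch of the shape such an interface might take: data is fed in pieces and the result is read at the end, so it never has to sit in a container. Everything here is hypothetical (StreamingHasher, FakeNode, and hashNode are illustrative names, not anything from D22512 or Hashing.h), and the mixing function is 64-bit FNV-1a purely as a short stand-in. FNV-1a is not a fingerprint; a real signature would put MD5 or similar behind the same update/result shape.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical streaming hasher. The interface is an assumption made for
// illustration; the mixing step is 64-bit FNV-1a, used only as a stand-in.
class StreamingHasher {
  std::uint64_t State = 14695981039346656037ULL; // FNV-1a offset basis

public:
  // Fold Size raw bytes into the running state.
  void update(const void *Data, std::size_t Size) {
    const unsigned char *Bytes = static_cast<const unsigned char *>(Data);
    for (std::size_t I = 0; I != Size; ++I) {
      State ^= Bytes[I];
      State *= 1099511628211ULL; // FNV-1a prime
    }
  }

  // Convenience overload: fold in the characters of a string.
  void update(const std::string &S) { update(S.data(), S.size()); }

  // Read the hash of everything fed in so far.
  std::uint64_t result() const { return State; }
};

// Hypothetical stand-in for an AST node whose properties are not stored
// in any container.
struct FakeNode {
  std::string Kind;
  std::uint32_t NumArgs;
};

// Hash a node by streaming its properties one at a time.
std::uint64_t hashNode(const FakeNode &N) {
  StreamingHasher H;
  H.update(N.Kind);
  H.update(&N.NumArgs, sizeof(N.NumArgs));
  return H.result();
}
```

The property that matters is that splitting the input across several update calls gives the same result as a single call over the concatenated bytes. Note that this sketch adds no separators between fields, so distinct field sequences with the same concatenated bytes would collide; a real fingerprint would need to encode lengths or tags.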

I can also imagine clone detection using this more like a hash-similarity search or bloom filter. In that case, cityhash is very likely to be a much more rigorous (and slower) hash than you would want.

Have you looked at these options at all? If so, what tradeoffs made them unappealing and made cityhash itself appealing?


https://reviews.llvm.org/D22512




