[LLVMdev] We need better hashing

Fri Feb 17 01:46:29 PST 2012

Sorry for coming late to this thread. I've been very active in recent work
on hashing routines inside and outside of Google, so I've a keen interest
in this...

Some initial clarifications:

On Tue, Feb 14, 2012 at 2:26 AM, Chris Lattner <clattner at apple.com> wrote:

>
> On Feb 13, 2012, at 1:27 AM, Jay Foad wrote:
>
> > On 13 February 2012 09:22, Jay Foad <jay.foad at gmail.com> wrote:
> >> Would it be possible to use CityHash instead for strings?
> >>
> >> http://code.google.com/p/cityhash/
> >
> > Incidentally there was talk of using CityHash for LLVM's StringMap
> > last year, but I don't think it ever came to anything:
> >
> > http://lists.cs.uiuc.edu/pipermail/cfe-dev/2011-April/014656.html
>
> At that point, there wasn't a clear win.  CityHash (iirc) is optimized for
> huge strings, which aren't the usual case in the compiler AFAIK.
>

This isn't entirely accurate. Let me try to clarify.

For reference, I'm the one that did this. I looked at both DenseMap and
StringMap quite closely. I replaced all of the hashing I could to use
CityHash. The reason I did this was because collisions in the hash table
showed up on the profile in a few test cases with Clang.

However, when measuring it, what I observed was that while the collisions
did show up, they were still small enough overall that increasing the
complexity of the hash function to achieve fewer collisions was not in fact
a good tradeoff. They just didn't happen *that* often.

CityHash has some specializations for larger strings to make it a
reasonably good hash function, but for truly huge strings, it still isn't
ideal. There are some very interesting CRC instruction based hashing
strategies that are much more promising for *huge* datasets. The goal of
CityHash is to be an excellent hashing function for *general* usage. That
includes small strings, ints, structs, and other common keys.

That said, any really high quality hash function is going to spend more
cycles-per-byte on small inputs than an large inputs because it is harder
to "mix" smaller inputs sufficiently. This doesn't make CityHash bad for
general usage, it's just something to be aware of.

I measured *no* performance degradation from switching to cityhash. These
benchmarks as well as others led to libc++ adopting CityHash for its
unordered containers. The reason I didn't push forward is that I also
measured no performance improvements from it for LLVM and Clang. The reason
seemed pretty clear: there aren't enough collisions for a higher quality
hashing algorithm to have a high impact. LLVM and Clang's tables are
(relatively speaking) very small. Doesn't mean a bad algorithm should be
kept around, but it limited the amount of time I wanted to invest.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120217/8e90aa1e/attachment.html>