[cfe-dev] [libcxx] Extending the behavior of std::hash in libc++

Tue Dec 8 23:33:19 PST 2015

> Would the container writer be able to manipulate a bitstream?
> Currently, the container just gets a hash function to call, whose
> input is a key and whose output is size_t.

I don't see why it couldn't manipulate the bitstream?  If I'm reading
this API right, it could do something like

  size_t hash(const T& t) {
    return hash_combine(std::hash_code(), seed, t)();
  }

> Also, the ideal number of bits to prepend, if you're going that route, would be different for different hash function families.

I guess what you're saying is that the algorithm keeps some number of
bits as internal state, and ideally you'd want to fill up all of those
bits with a seed, rather than just using a size_t?

But in practice we're not usually interested in exploring the full
range of possible hashers, especially given that flooding protection
is already being taken care of by randomizing at startup.  If you know
you'll need n different hashers, is it not sufficient to seed with
values in the range [0, n)?

I guess you could just pass initial state to the hash_code constructor
(again if I'm reading this API right).  But that's kind of an ugly API
if the constructor takes a different number of bytes depending on the
hash function family.  Maybe if this information is really needed it
would be better to expose a typedef on some hasher type that tells you
how many bits of entropy it wants -- then it's up to you to prepend
the right amount.

This all brings up another concern:

Looking at the sample implementation [1], hashing a small value, e.g.
a pointer, an int64, or even a short string seems necessarily slower
than if I just implemented std::hash<T> to call FarmHash64 directly.
The reason being that the new API doesn't know a priori how many bytes
I'm going to be hashing, and I'm free to split up those bytes into as
many calls to the hasher as I want.  Therefore it has to carefully
buffer things until I finalize the hasher.

In contrast, if I just call FarmHash64(some_int), it drops right into
a fast path and doesn't do any buffering and so on.

I like that the proposal lets us hash aggregate types (vectors,
structs, etc.) while preserving the full state of the hasher (that is,
without finalizing down to size_t), but in accordance with the
zero-cost principle of C++, it's probably important that simple types
are hashed as fast as you could hash them if you were invoking
FarmHash64 directly?

[1] https://github.com/google/hashing-demo/blob/N0029R0/farmhash.h

On Tue, Dec 8, 2015 at 7:37 PM, Geoff Pike <gpike at google.com> wrote:
> On Tue, Dec 8, 2015 at 5:07 PM, Justin Lebar <jlebar at google.com> wrote:
>>> The thing I would like is a more explicit discussion of hash function families, and the ability to select from a family.
>>
>> How would this be different from the hashtable / bloom filter hashing
>> in a random number into the beginning or end of the bitstream?  Is the
>> idea that we could be more efficient than this, or that we could get
>> better hashes if we had explicit support for seeding the hash?
>
> Would the container writer be able to manipulate a bitstream?
> Currently, the container just gets a hash function to call, whose
> input is a key and whose output is size_t.
>
> Also, the ideal number of bits to prepend, if you're going that route,
> would be different for different hash function families. It seems more
> logical to me to make the interface more explicit about the existence
> of hash function families.
>
> Geoff