[cfe-dev] [libcxx] Extending the behavior of std::hash in libc++

Geoff Pike via cfe-dev cfe-dev at lists.llvm.org
Wed Dec 9 09:18:10 PST 2015


>> Would the container writer be able to manipulate a bitstream?
>> Currently, the container just gets a hash function to call, whose
>> input is a key and whose output is size_t.
>
> I don't see why it couldn't manipulate the bitstream?  If I'm reading
> this API right, it could do something like
>
>   size_t hash(const T& t) {
>     return hash_combine(std::hash_code(), seed, t)();
>   }
>
>> Also, the ideal number of bits to prepend, if you're going that route, would be different for different hash function families.
>
> I guess what you're saying is that the algorithm keeps some number of
> bits as internal state, and ideally you'd want to fill up all of those
> bits with a seed, rather than just using a size_t?

Yes. Traditionally, when you select from a hash function family, you
initialize some state. (In the crypto literature, that's the "key," and
the thing to be hashed is the "message.") The size of the state and
the allowable range for each number in it depend on the family.
Sometimes an allowable range isn't a power of 2. (For a hairy example,
see vmac_set_key:
http://lxr.free-electrons.com/source/crypto/vmac.c#L487 .)
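To make that concrete, here is a minimal sketch (my own toy, not
anything from the proposal) of what selecting a family member usually
looks like: seed material fills the function's internal state (the
"key"), and hashing then consumes the "message". The key layout and
the mixing constants below are made up purely for illustration:

  #include <cstddef>
  #include <cstdint>
  #include <string>

  // Hypothetical family: each member is identified by two 64-bit words
  // of key material. Here every 64-bit value is allowable, but, as
  // noted above, other families constrain the range of each word.
  struct toy_hasher {
    std::uint64_t k0, k1;  // the "key": selects the family member

    toy_hasher(std::uint64_t seed0, std::uint64_t seed1)
        : k0(seed0), k1(seed1) {}

    std::size_t operator()(const std::string& message) const {
      std::uint64_t h = k0;
      for (unsigned char c : message) {
        h ^= c;
        h *= 0x100000001b3ULL;  // FNV-style mixing, purely illustrative
      }
      return static_cast<std::size_t>(h ^ k1);
    }
  };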

> But in practice we're not usually interested in exploring the full
> range of possible hashers, especially given that flooding protection
> is already being taken care of by randomizing at startup.  If you know

Randomization at startup offers some protection, but it doesn't solve
the hash-flooding problem: for various hash-function families it is
only conjectured to thwart the worst hash-flooding attacks. I
strongly believe hash-based containers should have some hash-flooding
detection and evasion, and I'm working on that. One idea that looks
promising is to have containers use a quick-and-dirty hash function
most of the time, with judicious use of randomly selected, slower but
nicer hash functions as needed.
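
As a rough sketch of that detect-and-evade idea (my own illustration,
not the implementation I'm working on): imagine a set that uses a
cheap default hash and rebuilds itself with a randomly selected member
of the family whenever some bucket grows past a threshold. The hasher,
the threshold, and the mixing constants here are all placeholders:

  #include <cstddef>
  #include <cstdint>
  #include <random>
  #include <string>
  #include <unordered_set>
  #include <utility>

  struct seeded_hash {
    std::uint64_t seed;
    explicit seeded_hash(std::uint64_t s = 0) : seed(s) {}

    std::size_t operator()(const std::string& s) const {
      std::uint64_t h = seed ^ 0xcbf29ce484222325ULL;
      for (unsigned char c : s) { h ^= c; h *= 0x100000001b3ULL; }
      return static_cast<std::size_t>(h);
    }
  };

  class flood_aware_set {
    std::unordered_set<std::string, seeded_hash> set_;  // seed 0: cheap default
    static constexpr std::size_t kMaxBucket = 16;       // placeholder threshold

   public:
    void insert(std::string s) {
      auto it = set_.insert(std::move(s)).first;
      // If the element's bucket is suspiciously long, suspect flooding.
      if (set_.bucket_size(set_.bucket(*it)) > kMaxBucket)
        evade();
    }

   private:
    // Rebuild the table with a randomly selected member of the family.
    void evade() {
      std::random_device rd;
      seeded_hash strong((std::uint64_t(rd()) << 32) | rd());
      std::unordered_set<std::string, seeded_hash> rebuilt(
          set_.bucket_count(), strong);
      for (const auto& s : set_) rebuilt.insert(s);
      set_ = std::move(rebuilt);
    }
  };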

> you'll need n different hashers, is it not sufficient to seed with
> values in the range [0, n)?
>
> I guess you could just pass initial state to the hash_code constructor
> (again if I'm reading this API right).  But that's kind of an ugly API
> if the constructor takes a different number of bytes depending on the
> hash function family.  Maybe if this information is really needed it
> would be better to expose a typedef on some hasher type that tells you
> how many bits of entropy it wants -- then it's up to you to prepend
> the right amount.
>
>
> This all brings up another concern:
>
> Looking at the sample implementation [1], hashing a small value, e.g.
> a pointer, an int64, or even a short string seems necessarily slower
> than if I just implemented std::hash<T> to call FarmHash64 directly.
> The reason being that the new API doesn't know a priori how many bytes
> I'm going to be hashing, and I'm free to split up those bytes into as
> many calls to the hasher as I want.  Therefore it has to carefully
> buffer things until I finalize the hasher.
>
> In contrast, if I just call FarmHash64(some_int), it drops right into
> a fast path and doesn't do any buffering and so on.
>
> I like that the proposal lets us hash aggregate types (vectors,
> structs, etc.) while preserving the full state of the hasher (that is,
> without finalizing down to size_t), but in accordance with the
> zero-cost principle of C++, it's probably important that simple types
> are hashed as fast as you could hash them if you were invoking
> FarmHash64 directly?

Yes, from a performance standpoint, APIs that let me specify exactly
what I'm doing are probably going to be best. For example, maybe I
want to initialize and use a few hash functions that map strings to
integers in [0, number of buckets). I'm guessing the counterargument
relates to API complexity.
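
For illustration only (this is not a proposed interface), such an API
might look roughly like the following: the caller selects the family
member and the table size up front and asks directly for a bucket
index, so the implementation is free to take its fastest path. A real
implementation might call something like FarmHash's seeded 64-bit hash
instead of the toy mixing here:

  #include <cstddef>
  #include <cstdint>
  #include <string>

  class bucket_hasher {
    std::uint64_t seed_;
    std::size_t num_buckets_;  // must be > 0

   public:
    bucket_hasher(std::uint64_t seed, std::size_t num_buckets)
        : seed_(seed), num_buckets_(num_buckets) {}

    // Maps a string straight to [0, num_buckets).
    std::size_t bucket(const std::string& s) const {
      std::uint64_t h = seed_ ^ 0xcbf29ce484222325ULL;
      for (unsigned char c : s) { h ^= c; h *= 0x100000001b3ULL; }
      return static_cast<std::size_t>(h % num_buckets_);
    }
  };

  // Usage: two independently selected functions over the same table size.
  //   bucket_hasher h1(1, 1024), h2(2, 1024);
  //   std::size_t b = h1.bucket("key");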

Geoff


