[cfe-dev] [libcxx] Extending the behavior of std::hash in libc++

Wed Dec 9 08:58:42 PST 2015

On Tue, Dec 8, 2015 at 11:33 PM, Justin Lebar <jlebar at google.com> wrote:

> > Would the container writer be able to manipulate a bitstream?
> > Currently, the container just gets a hash function to call, whose
> > input is a key and whose output is size_t.
>
> I don't see why it couldn't manipulate the bitstream?  If I'm reading
> this API right, it could do something like
>
>   size_t hash(const T& t) {
>     return hash_combine(std::hash_code(), seed, t)();
>   }
>
> > Also, the ideal number of bits to prepend, if you're going that route,
> would be different for different hash function families.
>
> I guess what you're saying is that the algorithm keeps some number of
> bits as internal state, and ideally you'd want to fill up all of those
> bits with a seed, rather than just using a size_t?
>
> But in practice we're not usually interested in exploring the full
> range of possible hashers, especially given that flooding protection
> is already being taken care of by randomizing at startup.  If you know
> you'll need n different hashers, is it not sufficient to seed with
> values in the range [0, n)?
>
> I guess you could just pass initial state to the hash_code constructor
> (again if I'm reading this API right).  But that's kind of an ugly API
> if the constructor takes a different number of bytes depending on the
> hash function family.  Maybe if this information is really needed it
> would be better to expose a typedef on some hasher type that tells you
> how many bits of entropy it wants -- then it's up to you to prepend
> the right amount.
>
>
> This all brings up another concern:
>
> Looking at the sample implementation [1], hashing a small value, e.g.
> a pointer, an int64, or even a short string seems necessarily slower
> than if I just implemented std::hash<T> to call FarmHash64 directly.
> The reason being that the new API doesn't know a priori how many bytes
> I'm going to be hashing, and I'm free to split up those bytes into as
> many calls to the hasher as I want.  Therefore it has to carefully
> buffer things until I finalize the hasher.
>
> In contrast, if I just call FarmHash64(some_int), it drops right into
> a fast path and doesn't do any buffering and so on.
>
> I like that the proposal lets us hash aggregate types (vectors,
> structs, etc.) while preserving the full state of the hasher (that is,
> without finalizing down to size_t), but in accordance with the
> zero-cost principle of C++, it's probably important that simple types
> are hashed as fast as you could hash them if you were invoking
> FarmHash64 directly?
>

The implementation is permitted to specialize std::hash<T> to call
FarmHash64 (or whatever) directly when T doesn't depend on a user-defined
type (and the user is permitted to do so when T does depend on a
user-defined type), so it's possible to fix this as a QOI matter if it
arises in practice. However, I don't expect this to be a practical issue.
The API has been very carefully crafted to ensure that the optimizer can
discover and take advantage of situations where only a fixed number of
bytes are being hashed, and even if it can't eliminate the buffering, the
cost of a single memcpy is quite small compared to the cost of evaluating a
modern hash function.

>
> [1] https://github.com/google/hashing-demo/blob/N0029R0/farmhash.h
>
> On Tue, Dec 8, 2015 at 7:37 PM, Geoff Pike <gpike at google.com> wrote:
> > On Tue, Dec 8, 2015 at 5:07 PM, Justin Lebar <jlebar at google.com> wrote:
> >>> The thing I would like is a more explicit discussion of hash function
> families, and the ability to select from a family.
> >>
> >> How would this be different from the hashtable / bloom filter hashing
> >> in a random number into the beginning or end of the bitstream?  Is the
> >> idea that we could be more efficient than this, or that we could get
> >> better hashes if we had explicit support for seeding the hash?
> >
> > Would the container writer be able to manipulate a bitstream?
> > Currently, the container just gets a hash function to call, whose
> > input is a key and whose output is size_t.
> >
> > Also, the ideal number of bits to prepend, if you're going that route,
> > would be different for different hash function families. It seems more
> > logical to me to make the interface more explicit about the existence
> > of hash function families.
> >
> > Geoff
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20151209/1ad333f8/attachment.html>