<div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Dec 8, 2015 at 11:33 PM, Justin Lebar <span dir="ltr"><<a href="mailto:jlebar@google.com" target="_blank">jlebar@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">> Would the container writer be able to manipulate a bitstream?<br>

> Currently, the container just gets a hash function to call, whose<br>

> input is a key and whose output is size_t.<br>

<br>

</span>I don't see why it couldn't manipulate the bitstream?  If I'm reading<br>

this API right, it could do something like<br>

<br>

  size_t hash(const T& t) {<br>

    return hash_combine(std::hash_code(), seed, t)();<br>

<span class="">  }<br>

<br>

> Also, the ideal number of bits to prepend, if you're going that route, would be different for different hash function families.<br>

<br>

</span>I guess what you're saying is that the algorithm keeps some number of<br>

bits as internal state, and ideally you'd want to fill up all of those<br>

bits with a seed, rather than just using a size_t?<br>

<br>

But in practice we're not usually interested in exploring the full<br>

range of possible hashers, especially given that flooding protection<br>

is already being taken care of by randomizing at startup.  If you know<br>

you'll need n different hashers, is it not sufficient to seed with<br>

values in the range [0, n)?<br>

<br>

I guess you could just pass initial state to the hash_code constructor<br>

(again if I'm reading this API right).  But that's kind of an ugly API<br>

if the constructor takes a different number of bytes depending on the<br>

hash function family.  Maybe if this information is really needed it<br>

would be better to expose a typedef on some hasher type that tells you<br>

how many bits of entropy it wants -- then it's up to you to prepend<br>

the right amount.<br>

<br>

<br>

This all brings up another concern:<br>

<br>

Looking at the sample implementation [1], hashing a small value, e.g.<br>

a pointer, an int64, or even a short string seems necessarily slower<br>

than if I just implemented std::hash<T> to call FarmHash64 directly.<br>

The reason being that the new API doesn't know a priori how many bytes<br>

I'm going to be hashing, and I'm free to split up those bytes into as<br>

many calls to the hasher as I want.  Therefore it has to carefully<br>

buffer things until I finalize the hasher.<br>

<br>

In contrast, if I just call FarmHash64(some_int), it drops right into<br>

a fast path and doesn't do any buffering and so on.<br>

<br>

I like that the proposal lets us hash aggregate types (vectors,<br>

structs, etc.) while preserving the full state of the hasher (that is,<br>

without finalizing down to size_t), but in accordance with the<br>

zero-cost principle of C++, it's probably important that simple types<br>

are hashed as fast as you could hash them if you were invoking<br>

FarmHash64 directly?<br></blockquote><div><br></div><div>The implementation is permitted to specialize std::hash<T> to call FarmHash64 (or whatever) directly when T doesn't depend on a user-defined type (and the user is permitted to do so when T does depend on a user-defined type), so it's possible to fix this as a QOI matter if it arises in practice. However, I don't expect this to be a practical issue. The API has been very carefully crafted to ensure that the optimizer can discover and take advantage of situations where only a fixed number of bytes are being hashed, and even if it can't eliminate the buffering, the cost of a single memcpy is quite small compared to the cost of evaluating a modern hash function.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

[1] <a href="https://github.com/google/hashing-demo/blob/N0029R0/farmhash.h" rel="noreferrer" target="_blank">https://github.com/google/hashing-demo/blob/N0029R0/farmhash.h</a><br>

<div class="HOEnZb"><div class="h5"><br>

On Tue, Dec 8, 2015 at 7:37 PM, Geoff Pike <<a href="mailto:gpike@google.com">gpike@google.com</a>> wrote:<br>

> On Tue, Dec 8, 2015 at 5:07 PM, Justin Lebar <<a href="mailto:jlebar@google.com">jlebar@google.com</a>> wrote:<br>

>>> The thing I would like is a more explicit discussion of hash function families, and the ability to select from a family.<br>

>><br>

>> How would this be different from the hashtable / bloom filter hashing<br>

>> in a random number into the beginning or end of the bitstream?  Is the<br>

>> idea that we could be more efficient than this, or that we could get<br>

>> better hashes if we had explicit support for seeding the hash?<br>

><br>

> Would the container writer be able to manipulate a bitstream?<br>

> Currently, the container just gets a hash function to call, whose<br>

> input is a key and whose output is size_t.<br>

><br>

> Also, the ideal number of bits to prepend, if you're going that route,<br>

> would be different for different hash function families. It seems more<br>

> logical to me to make the interface more explicit about the existence<br>

> of hash function families.<br>

><br>

> Geoff<br>

</div></div></blockquote></div><br></div></div>