[llvm-commits] [llvm] r152116 - /llvm/trunk/lib/VMCore/ConstantsContext.h

Chandler Carruth chandlerc at google.com
Wed Mar 7 21:04:35 PST 2012


On Wed, Mar 7, 2012 at 8:14 PM, Jakob Stoklund Olesen <stoklund at 2pi.dk> wrote:

>
> On Mar 7, 2012, at 12:29 AM, Duncan Sands wrote:
>
> >> I think you underestimate how slow it is to hash things using
> incremental calls
> >> to hash_combine. The reason for the other interface is that it is
> *much* faster.
> >
> > why is this?  Is it just that the optimizers are doing a poor job, or is
> there a
> > more fundamental reason that means that multiple hash_combine calls
> can't be as
> > efficient as hash_combine_range?
>
> Good hash functions have an internal state that is much larger than the
> final hash. Chandler's variant of CityHash has 6 x 64-bit state registers
> in hash_state. The large internal state is essential for cryptographic hash
> functions to prevent collisions, and hashes that simply try to be fast can
> be less careful about losing entropy when adding new data.
>
> The normal operation of a hash function is to accumulate entropy from all
> input data in the internal state. When all entropy has been collected, the
> internal state is projected onto the smaller output space, a size_t /
> hash_code in this case. The final projection is less performance sensitive
> than the accumulation since it only runs once per hash.
>

So far, this is a great explanation.
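
To make the accumulate/finalize split concrete, here's a minimal toy
sketch of the idea (the class name, constants, and mixing steps are
illustrative assumptions, not LLVM's actual Hashing.h implementation):

#include <cstddef>
#include <cstdint>

class toy_hash_state {
  // Internal state deliberately wider than the final 64-bit result.
  uint64_t S[4] = {0x9ae16a3b2f90404fULL, 0xc3a5c85c97cb3127ULL,
                   0xb492b66fbe98f273ULL, 0x9ddfea08eb382d69ULL};
  size_t Len = 0;

public:
  // Accumulation: cheap per-word mixing into the wide state. This is the
  // hot path, run once per input word.
  void add(uint64_t V) {
    uint64_t &Slot = S[Len % 4];
    Slot = (Slot ^ V) * 0x9ddfea08eb382d69ULL;
    Slot ^= Slot >> 47;
    ++Len;
  }

  // Finalization: project the wide state onto a single 64-bit value.
  // Hashing a flat range runs this once per hash; a tree of hash_combine
  // calls runs it once per node.
  uint64_t finalize() const {
    uint64_t H = Len;
    for (uint64_t Word : S) {
      H = (H ^ Word) * 0x9ddfea08eb382d69ULL;
      H ^= H >> 47;
    }
    return H;
  }
};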


> The hash_combine interface forces hashes to be computed as an expression
> tree. Every call to hash_combine projects onto the smaller output space
> prematurely, and the projection function is executed for each node of the
> expression tree instead of just once. This increases the likelihood of
> collisions, and it makes the hash more expensive to compute than it needs
> to be.
>

While there's a bit of truth to this, it's only a bit. There are a couple
of important factors in the algorithm in question.

First is the final cost you're referring to as "projecting onto a smaller
space". This is usually called the "finalization" phase. It isn't
actually terribly slow in modern hash functions (such as the one I've
added). It's not as free as continuing to hash a run of contiguous data,
but it's quite likely cheaper than crossing cache lines, for example.

Second is that it doesn't actually increase the likelihood of collisions
significantly. I was actually surprised by this, but our hashing experts
(the authors of CityHash and MurmurHash) indicated it really isn't
significant for these types of hash functions. (This clearly isn't
necessarily true, or relevant, for cryptographic functions.)


> Chandler is trying to minimize the number of nodes in the expression tree
> by avoiding hash_combine() calls.


Not quite. If it were *just* the number of nodes in the tree, it really
might not make such a big difference. I think the big difference comes
from placing all the data to be hashed into a dense, contiguous buffer
that we can read out of efficiently.

The functions actually do this internally when they aren't called with an
existing buffer of suitable form, but in this particular case, there isn't
a great way to take advantage of that... More on that in a bit.

> I don't think that allocating memory in a hash function is a good
> alternative, though.
>

True, in the general case this isn't a good tradeoff. But what I've
actually done (and note that this has only come up in one place throughout
LLVM thus far) is a bit more complex than that: I use a SmallVector to
build up small buffers directly on the stack, and only fall back to
allocating from the heap when we're hashing a significant amount of data.
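
As a sketch of that pattern (using the ConstantClass accessors from the
snippet quoted below; this is illustrative, not the exact
ConstantsContext.h code):

static hash_code hashConstant(const ConstantClass *CP) {
  // Gather everything into one dense buffer; the SmallVector stays on the
  // stack unless there are more than 8 entries.
  SmallVector<const void *, 8> Data;
  Data.push_back(CP->getType());
  for (unsigned I = 0, E = CP->getNumOperands(); I != E; ++I)
    Data.push_back(CP->getOperand(I));
  // Hash the contiguous buffer in a single pass, finalizing only once.
  return hash_combine_range(Data.begin(), Data.end());
}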

This *still* isn't ideal, I agree. The hashing routines were actually
designed to avoid heap allocation by keeping a single fixed-size buffer
(hopefully on the stack) and filling and re-using it iteratively. The only
place where I think I currently use the SmallVector strategy is one where
we don't have a good iterator to walk over the elements we want to hash.
Given an iterator, we'll use a fixed buffer and recycle it iteratively.
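
A sketch of that fixed-buffer recycling (a hypothetical helper; the State
type with mix()/finalize() is an assumption standing in for the real
internal hash state):

#include <cstddef>
#include <cstdint>
#include <cstring>

template <typename Iterator, typename State>
uint64_t hash_range_chunked(Iterator First, Iterator Last, State S) {
  char Buffer[64];             // One fixed-size buffer on the stack.
  size_t Used = 0;
  for (; First != Last; ++First) {
    auto V = *First;
    if (Used + sizeof(V) > sizeof(Buffer)) {
      S.mix(Buffer, Used);     // Accumulate the full chunk...
      Used = 0;                // ...then recycle the same buffer.
    }
    std::memcpy(Buffer + Used, &V, sizeof(V));
    Used += sizeof(V);
  }
  if (Used)
    S.mix(Buffer, Used);
  return S.finalize();         // Project onto the output space just once.
}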

The goal of all of this isn't really avoiding the finalization step; it's
reading contiguous, sequential data as opposed to jumping from point to
point.

> I would propose an interface that doesn't constantly collapse the hash
> state when hashing composite objects. The hash_state class can be given an
> interface very similar to raw_ostream:
>
> hash_state &operator<<(hash_state &HS, const ConstantClass *CP) {
>   HS << CP->getType();
>   for (unsigned I = 0, E = CP->getNumOperands(); I < E; ++I)
>     HS << CP->getOperand(I);
>   return HS;
> }
>
> The hash_state object is created by the final consumer of the hash value,
> and passed around by reference to all the parts of a composite object.
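
For concreteness, the consumer side under this proposal might look
something like the following sketch (hash_state and the operator<< come
from the code above; get_hash() is a hypothetical finalizer):

unsigned getConstantHash(const ConstantClass *CP) {
  hash_state HS;          // Created once by the final consumer.
  HS << CP;               // Streams type and operands; no finalization yet.
  return HS.get_hash();   // Collapse the wide state exactly once.
}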


This option was definitely looked at by a bunch of us when designing the
interface for the standards proposal; it was even one of the favorites.
Everyone felt that it was a workable interface, but that it required more
boilerplate code to stitch together. There was a strong desire to remove
almost all explicit loops, as those tend to be expensive both to write and
to optimize.

I think it's not an unreasonable goal to have the above look like:

hash_combine(CP->getType(),
             hash_combine_range(CP->operand_begin(), CP->operand_end()));

This shouldn't require allocating anything on the heap, and it is really
unlikely to be noticeably slower in practice. The slow part is likely to
be reading data that isn't already in cache.


That said, I wouldn't look at the generated code yet. The above interface
being efficient was definitely predicated on the idea that the optimizer
would Do The Right Thing. Currently, it doesn't. I'm trying to fix that. ;]
Without that fix, the function call overhead alone will actually suck up
most of the time.