[LLVMdev] Proposed implementation of N3333 hashing interfaces for LLVM (and possible libc++)

Wed Feb 29 02:52:21 PST 2012

On 29 February 2012 09:35, Chandler Carruth <chandlerc at google.com> wrote:
> I still think we can do more, but it's already much faster than the existing LLVM one except for the issue Tobias pointed out w/ modulo-4 key sizes. I'm going to investigate this

OK, but this is a VERY big exception! Almost any non-string data
anyone wants to hash will be a multiple of 4 bytes in length.

> //===-- llvm/ADT/Hashing.h - Utilities for hashing --------------*- C++ -*-===//

Doesn't this belong in ../Support/ ? No-one's ever explained the
difference to me; I just assumed that generic data types go in ADT/
and everything else goes in Support/.

> //
> //                     The LLVM Compiler Infrastructure
> //
> // This file is distributed under the University of Illinois Open Source
> // License. See LICENSE.TXT for details.
> //
> //===----------------------------------------------------------------------===//
> //
> // This file implements the newly proposed standard C++ interfaces for hashing
> // arbitrary data and building hash functions for user-defined types. This
> // interface was originally proposed in N3333[1] and is currently under review
> // for inclusion in a future TR and/or standard.
> //
> // The primary interfaces provide are comprised of one type and three functions:

"provided"

> //
> //  -- 'hash_code' class is an opaque type representing the hash code for some
> //     data. It is the intended product of hashing, and can be used to implement
> //     hash tables, checksumming, and other common uses of hashes. It is not an
> //     integer type (although it can be converted to one) because it is risky
> //     to assume much about the internals of a hash_code. In particular, each
> //     execution of the program has a high probability of producing a different
> //     hash_code for a given input. Thus their values are not stable to save or
> //     persist, and should only be used during the execution for the
> //     construction of hashing datastructures.
> //
> //  -- 'hash_value' is a function designed to be overloaded for each
> //  user-defined type which wishes to be used within a hashing context. It
> //  should be overloaded within the user-defined type's namespace and found via
> //  ADL. Overloads for primitive types are provided by this library.
> //
> //  -- 'hash_combine' and 'hash_combine_range' are functions designed to aid
> //  programmers in easily and intuitively combining a set of data into a single
> //  hash_code for their object. They should only logically be used within the
> //  implementation of a 'hash_value' routine or similar context.
> //
> //  Note that 'hash_combine_range' contains very special logic for hashing
> //  a contiguous array of integers or pointers. This logic is *extremely* fast,
> //  on a modern Intel "Gainestown" Xeon (Nehalem uarch) @2.2 GHz, these were
> //  benchmarked at over 8.5 GiB/s for large keys, and <20 cycles/

20 cycles per what? Don't keep us in suspense!

> //
> //===----------------------------------------------------------------------===//
>
> #ifndef LLVM_ADT_HASHING_H
> #define LLVM_ADT_HASHING_H
>
> #include "llvm/ADT/STLExtras.h"
> #include "llvm/Support/DataTypes.h"
> #include "llvm/Support/type_traits.h"
> #include <cassert>
> #include <cstring>
> #include <algorithm>
> #include <utility>
> #include <iterator>
>
> namespace llvm {
>
> /// \brief An opaque object representing a hash code.
> ///
> /// This object represents the result of hashing some entity. It is intended to
> /// be used to implement hashtables or other hashing-based data structures.
> /// While it wraps and exposes a numeric value, this value should not be
> /// trusted to be stable or predictable across processes or executions.
> ///
> /// In order to obtain the hash_code for an object 'x':
> /// \code
> ///   using llvm::hash_value;
> ///   llvm::hash_code code = hash_value(x);
> /// \endcode
> ///
> /// Also note that there are two numerical values which are reserved, and the
> /// implementation ensures will never be produced for real hash_codes. These
> /// can be used as sentinels within hashing data structures.
> class hash_code {
>   size_t value;
>
> public:
>   /// \brief Default construct a hash_code. Constructs a null code.
>   hash_code() : value() {}
>
>   /// \brief Form a hash code directly from a numerical value.
>   hash_code(size_t value) : value(value) {}
>
>   /// \brief Convert the hash code to its numerical value for use.
>   /*explicit*/ operator size_t() const { return value; }
>
>   /// \brief Get a hash_code object which corresponds to a null code.
>   ///
>   /// The null code must never be the result of any 'hash_value' calls and can
>   /// be used to detect an unset hash_code.
>   static hash_code get_null_code() { return 0; }
>
>   /// \brief Get a hash_code object which corresponds to an invalid code.
>   ///
>   /// The invalid code must never be the result of any 'hash_value' calls. This
>   /// can be used to flag invalid hash_codes or mark entries in a hash table.
>   static hash_code get_invalid_code() { return -1; }

I'm sure there's a compiler out there just waiting to warn you that
you're implicitly converting -1 to an unsigned type. :-)

>   friend bool operator==(const hash_code &lhs, const hash_code &rhs) {
>     return lhs.value == rhs.value;
>   }
>   friend bool operator!=(const hash_code &lhs, const hash_code &rhs) {
>     return lhs.value != rhs.value;
>   }
>
>   /// \brief Allow a hash_code to be directly run through hash_value.
>   friend size_t hash_value(const hash_code &code) { return code.value; }
> };
>
>
> // All of the implementation details of actually computing the various hash
> // code values are held within this namespace. These routines are included in
> // the header file mainly to allow inlining and constant propagation.
> namespace hashing { namespace detail {

My only comment on the implementation is that, on CPUs that have
different instruction sequences for aligned and unaligned loads, you
want to be careful that the 8-byte memcpys turn into aligned loads
when hashing naturally aligned data. (But if we only care about
running on x86-64, this won't be a problem.)

Jay.