[LLVMdev] ADT/Hashing.h on 32-bit platforms

Stephan Tolksdorf st at quanttec.com
Mon Feb 3 13:46:15 PST 2014


On 02.02.14 00:48, Chandler Carruth wrote:
> On Sat, Feb 1, 2014 at 8:35 AM, Stephan Tolksdorf <st at quanttec.com
> <mailto:st at quanttec.com>> wrote:
>
>     Hi,
>
>     Currently the hashing implementation in ADT/Hashing.h produces hash
>     values on 32-bit platforms that differ from the lower 32-bits of the
>     hash values produced on 64-bit platforms. It seems the only reason
>     for this difference is that the uint64_t integer seed is truncated
>     to size_t. Since the usage of uint64_t and size_t as types for seed
>     values in the implementation is somewhat inconsistent, I'm
>     wondering: Was this difference deliberately introduced as a
>     performance optimization for 32-bit code, or is this maybe just an
>     implementation artifact?
>
>     I also noted that the implementation relies on implicit conversions
>     from uint64_t to size_t in several places where hash_code values are
>     returned, which may trigger compiler warnings on 32-bit platforms.
>
>
> Almost all of this was largely incidental on my part. I didn't spend a
> lot of time tuning the 32-bit platform performance. If you have ideas
> that you think would make things better, I'm certainly interested. =D
>
> One thing I would note is that I would prefer to retain the collision
> resistance where possible. So I would be leery of reducing the internal
> state used. I would more focus on making the outputs friendly for 32-bit
> platforms and the inputs consistent.

I've attached a patch that makes the hashing implementation consistently 
use a 64-bit seed everywhere.  With this change the implementation 
should produce the same hash codes modulo SIZE_MAX + 1 for identical 
values independent of the platform. (Though hash_combine may still 
produce different results if the size of an argument type differs 
between platforms). I suspect the negative performance impact on 32-bit 
platforms should be small, but I didn't do any benchmarking. With 
atomics one could probably replace the thread safe local static in 
get_execution_seed with something that has a little less overhead.

The patch also removes a FIXME from the set_fixed_execution_seed 
implementation and rewords the documentation string a bit, since using 
this function in a multi-threaded program requires some kind of external 
synchronization anyway.

Another attached patch implements llvm::sys::Process::GetRandomNumber() 
on Windows. (There's already a Unix implementation.) This function could 
be useful for implementing the missing random per-execution seed 
functionality. (Though it might take a little care to deal with 
Process::GetRandomNumberSeed potentially calling back into hash_combine 
on Unix.)

In a little experiment I made, changing the seed per execution seemed to 
lead to test suite errors for clang's PCH support and other components.

- Stephan

-------------- next part --------------
From 2a7548a5dd0c1a63e3b8b15273beab65cb8e139a Mon Sep 17 00:00:00 2001
From: Stephan Tolksdorf <st at quanttec.com>
Date: Mon, 3 Feb 2014 21:25:00 +0100
Subject: [PATCH 1/3] Use 64-bit hash seed on 32-bit platforms

With this change the hash_value funtion should produce the same results
(modulo MAX_SIZE + 1) for integers on all platforms.
---
 include/llvm/ADT/Hashing.h    | 46 +++++++++++++++++++---------------
 lib/Support/Hashing.cpp       |  5 ++--
 unittests/ADT/HashingTest.cpp | 57 -------------------------------------------
 3 files changed, 28 insertions(+), 80 deletions(-)

diff --git a/include/llvm/ADT/Hashing.h b/include/llvm/ADT/Hashing.h
index e434417..4976808 100644
--- a/include/llvm/ADT/Hashing.h
+++ b/include/llvm/ADT/Hashing.h
@@ -65,14 +65,14 @@
 namespace llvm {
 
 /// \brief An opaque object representing a hash code.
 ///
 /// This object represents the result of hashing some entity. It is intended to
 /// be used to implement hashtables or other hashing-based data structures.
-/// While it wraps and exposes a numeric value, this value should not be
-/// trusted to be stable or predictable across processes or executions.
+/// While it wraps and exposes a numeric value of type size_t, this value should 
+/// not be trusted to be stable or predictable across processes or executions.
 ///
 /// In order to obtain the hash_code for an object 'x':
 /// \code
 ///   using llvm::hash_value;
 ///   llvm::hash_code code = hash_value(x);
 /// \endcode
@@ -82,13 +82,15 @@ class hash_code {
 public:
   /// \brief Default construct a hash_code.
   /// Note that this leaves the value uninitialized.
   hash_code() {}
 
   /// \brief Form a hash code directly from a numerical value.
-  hash_code(size_t value) : value(value) {}
+  ///
+  /// The argument is casted to and stored as a size_t value.
+  hash_code(uint64_t value) : value(static_cast<size_t>(value)) {}
 
   /// \brief Convert the hash code to its numerical value for use.
   /*explicit*/ operator size_t() const { return value; }
 
   friend bool operator==(const hash_code &lhs, const hash_code &rhs) {
     return lhs.value == rhs.value;
@@ -125,24 +127,26 @@ template <typename T>
 hash_code hash_value(const std::basic_string<T> &arg);
 
 
 /// \brief Override the execution seed with a fixed value.
 ///
 /// This hashing library uses a per-execution seed designed to change on each
-/// run with high probability in order to ensure that the hash codes are not
-/// attackable and to ensure that output which is intended to be stable does
-/// not rely on the particulars of the hash codes produced.
+/// run with high probability in order to improve the protection against hash
+/// collision DoS attacks and to ensure that output which is intended to be
+/// stable does not rely on the particulars of the hash codes produced.
 ///
 /// That said, there are use cases where it is important to be able to
 /// reproduce *exactly* a specific behavior. To that end, we provide a function
 /// which will forcibly set the seed to a fixed value. This must be done at the
-/// start of the program, before any hashes are computed. Also, it cannot be
-/// undone. This makes it thread-hostile and very hard to use outside of
-/// immediately on start of a simple program designed for reproducible
-/// behavior.
-void set_fixed_execution_hash_seed(size_t fixed_value);
+/// start of the program, before any hashes are computed. Also, it can only be
+/// done once and not be undone. This makes it very hard to use outside of
+/// immediately on start of a simple program designed for reproducible behavior.
+///
+/// This function is not thread safe. A call to this function must be externally
+/// synchronized such that it happens before any call to a hashing function.
+void set_fixed_execution_hash_seed(uint64_t fixed_value);
 
 
 // All of the implementation details of actually computing the various hash
 // code values are held within this namespace. These routines are included in
 // the header file mainly to allow inlining and constant propagation.
 namespace hashing {
@@ -320,24 +324,24 @@ struct hash_state {
 
 /// \brief A global, fixed seed-override variable.
 ///
 /// This variable can be set using the \see llvm::set_fixed_execution_seed
 /// function. See that function for details. Do not, under any circumstances,
 /// set or read this variable.
-extern size_t fixed_seed_override;
+extern uint64_t fixed_seed_override;
 
-inline size_t get_execution_seed() {
+inline uint64_t get_execution_seed() {
   // FIXME: This needs to be a per-execution seed. This is just a placeholder
   // implementation. Switching to a per-execution seed is likely to flush out
   // instability bugs and so will happen as its own commit.
   //
   // However, if there is a fixed seed override set the first time this is
   // called, return that instead of the per-execution seed.
   const uint64_t seed_prime = 0xff51afd7ed558ccdULL;
-  static size_t seed = fixed_seed_override ? fixed_seed_override
-                                           : (size_t)seed_prime;
+  static uint64_t seed = fixed_seed_override ? fixed_seed_override
+                                             : seed_prime;
   return seed;
 }
 
 
 /// \brief Trait to indicate whether a type's bits can be hashed directly.
 ///
@@ -406,13 +410,13 @@ bool store_and_advance(char *&buffer_ptr, char *buffer_end, const T& value,
 ///
 /// This overload is selected when the value type of the iterator is
 /// integral. Rather than computing a hash_code for each object and then
 /// combining them, this (as an optimization) directly combines the integers.
 template <typename InputIteratorT>
 hash_code hash_combine_range_impl(InputIteratorT first, InputIteratorT last) {
-  const size_t seed = get_execution_seed();
+  const uint64_t seed = get_execution_seed();
   char buffer[64], *buffer_ptr = buffer;
   char *const buffer_end = buffer_ptr + array_lengthof(buffer);
   while (first != last && store_and_advance(buffer_ptr, buffer_end,
                                             get_hashable_data(*first)))
     ++first;
   if (first == last)
@@ -450,21 +454,21 @@ hash_code hash_combine_range_impl(InputIteratorT first, InputIteratorT last) {
 /// optimization) directly combines the integers. Also, because the integers
 /// are stored in contiguous memory, this routine avoids copying each value
 /// and directly reads from the underlying memory.
 template <typename ValueT>
 typename enable_if<is_hashable_data<ValueT>, hash_code>::type
 hash_combine_range_impl(ValueT *first, ValueT *last) {
-  const size_t seed = get_execution_seed();
+  const uint64_t seed = get_execution_seed();
   const char *s_begin = reinterpret_cast<const char *>(first);
   const char *s_end = reinterpret_cast<const char *>(last);
   const size_t length = std::distance(s_begin, s_end);
   if (length <= 64)
     return hash_short(s_begin, length, seed);
 
   const char *s_aligned_end = s_begin + (length & ~63);
-  hash_state state = state.create(s_begin, seed);
+  hash_state state = hash_state::create(s_begin, seed);
   s_begin += 64;
   while (s_begin != s_aligned_end) {
     state.mix(s_begin);
     s_begin += 64;
   }
   if (length & 63)
@@ -500,13 +504,13 @@ namespace detail {
 /// recursive combining of arguments used in hash_combine. It is particularly
 /// useful at minimizing the code in the recursive calls to ease the pain
 /// caused by a lack of variadic functions.
 struct hash_combine_recursive_helper {
   char buffer[64];
   hash_state state;
-  const size_t seed;
+  const uint64_t seed;
 
 public:
   /// \brief Construct a recursive hash combining helper.
   ///
   /// This sets up the state for a recursive hash combine, including getting
   /// the seed and buffer setup.
@@ -733,13 +737,15 @@ inline hash_code hash_integer_value(uint64_t value) {
 
 // Declared and documented above, but defined here so that any of the hashing
 // infrastructure is available.
 template <typename T>
 typename enable_if<is_integral_or_enum<T>, hash_code>::type
 hash_value(T value) {
-  return ::llvm::hashing::detail::hash_integer_value(value);
+  // the cast is necessary for strong enums and avoids compiler warnings
+  const uint64_t n = static_cast<uint64_t>(value);
+  return ::llvm::hashing::detail::hash_integer_value(n);
 }
 
 // Declared and documented above, but defined here so that any of the hashing
 // infrastructure is available.
 template <typename T> hash_code hash_value(const T *ptr) {
   return ::llvm::hashing::detail::hash_integer_value(
diff --git a/lib/Support/Hashing.cpp b/lib/Support/Hashing.cpp
index c69efb7..5fa4546 100644
--- a/lib/Support/Hashing.cpp
+++ b/lib/Support/Hashing.cpp
@@ -17,13 +17,12 @@
 
 using namespace llvm;
 
 // Provide a definition and static initializer for the fixed seed. This
 // initializer should always be zero to ensure its value can never appear to be
 // non-zero, even during dynamic initialization.
-size_t llvm::hashing::detail::fixed_seed_override = 0;
+uint64_t llvm::hashing::detail::fixed_seed_override = 0;
 
 // Implement the function for forced setting of the fixed seed.
-// FIXME: Use atomic operations here so that there is no data race.
-void llvm::set_fixed_execution_hash_seed(size_t fixed_value) {
+void llvm::set_fixed_execution_hash_seed(uint64_t fixed_value) {
   hashing::detail::fixed_seed_override = fixed_value;
 }
diff --git a/unittests/ADT/HashingTest.cpp b/unittests/ADT/HashingTest.cpp
index 1b3d061..d14bd5d 100644
--- a/unittests/ADT/HashingTest.cpp
+++ b/unittests/ADT/HashingTest.cpp
@@ -206,13 +206,12 @@ TEST(HashingTest, HashCombineRangeLengthDiff) {
     EXPECT_EQ(Idx, I->second);
   }
 }
 
 TEST(HashingTest, HashCombineRangeGoldenTest) {
   struct { const char *s; uint64_t hash; } golden_data[] = {
-#if SIZE_MAX == UINT64_MAX
     { "a",                                0xaeb6f9d5517c61f8ULL },
     { "ab",                               0x7ab1edb96be496b4ULL },
     { "abc",                              0xe38e60bf19c71a3fULL },
     { "abcde",                            0xd24461a66de97f6eULL },
     { "abcdefgh",                         0x4ef872ec411dec9dULL },
     { "abcdefghijklm",                    0xe8a865539f4eadfeULL },
@@ -259,68 +258,12 @@ TEST(HashingTest, HashCombineRangeGoldenTest) {
     { "abababababababababababababababab", 0xa75cd6afbf1bc972ULL },
     { "abababababababababababababababab"
       "abababababababababababababababab"
       "abababababababababababababababab"
       "abababababababababababababababab"
       "abababababababababababababababab", 0x840192d129f7a22bULL }
-#elif SIZE_MAX == UINT32_MAX
-    { "a",                                0x000000004605f745ULL },
-    { "ab",                               0x00000000d5f06301ULL },
-    { "abc",                              0x00000000559fe1eeULL },
-    { "abcde",                            0x00000000424028d7ULL },
-    { "abcdefgh",                         0x000000007bb119f8ULL },
-    { "abcdefghijklm",                    0x00000000edbca513ULL },
-    { "abcdefghijklmnopqrstu",            0x000000007c15712eULL },
-    { "abcdefghijklmnopqrstuvwxyzabcdef", 0x000000000b3aad66ULL },
-    { "abcdefghijklmnopqrstuvwxyzabcdef"
-      "abcdefghijklmnopqrstuvwxyzghijkl"
-      "abcdefghijklmnopqrstuvwxyzmnopqr"
-      "abcdefghijklmnopqrstuvwxyzstuvwx"
-      "abcdefghijklmnopqrstuvwxyzyzabcd", 0x000000008c758c8bULL },
-    { "a",                                0x000000004605f745ULL },
-    { "aa",                               0x00000000dc0a52daULL },
-    { "aaa",                              0x00000000b309274fULL },
-    { "aaaaa",                            0x00000000203b5ef6ULL },
-    { "aaaaaaaa",                         0x00000000a429e18fULL },
-    { "aaaaaaaaaaaaa",                    0x000000008662070bULL },
-    { "aaaaaaaaaaaaaaaaaaaaa",            0x000000003f11151cULL },
-    { "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", 0x000000008600fe20ULL },
-    { "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
-      "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
-      "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
-      "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
-      "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", 0x000000004e0e0804ULL },
-    { "z",                                0x00000000c5e405e9ULL },
-    { "zz",                               0x00000000a8d8a2c6ULL },
-    { "zzz",                              0x00000000fc2af672ULL },
-    { "zzzzz",                            0x0000000047d9efe6ULL },
-    { "zzzzzzzz",                         0x0000000080d77794ULL },
-    { "zzzzzzzzzzzzz",                    0x00000000405f93adULL },
-    { "zzzzzzzzzzzzzzzzzzzzz",            0x00000000fc72838dULL },
-    { "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz", 0x000000007ce160f1ULL },
-    { "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"
-      "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"
-      "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"
-      "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"
-      "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz", 0x00000000aed9ed1bULL },
-    { "a",                                0x000000004605f745ULL },
-    { "ab",                               0x00000000d5f06301ULL },
-    { "aba",                              0x00000000a85cd91bULL },
-    { "ababa",                            0x000000009e3bb52eULL },
-    { "abababab",                         0x000000002709b3b9ULL },
-    { "ababababababa",                    0x000000003a234174ULL },
-    { "ababababababababababa",            0x000000005c63e5ceULL },
-    { "abababababababababababababababab", 0x0000000013f74334ULL },
-    { "abababababababababababababababab"
-      "abababababababababababababababab"
-      "abababababababababababababababab"
-      "abababababababababababababababab"
-      "abababababababababababababababab", 0x00000000c1a6f135ULL },
-#else
-#error This test only supports 64-bit and 32-bit systems.
-#endif
   };
   for (unsigned i = 0; i < sizeof(golden_data)/sizeof(*golden_data); ++i) {
     StringRef str = golden_data[i].s;
     hash_code hash = hash_combine_range(str.begin(), str.end());
 #if 0 // Enable this to generate paste-able text for the above structure.
     std::string member_str = "\"" + str.str() + "\",";
-- 
1.7.11.msysgit.1

-------------- next part --------------
From b2976da4fc38c9fb41498915af3d6eee8bac7267 Mon Sep 17 00:00:00 2001
From: Stephan Tolksdorf <st at quanttec.com>
Date: Mon, 3 Feb 2014 21:25:01 +0100
Subject: [PATCH 2/3] Implement llvm::sys::Process::GetRandomNumber() on
 Windows

This function was already implemented for Unix. The implementation uses the
rand_s function, which according to MSDN is available on Windows XP and later.
libc++ also uses rand_s for implementing random_device. I've checked
that mingw-w64 supports rand_s.
---
 lib/Support/Process.cpp           | 4 ++++
 lib/Support/Windows/Process.inc   | 7 +++++++
 unittests/Support/ProcessTest.cpp | 6 ++++++
 3 files changed, 17 insertions(+)

diff --git a/lib/Support/Process.cpp b/lib/Support/Process.cpp
index d5168f0..009c7c8 100644
--- a/lib/Support/Process.cpp
+++ b/lib/Support/Process.cpp
@@ -9,12 +9,16 @@
 //
 //  This header file implements the operating system Process concept.
 //
 //===----------------------------------------------------------------------===//
 
 #include "llvm/Config/config.h"
+#if LLVM_ON_WIN32
+# define _CRT_RAND_S // makes stdlib.h declare the rand_s function
+# include <stdlib.h>
+#endif
 #include "llvm/Support/ErrorHandling.h"
 #include "llvm/Support/Process.h"
 
 using namespace llvm;
 using namespace sys;
 
diff --git a/lib/Support/Windows/Process.inc b/lib/Support/Windows/Process.inc
index 750097e..73ce97a 100644
--- a/lib/Support/Windows/Process.inc
+++ b/lib/Support/Windows/Process.inc
@@ -357,6 +357,13 @@ const char *Process::OutputReverse() {
 
 const char *Process::ResetColor() {
   if (UseANSI) return "\033[0m";
   SetConsoleTextAttribute(GetStdHandle(STD_OUTPUT_HANDLE), defaultColors());
   return 0;
 }
+
+unsigned Process::GetRandomNumber() {
+  unsigned result;
+  const errno_t ec = rand_s(&result); // supported in Win XP and later
+  assert(ec == 0 && "rand_s failed");
+  return result;
+}
diff --git a/unittests/Support/ProcessTest.cpp b/unittests/Support/ProcessTest.cpp
index af6a6f9..387d5d3 100644
--- a/unittests/Support/ProcessTest.cpp
+++ b/unittests/Support/ProcessTest.cpp
@@ -36,12 +36,18 @@ TEST(ProcessTest, SelfProcess) {
   EXPECT_LT(TimeValue::MinTime, process::get_self()->get_system_time());
   EXPECT_GT(TimeValue::MaxTime, process::get_self()->get_system_time());
   EXPECT_LT(TimeValue::MinTime, process::get_self()->get_wall_time());
   EXPECT_GT(TimeValue::MaxTime, process::get_self()->get_wall_time());
 }
 
+TEST(ProcessTest, GetRandomNumberTest) {
+  const unsigned r1 = Process::GetRandomNumber();
+  const unsigned r2 = Process::GetRandomNumber();
+  EXPECT_NE((r1 | r2), 0); // 0 should be extremely unlikely
+}
+
 #ifdef _MSC_VER
 #define setenv(name, var, ignore) _putenv_s(name, var)
 #endif
 
 #if HAVE_SETENV || _MSC_VER
 TEST(ProcessTest, Basic) {
-- 
1.7.11.msysgit.1



More information about the llvm-dev mailing list