[LLVMdev] ADT/Hashing.h on 32-bit platforms
Stephan Tolksdorf
st at quanttec.com
Mon Feb 3 13:46:15 PST 2014
On 02.02.14 00:48, Chandler Carruth wrote:
> On Sat, Feb 1, 2014 at 8:35 AM, Stephan Tolksdorf <st at quanttec.com
> <mailto:st at quanttec.com>> wrote:
>
> Hi,
>
> Currently the hashing implementation in ADT/Hashing.h produces hash
> values on 32-bit platforms that differ from the lower 32-bits of the
> hash values produced on 64-bit platforms. It seems the only reason
> for this difference is that the uint64_t integer seed is truncated
> to size_t. Since the usage of uint64_t and size_t as types for seed
> values in the implementation is somewhat inconsistent, I'm
> wondering: Was this difference deliberately introduced as a
> performance optimization for 32-bit code, or is this maybe just an
> implementation artifact?
>
> I also noted that the implementation relies on implicit conversions
> from uint64_t to size_t in several places where hash_code values are
> returned, which may trigger compiler warnings on 32-bit platforms.
>
>
> Almost all of this was largely incidental on my part. I didn't spend a
> lot of time tuning the 32-bit platform performance. If you have ideas
> that you think would make things better, I'm certainly interested. =D
>
> One thing I would note is that I would prefer to retain the collision
> resistance where possible. So I would be leery of reducing the internal
> state used. I would more focus on making the outputs friendly for 32-bit
> platforms and the inputs consistent.
I've attached a patch that makes the hashing implementation consistently
use a 64-bit seed everywhere. With this change the implementation
should produce the same hash codes modulo SIZE_MAX + 1 for identical
values independent of the platform. (Though hash_combine may still
produce different results if the size of an argument type differs
between platforms). I suspect the negative performance impact on 32-bit
platforms should be small, but I didn't do any benchmarking. With
atomics one could probably replace the thread safe local static in
get_execution_seed with something that has a little less overhead.
The patch also removes a FIXME from the set_fixed_execution_seed
implementation and rewords the documentation string a bit, since using
this function in a multi-threaded program requires some kind of external
synchronization anyway.
Another attached patch implements llvm::sys::Process::GetRandomNumber()
on Windows. (There's already a Unix implementation.) This function could
be useful for implementing the missing random per-execution seed
functionality. (Though it might take a little care to deal with
Process::GetRandomNumberSeed potentially calling back into hash_combine
on Unix.)
In a little experiment I made, changing the seed per execution seemed to
lead to test suite errors for clang's PCH support and other components.
- Stephan
-------------- next part --------------
From 2a7548a5dd0c1a63e3b8b15273beab65cb8e139a Mon Sep 17 00:00:00 2001
From: Stephan Tolksdorf <st at quanttec.com>
Date: Mon, 3 Feb 2014 21:25:00 +0100
Subject: [PATCH 1/3] Use 64-bit hash seed on 32-bit platforms
With this change the hash_value funtion should produce the same results
(modulo MAX_SIZE + 1) for integers on all platforms.
---
include/llvm/ADT/Hashing.h | 46 +++++++++++++++++++---------------
lib/Support/Hashing.cpp | 5 ++--
unittests/ADT/HashingTest.cpp | 57 -------------------------------------------
3 files changed, 28 insertions(+), 80 deletions(-)
diff --git a/include/llvm/ADT/Hashing.h b/include/llvm/ADT/Hashing.h
index e434417..4976808 100644
--- a/include/llvm/ADT/Hashing.h
+++ b/include/llvm/ADT/Hashing.h
@@ -65,14 +65,14 @@
namespace llvm {
/// \brief An opaque object representing a hash code.
///
/// This object represents the result of hashing some entity. It is intended to
/// be used to implement hashtables or other hashing-based data structures.
-/// While it wraps and exposes a numeric value, this value should not be
-/// trusted to be stable or predictable across processes or executions.
+/// While it wraps and exposes a numeric value of type size_t, this value should
+/// not be trusted to be stable or predictable across processes or executions.
///
/// In order to obtain the hash_code for an object 'x':
/// \code
/// using llvm::hash_value;
/// llvm::hash_code code = hash_value(x);
/// \endcode
@@ -82,13 +82,15 @@ class hash_code {
public:
/// \brief Default construct a hash_code.
/// Note that this leaves the value uninitialized.
hash_code() {}
/// \brief Form a hash code directly from a numerical value.
- hash_code(size_t value) : value(value) {}
+ ///
+ /// The argument is casted to and stored as a size_t value.
+ hash_code(uint64_t value) : value(static_cast<size_t>(value)) {}
/// \brief Convert the hash code to its numerical value for use.
/*explicit*/ operator size_t() const { return value; }
friend bool operator==(const hash_code &lhs, const hash_code &rhs) {
return lhs.value == rhs.value;
@@ -125,24 +127,26 @@ template <typename T>
hash_code hash_value(const std::basic_string<T> &arg);
/// \brief Override the execution seed with a fixed value.
///
/// This hashing library uses a per-execution seed designed to change on each
-/// run with high probability in order to ensure that the hash codes are not
-/// attackable and to ensure that output which is intended to be stable does
-/// not rely on the particulars of the hash codes produced.
+/// run with high probability in order to improve the protection against hash
+/// collision DoS attacks and to ensure that output which is intended to be
+/// stable does not rely on the particulars of the hash codes produced.
///
/// That said, there are use cases where it is important to be able to
/// reproduce *exactly* a specific behavior. To that end, we provide a function
/// which will forcibly set the seed to a fixed value. This must be done at the
-/// start of the program, before any hashes are computed. Also, it cannot be
-/// undone. This makes it thread-hostile and very hard to use outside of
-/// immediately on start of a simple program designed for reproducible
-/// behavior.
-void set_fixed_execution_hash_seed(size_t fixed_value);
+/// start of the program, before any hashes are computed. Also, it can only be
+/// done once and not be undone. This makes it very hard to use outside of
+/// immediately on start of a simple program designed for reproducible behavior.
+///
+/// This function is not thread safe. A call to this function must be externally
+/// synchronized such that it happens before any call to a hashing function.
+void set_fixed_execution_hash_seed(uint64_t fixed_value);
// All of the implementation details of actually computing the various hash
// code values are held within this namespace. These routines are included in
// the header file mainly to allow inlining and constant propagation.
namespace hashing {
@@ -320,24 +324,24 @@ struct hash_state {
/// \brief A global, fixed seed-override variable.
///
/// This variable can be set using the \see llvm::set_fixed_execution_seed
/// function. See that function for details. Do not, under any circumstances,
/// set or read this variable.
-extern size_t fixed_seed_override;
+extern uint64_t fixed_seed_override;
-inline size_t get_execution_seed() {
+inline uint64_t get_execution_seed() {
// FIXME: This needs to be a per-execution seed. This is just a placeholder
// implementation. Switching to a per-execution seed is likely to flush out
// instability bugs and so will happen as its own commit.
//
// However, if there is a fixed seed override set the first time this is
// called, return that instead of the per-execution seed.
const uint64_t seed_prime = 0xff51afd7ed558ccdULL;
- static size_t seed = fixed_seed_override ? fixed_seed_override
- : (size_t)seed_prime;
+ static uint64_t seed = fixed_seed_override ? fixed_seed_override
+ : seed_prime;
return seed;
}
/// \brief Trait to indicate whether a type's bits can be hashed directly.
///
@@ -406,13 +410,13 @@ bool store_and_advance(char *&buffer_ptr, char *buffer_end, const T& value,
///
/// This overload is selected when the value type of the iterator is
/// integral. Rather than computing a hash_code for each object and then
/// combining them, this (as an optimization) directly combines the integers.
template <typename InputIteratorT>
hash_code hash_combine_range_impl(InputIteratorT first, InputIteratorT last) {
- const size_t seed = get_execution_seed();
+ const uint64_t seed = get_execution_seed();
char buffer[64], *buffer_ptr = buffer;
char *const buffer_end = buffer_ptr + array_lengthof(buffer);
while (first != last && store_and_advance(buffer_ptr, buffer_end,
get_hashable_data(*first)))
++first;
if (first == last)
@@ -450,21 +454,21 @@ hash_code hash_combine_range_impl(InputIteratorT first, InputIteratorT last) {
/// optimization) directly combines the integers. Also, because the integers
/// are stored in contiguous memory, this routine avoids copying each value
/// and directly reads from the underlying memory.
template <typename ValueT>
typename enable_if<is_hashable_data<ValueT>, hash_code>::type
hash_combine_range_impl(ValueT *first, ValueT *last) {
- const size_t seed = get_execution_seed();
+ const uint64_t seed = get_execution_seed();
const char *s_begin = reinterpret_cast<const char *>(first);
const char *s_end = reinterpret_cast<const char *>(last);
const size_t length = std::distance(s_begin, s_end);
if (length <= 64)
return hash_short(s_begin, length, seed);
const char *s_aligned_end = s_begin + (length & ~63);
- hash_state state = state.create(s_begin, seed);
+ hash_state state = hash_state::create(s_begin, seed);
s_begin += 64;
while (s_begin != s_aligned_end) {
state.mix(s_begin);
s_begin += 64;
}
if (length & 63)
@@ -500,13 +504,13 @@ namespace detail {
/// recursive combining of arguments used in hash_combine. It is particularly
/// useful at minimizing the code in the recursive calls to ease the pain
/// caused by a lack of variadic functions.
struct hash_combine_recursive_helper {
char buffer[64];
hash_state state;
- const size_t seed;
+ const uint64_t seed;
public:
/// \brief Construct a recursive hash combining helper.
///
/// This sets up the state for a recursive hash combine, including getting
/// the seed and buffer setup.
@@ -733,13 +737,15 @@ inline hash_code hash_integer_value(uint64_t value) {
// Declared and documented above, but defined here so that any of the hashing
// infrastructure is available.
template <typename T>
typename enable_if<is_integral_or_enum<T>, hash_code>::type
hash_value(T value) {
- return ::llvm::hashing::detail::hash_integer_value(value);
+ // the cast is necessary for strong enums and avoids compiler warnings
+ const uint64_t n = static_cast<uint64_t>(value);
+ return ::llvm::hashing::detail::hash_integer_value(n);
}
// Declared and documented above, but defined here so that any of the hashing
// infrastructure is available.
template <typename T> hash_code hash_value(const T *ptr) {
return ::llvm::hashing::detail::hash_integer_value(
diff --git a/lib/Support/Hashing.cpp b/lib/Support/Hashing.cpp
index c69efb7..5fa4546 100644
--- a/lib/Support/Hashing.cpp
+++ b/lib/Support/Hashing.cpp
@@ -17,13 +17,12 @@
using namespace llvm;
// Provide a definition and static initializer for the fixed seed. This
// initializer should always be zero to ensure its value can never appear to be
// non-zero, even during dynamic initialization.
-size_t llvm::hashing::detail::fixed_seed_override = 0;
+uint64_t llvm::hashing::detail::fixed_seed_override = 0;
// Implement the function for forced setting of the fixed seed.
-// FIXME: Use atomic operations here so that there is no data race.
-void llvm::set_fixed_execution_hash_seed(size_t fixed_value) {
+void llvm::set_fixed_execution_hash_seed(uint64_t fixed_value) {
hashing::detail::fixed_seed_override = fixed_value;
}
diff --git a/unittests/ADT/HashingTest.cpp b/unittests/ADT/HashingTest.cpp
index 1b3d061..d14bd5d 100644
--- a/unittests/ADT/HashingTest.cpp
+++ b/unittests/ADT/HashingTest.cpp
@@ -206,13 +206,12 @@ TEST(HashingTest, HashCombineRangeLengthDiff) {
EXPECT_EQ(Idx, I->second);
}
}
TEST(HashingTest, HashCombineRangeGoldenTest) {
struct { const char *s; uint64_t hash; } golden_data[] = {
-#if SIZE_MAX == UINT64_MAX
{ "a", 0xaeb6f9d5517c61f8ULL },
{ "ab", 0x7ab1edb96be496b4ULL },
{ "abc", 0xe38e60bf19c71a3fULL },
{ "abcde", 0xd24461a66de97f6eULL },
{ "abcdefgh", 0x4ef872ec411dec9dULL },
{ "abcdefghijklm", 0xe8a865539f4eadfeULL },
@@ -259,68 +258,12 @@ TEST(HashingTest, HashCombineRangeGoldenTest) {
{ "abababababababababababababababab", 0xa75cd6afbf1bc972ULL },
{ "abababababababababababababababab"
"abababababababababababababababab"
"abababababababababababababababab"
"abababababababababababababababab"
"abababababababababababababababab", 0x840192d129f7a22bULL }
-#elif SIZE_MAX == UINT32_MAX
- { "a", 0x000000004605f745ULL },
- { "ab", 0x00000000d5f06301ULL },
- { "abc", 0x00000000559fe1eeULL },
- { "abcde", 0x00000000424028d7ULL },
- { "abcdefgh", 0x000000007bb119f8ULL },
- { "abcdefghijklm", 0x00000000edbca513ULL },
- { "abcdefghijklmnopqrstu", 0x000000007c15712eULL },
- { "abcdefghijklmnopqrstuvwxyzabcdef", 0x000000000b3aad66ULL },
- { "abcdefghijklmnopqrstuvwxyzabcdef"
- "abcdefghijklmnopqrstuvwxyzghijkl"
- "abcdefghijklmnopqrstuvwxyzmnopqr"
- "abcdefghijklmnopqrstuvwxyzstuvwx"
- "abcdefghijklmnopqrstuvwxyzyzabcd", 0x000000008c758c8bULL },
- { "a", 0x000000004605f745ULL },
- { "aa", 0x00000000dc0a52daULL },
- { "aaa", 0x00000000b309274fULL },
- { "aaaaa", 0x00000000203b5ef6ULL },
- { "aaaaaaaa", 0x00000000a429e18fULL },
- { "aaaaaaaaaaaaa", 0x000000008662070bULL },
- { "aaaaaaaaaaaaaaaaaaaaa", 0x000000003f11151cULL },
- { "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", 0x000000008600fe20ULL },
- { "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
- "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
- "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
- "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
- "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", 0x000000004e0e0804ULL },
- { "z", 0x00000000c5e405e9ULL },
- { "zz", 0x00000000a8d8a2c6ULL },
- { "zzz", 0x00000000fc2af672ULL },
- { "zzzzz", 0x0000000047d9efe6ULL },
- { "zzzzzzzz", 0x0000000080d77794ULL },
- { "zzzzzzzzzzzzz", 0x00000000405f93adULL },
- { "zzzzzzzzzzzzzzzzzzzzz", 0x00000000fc72838dULL },
- { "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz", 0x000000007ce160f1ULL },
- { "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"
- "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"
- "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"
- "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"
- "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz", 0x00000000aed9ed1bULL },
- { "a", 0x000000004605f745ULL },
- { "ab", 0x00000000d5f06301ULL },
- { "aba", 0x00000000a85cd91bULL },
- { "ababa", 0x000000009e3bb52eULL },
- { "abababab", 0x000000002709b3b9ULL },
- { "ababababababa", 0x000000003a234174ULL },
- { "ababababababababababa", 0x000000005c63e5ceULL },
- { "abababababababababababababababab", 0x0000000013f74334ULL },
- { "abababababababababababababababab"
- "abababababababababababababababab"
- "abababababababababababababababab"
- "abababababababababababababababab"
- "abababababababababababababababab", 0x00000000c1a6f135ULL },
-#else
-#error This test only supports 64-bit and 32-bit systems.
-#endif
};
for (unsigned i = 0; i < sizeof(golden_data)/sizeof(*golden_data); ++i) {
StringRef str = golden_data[i].s;
hash_code hash = hash_combine_range(str.begin(), str.end());
#if 0 // Enable this to generate paste-able text for the above structure.
std::string member_str = "\"" + str.str() + "\",";
--
1.7.11.msysgit.1
-------------- next part --------------
From b2976da4fc38c9fb41498915af3d6eee8bac7267 Mon Sep 17 00:00:00 2001
From: Stephan Tolksdorf <st at quanttec.com>
Date: Mon, 3 Feb 2014 21:25:01 +0100
Subject: [PATCH 2/3] Implement llvm::sys::Process::GetRandomNumber() on
Windows
This function was already implemented for Unix. The implementation uses the
rand_s function, which according to MSDN is available on Windows XP and later.
libc++ also uses rand_s for implementing random_device. I've checked
that mingw-w64 supports rand_s.
---
lib/Support/Process.cpp | 4 ++++
lib/Support/Windows/Process.inc | 7 +++++++
unittests/Support/ProcessTest.cpp | 6 ++++++
3 files changed, 17 insertions(+)
diff --git a/lib/Support/Process.cpp b/lib/Support/Process.cpp
index d5168f0..009c7c8 100644
--- a/lib/Support/Process.cpp
+++ b/lib/Support/Process.cpp
@@ -9,12 +9,16 @@
//
// This header file implements the operating system Process concept.
//
//===----------------------------------------------------------------------===//
#include "llvm/Config/config.h"
+#if LLVM_ON_WIN32
+# define _CRT_RAND_S // makes stdlib.h declare the rand_s function
+# include <stdlib.h>
+#endif
#include "llvm/Support/ErrorHandling.h"
#include "llvm/Support/Process.h"
using namespace llvm;
using namespace sys;
diff --git a/lib/Support/Windows/Process.inc b/lib/Support/Windows/Process.inc
index 750097e..73ce97a 100644
--- a/lib/Support/Windows/Process.inc
+++ b/lib/Support/Windows/Process.inc
@@ -357,6 +357,13 @@ const char *Process::OutputReverse() {
const char *Process::ResetColor() {
if (UseANSI) return "\033[0m";
SetConsoleTextAttribute(GetStdHandle(STD_OUTPUT_HANDLE), defaultColors());
return 0;
}
+
+unsigned Process::GetRandomNumber() {
+ unsigned result;
+ const errno_t ec = rand_s(&result); // supported in Win XP and later
+ assert(ec == 0 && "rand_s failed");
+ return result;
+}
diff --git a/unittests/Support/ProcessTest.cpp b/unittests/Support/ProcessTest.cpp
index af6a6f9..387d5d3 100644
--- a/unittests/Support/ProcessTest.cpp
+++ b/unittests/Support/ProcessTest.cpp
@@ -36,12 +36,18 @@ TEST(ProcessTest, SelfProcess) {
EXPECT_LT(TimeValue::MinTime, process::get_self()->get_system_time());
EXPECT_GT(TimeValue::MaxTime, process::get_self()->get_system_time());
EXPECT_LT(TimeValue::MinTime, process::get_self()->get_wall_time());
EXPECT_GT(TimeValue::MaxTime, process::get_self()->get_wall_time());
}
+TEST(ProcessTest, GetRandomNumberTest) {
+ const unsigned r1 = Process::GetRandomNumber();
+ const unsigned r2 = Process::GetRandomNumber();
+ EXPECT_NE((r1 | r2), 0); // 0 should be extremely unlikely
+}
+
#ifdef _MSC_VER
#define setenv(name, var, ignore) _putenv_s(name, var)
#endif
#if HAVE_SETENV || _MSC_VER
TEST(ProcessTest, Basic) {
--
1.7.11.msysgit.1
More information about the llvm-dev
mailing list