[cfe-dev] [libc++] ext/hash function

Sun Jul 31 10:44:16 PDT 2011

I would like to get everyone's opinion on how ext/hash_set/map should work.

Background:  __gnu_cxx::hash_map, __gnu_cxx::hash_multimap, __gnu_cxx::hash_set, and __gnu_cxx::hash_multiset are extension containers introduced to ease migration from gcc's libstdc++.  They are not complete re-implementations of the hash containers.  They are simply adaptors around the same std::__hash_table that powers the std::unordered containers.  They are *similar to* the __gnu_cxx hash containers.

Clarification:  I don't really want to discuss whether these extensions should exist or not.  That decision has already been made, and it can not be reversed at this point.

What I would like to discuss is:  how similar should they be made to the the __gnu_cxx hash containers?  The __gnu_cxx hash containers were purposefully not standardized without change because the committee saw defects in their behavior.  Should our "migration adaptors" go to extra effort to replicate such defects in the __gnu_cxx hash containers?

For example, Matt Austern writes in N1456 (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2003/n1456.html):

> Some earlier hash table implementations gave char* special treatment: it specialized the default hash function to look at character array being pointed to, rather than the pointer itself. This proposal removes that special treatment. Special treatment makes it slightly easier to use hash tables for C string, but at the cost of removing uniformity and making it harder to write generic code. Since naive users would generally be expected to use std::basic_string instead of C strings, the cost of special treatment outweighs the benefit.

And indeed, std::hash<char*> today is a direct result of Matt's observation from 8 years ago.  The char* is hashed as a pointer, not as a null terminated C string.

The actual __gnu_cxx hash containers however /did/ hash a char* as a null terminated null C string.

Originally the libc++ adaptors which emulate the __gnu_cxx hash containers simply reused std::hash<char*>, and thus hashed the Matt proposed so long ago.  However recently we've changed our adaptors to more closely emulate the actual behavior of the __gnu_cxx hash containers.  This can be detected (for example) with this program:

#include <ext/hash_set>

int main()
{
    __gnu_cxx::hash_set<char*> s;
    s.insert(nullptr);
}

If we have the adaptors just use std::hash, this program runs ok.  If we emulate the old __gnu_cxx::hash function, this program crashes (which is an accurate emulation).

This one also may crash spuriously:

#include <ext/hash_set>

char message[2] = {'a', 'b'};

int main()
{
    __gnu_cxx::hash_set<char*> s;
    s.insert(message);
}

(lack of null termination)

For those customers of ours who use these adaptors, which is more valuable:  A little more safety, or a little more accurate emulation?

I should point out that to create a completely accurate emulation will require a completely new implementation.  std::__hash_table is a completely different data structure than that underlying the actual __gnu_cxx hash containers.  For example:

#include <ext/hash_set>
#include <iostream>

int main()
{
    __gnu_cxx::hash_set<int> s;
    s.insert(3);
    s.insert(7);
    s.insert(1);
    s.insert(5);
    for (__gnu_cxx::hash_set<int>::const_iterator i = s.begin(); i != s.end(); ++i)
        std::cout << *i << '\n';
}

libc++ outputs:

5
1
7
3

libstdc++ outputs:

1
3
5
7

(both libs use the identity for hash<int>, so that isn't the cause of this difference)

Thanks for any discussion/opinions.

Howard