[PATCH] D42740: Implement a case-folding version of DJB hash

Thu Feb 1 06:18:42 PST 2018

labath marked 3 inline comments as done.
labath added a comment.

In https://reviews.llvm.org/D42740#993464, @aprantl wrote:

> Is this function something that needs to be updated with every new version of Unicode? Should we add a "How to update this for newer versions" comment?

That's a good question. The DWARF spec specifically cites the version 9 of the Unicode Standard. This was the latest version at the time dwarf v5 was published, so it's not fully clear whether they meant to hard-code the version or they just meant "the latest".

Currently, the latest version is v10 that came out in June 2017. However, it has no changes in the case folding database, so we don't have to figure this out yet.

For the future, it's not really clear to me what to do here. The Unicode spec does offer some guarantees about the backwards compatibility in regards to the case folding database (*), so updating the mappings should be kinda safe, because any new characters they add are the ones that would be invalid according to the previous specifications. OTOH, the dwarf spec is pretty explicit about the version we should use.

(*) Unicode v10, Section 5.18, "Case Mappings", page 243 http://www.unicode.org/versions/Unicode10.0.0/ch05.pdf

> Stability. The definition of case folding is guaranteed to be stable, in that any string of
>  characters case folded according to these rules will remain case folded in Version 5.0 or later
>  of the Unicode Standard. To achieve this stability, there are constraints on additions of case
>  pairs for existing encoded characters. Typically, no new lowercase character will be added
>  to the Unicode Standard as a casing pair of an existing upper- or titlecase character that
>  does not already have a lowercase pair. In exceptional circumstances, where lowercase characters
>  must be added to the standard in a later version than the version in which the corresponding
>  uppercase characters were encoded, such lowercase characters can only be
>  defined as new case pairs with a corresponding change to case folding to ensure that they
>  case fold to the old uppercase letters. See the subsection “Policies” in Section B.3, Other Unicode
>  Online Resources.

================
Comment at: lib/Support/UnicodeCaseFold.cpp:1
+#include "llvm/Support/Unicode.h"
+
----------------
probinson wrote:
> This file should have a header.  If we expect it will always be a generated file, it should have a comment saying so and where to look for how to regenerate it.  If not, it should have a normal LLVM module header.
I think this will always remain an autogenerated file. All tweaks to the implementation would probably be easier made by modifying the script and regenerating.

I've added a header which explains this and how to regenerate it. Let me know what you think.

================
Comment at: unittests/Support/DJBTest.cpp:16
+
+TEST(DJBTest, caseFoldingDjbHash) {
+  struct TestCase {
----------------
dblaikie wrote:
> Given that this is a specific hash with a guaranteed result (for portability - debuggers are expected to have their own implementation of this hash that would produce the same values, etc) - should this have some tests for the actual hash values of some non-trivial text?
> 
> (also this test doesn't test for any hash values differing - the whole test would pass if the hash function returned a single constant for all inputs, right?)
Good point, I've added more tests, which compare hashes against fixed constants.

Repository:
  rL LLVM

https://reviews.llvm.org/D42740