[PATCH] D42740: Implement a case-folding version of DJB hash

Pavel Labath via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Jan 31 08:07:40 PST 2018


labath created this revision.
labath added reviewers: JDevlieghere, aprantl, probinson, dblaikie.
Herald added subscribers: hintonda, mgorny.

This patch implements a variant of the DJB hash function which folds the
input according to the algorithm in the DWARF 5 specification (Section
6.1.1.4.5), which in turn references the Unicode Standard (Section 5.18,
"Case Mappings").
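
As a rough illustration of the idea (not the code in this patch), a
case-folding DJB hash can be sketched as "fold every code point, encode
the result as UTF-8, then run the plain DJB hash over the bytes". The
names below are hypothetical, the seed 5381 and multiplier 33 are the
classic DJB constants, and the default fold function only handles
US-ASCII; a real implementation would use the full simple case folding
data.

```python
def djb_hash(data: bytes, h: int = 5381) -> int:
    """Classic DJB hash, truncated to 32 bits."""
    for byte in data:
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


def fold_ascii_only(cp: int) -> int:
    """Placeholder fold: maps A-Z to a-z and leaves everything else alone.

    This stands in for a full simple-case-folding function (an assumption
    made to keep the sketch short).
    """
    return cp + 0x20 if 0x41 <= cp <= 0x5A else cp


def case_folding_djb_hash(name: str, fold=fold_ascii_only) -> int:
    """Fold each code point, re-encode as UTF-8, then hash the bytes."""
    folded = "".join(chr(fold(ord(ch))) for ch in name)
    return djb_hash(folded.encode("utf-8"))
```

With this shape, two names that differ only by case hash to the same
value, which is what a case-insensitive lookup table needs.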

To achieve this, I have added a llvm::sys::unicode::foldCharSimple
function, which performs this mapping. The implementation of this
function was generated from the CaseFolding.txt file from the Unicode
spec using a Python script (which is also included in this patch). The
script tries to optimize the function by coalescing adjacent mappings
with the same shift and stride (terms I made up). Theoretically, it
could be made a bit smarter and merge adjacent blocks that were
interrupted by only one or two characters with exceptional mappings, but
this would save only a couple of branches while greatly complicating
the implementation, so I deemed it not worth it.
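
A minimal sketch of what such coalescing could look like, assuming
"shift" means the difference between a character and its folded form
and "stride" means the distance between consecutive mapped characters
in a block. This is one reading of those terms, not the script from the
patch, and the function names are hypothetical.

```python
def coalesce(mappings):
    """Greedily merge (src, dst) pairs into (first, last, stride, shift) blocks.

    A pair joins the current block when it has the same shift (dst - src)
    and continues the block's stride (gap between consecutive sources).
    Sources are assumed to be unique.
    """
    blocks = []
    for src, dst in sorted(mappings):
        shift = dst - src
        if blocks:
            first, last, stride, blk_shift = blocks[-1]
            gap = src - last
            # A one-entry block has no stride yet and adopts the first gap.
            if blk_shift == shift and stride in (None, gap):
                blocks[-1] = (first, src, gap, shift)
                continue
        blocks.append((src, src, None, shift))
    # One-entry blocks never picked a stride; 1 is a harmless default.
    return [(f, l, s or 1, sh) for f, l, s, sh in blocks]


def fold(cp, blocks):
    """Apply the coalesced blocks: one range check per block."""
    for first, last, stride, shift in blocks:
        if first <= cp <= last and (cp - first) % stride == 0:
            return cp + shift
    return cp


# A-Z fold with shift 32 and stride 1; U+0100, U+0102, U+0104 fold with
# shift 1 and stride 2 (the lowercase forms sit in between).
sample = [(0x41 + i, 0x61 + i) for i in range(26)]
sample += [(0x100, 0x101), (0x102, 0x103), (0x104, 0x105)]
blocks = coalesce(sample)
```

Here 29 individual mappings collapse into two blocks, so the generated
function needs two range checks instead of 29 comparisons.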

Since we assume that the vast majority of the input characters will be
US-ASCII, the folding hash function has a fast path for handling these,
and only whips out the full decode+fold+encode logic if we encounter a
character outside of this range. It might be possible to implement the
folding directly on UTF-8 sequences, but this would also bring a lot of
complexity for the few cases where we will actually need to process
non-ASCII characters.
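
The fast-path/slow-path split can be sketched roughly as follows. This
is an illustration under assumptions, not the code from the patch: the
input is taken to be valid UTF-8, the function names are hypothetical,
and TOY_FOLD is a two-entry stand-in for the real folding data.

```python
# Toy fold table for non-ASCII code points (an illustrative subset only):
# U+00C9 -> U+00E9 and U+03A3 -> U+03C3.
TOY_FOLD = {0x00C9: 0x00E9, 0x03A3: 0x03C3}


def utf8_len(lead: int) -> int:
    """Length of a multi-byte UTF-8 sequence, from its lead byte.

    Assumes the input is valid UTF-8 and the lead byte is >= 0x80.
    """
    if lead >= 0xF0:
        return 4
    if lead >= 0xE0:
        return 3
    return 2


def folding_djb_hash(data: bytes, h: int = 5381) -> int:
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:
            # Fast path: US-ASCII is a single byte, folding is A-Z -> a-z.
            if 0x41 <= byte <= 0x5A:
                byte += 0x20
            h = (h * 33 + byte) & 0xFFFFFFFF
            i += 1
            continue
        # Slow path: decode one code point, fold it, re-encode, hash bytes.
        n = utf8_len(byte)
        cp = ord(data[i:i + n].decode("utf-8"))
        for b in chr(TOY_FOLD.get(cp, cp)).encode("utf-8"):
            h = (h * 33 + b) & 0xFFFFFFFF
        i += n
    return h
```

For pure-ASCII names the loop never leaves the first branch, so the
decode and encode steps cost nothing in the common case.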


Repository:
  rL LLVM

https://reviews.llvm.org/D42740

Files:
  include/llvm/Support/DJB.h
  include/llvm/Support/Unicode.h
  lib/Support/CMakeLists.txt
  lib/Support/DJB.cpp
  lib/Support/UnicodeCaseFold.cpp
  unittests/Support/CMakeLists.txt
  unittests/Support/DJBTest.cpp
  utils/dwarf-case-fold.py

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D42740.132178.patch
Type: text/x-patch
Size: 25388 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20180131/dad2075e/attachment.bin>

