[PATCH] D133715: [ADT] Add HashMappedTrie

Wed Dec 7 17:46:10 PST 2022

dexonsmith resigned from this revision.
dexonsmith added a comment.

[Resigning as reviewer, since I was the original author, but feel free ping me explicitly if I can be helpful for something and I'm not volunteering anything...]

In D133715#3980058 <https://reviews.llvm.org/D133715#3980058>, @dblaikie wrote:

> Not sure I'll be the most detailed reviewer here, if anyone else wants to chime in.
>
> I've a generally understanding of what a trie is, but at my current level of detail looking at this code it's not entirely clear to me what this particular data structure is - maybe some more comments both in the source and in the patch description? simple example usage (in a test and/or in those comments/patch description) migth be helpful

Yeah, probably more docs here would be good. Some notes that Steven might be able to use as a starting point (although feel free to ignore this an do your own thing!):

At a high-level, this is a set/map data structure that supports fast thread-safe insertion/lookup and does not support erase. (IIRC, the interface similar to a `std::map`, where the key is a hash?)

A hash-mapped trie is a tree of arrays where the prefix of the hash is consumed as an "index" into the array at each level of the tree. IIRC, this data structure as implemented:

- is lock-free and thread-safe
- supports "insert" and "lookup"
- does NOT detect a hash collision; first insertion wins; you want a strong-enough hash for your data set that there are no collisions, or handle collisions in the `value_type` somehow
- does NOT support "erase"; IIRC, that'd be hard to do in a thread-safe way, but I didn't think hard about it
- does NOT support "iteration"; IIRC, it'd be easy to implement iteration that exposed the internal layout
- array sizes are configurable (root array can be a different size from sub-trie arrays)
- each slot in an array is either empty, a sub-trie, or a value

Insertion (and lookup) works basically like the following (skipping over details that make the lock-free-thead-safety work):

1. start with the root trie's array
2. convert a prefix of the hash into an index into the current array
3. if the slot is a value with the same hash, return the existing value
4. if the slot is a value with a different hash, create a new sub-trie that contains the existing value, put the sub-trie in the slot, and continue with step (5)
5. if the slot is a sub-trie, descend and go back to step (2) with the unused suffix of the hash on the sub-trie's array
6. if the slot is empty, insert and return the new value

================
Comment at: llvm/include/llvm/ADT/HashMappedTrie.h:156-160
+    static HashT makeHash(ArrayRef<uint8_t> HashRef) {
+      HashT Hash;
+      std::copy(HashRef.begin(), HashRef.end(), Hash.data());
+      return Hash;
+    }
----------------
dblaikie wrote:
> This sort of feels a bit confusing - why are we hashing a hash?
This is creating a `std::array<>` out of an `ArrayRef`. `std::array`'s constructors don't make this easy, so a wrapper function is helpful. Probably `copyHash` is a better name, or maybe there's a better way to do this!

================
Comment at: llvm/unittests/ADT/HashMappedTrieTest.cpp:71-75
+  // Dump out the trie so we can confirm the structure is correct. Each subtrie
+  // should have 2 slots. The root's index=0 should have the content for
+  // 0x37 directly, and index=1 should be a linked-list of subtries, finally
+  // ending with content for (max-2) and (max-3).
+  //
----------------
dblaikie wrote:
> Seems unfortunate to do significant amounts of testing via the stringification of an object - any chance this could be done in a more API-centric way?
This is a good point. It was hard to test this way, and it's probably hard to maintain the tests.

The motivating thought was to avoid adding APIs that exposed the layout so that clients couldn't inspect/rely on it (i.e., there are no APIs for iterating through all values). But maybe that is unnecessarily restrictive.

IIRC, it would be fairly easy to change `Trie::pointer` to a recursive iterator, or to expose an iterator that wrapped it, or something. `operator++` might be quite slow and/or be thread-unsafe, but maybe it's worth opening up just for the testing benefit.

(I'm happy either way; just chiming in in case I have more context about motivation for the currently-sparse API.)

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D133715/new/

https://reviews.llvm.org/D133715