[llvm] [CAS] Add OnDiskTrieRawHashMap (PR #114100)
Steven Wu via llvm-commits
llvm-commits at lists.llvm.org
Thu Sep 18 08:37:23 PDT 2025
================
@@ -0,0 +1,351 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+/// \file
+/// This file declares interface for OnDiskTrieRawHashMap, a thread-safe and
+/// (mostly) lock-free hash map stored as trie and backed by persistent files on
+/// disk.
+///
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_CAS_ONDISKHASHMAPPEDTRIE_H
+#define LLVM_CAS_ONDISKHASHMAPPEDTRIE_H
+
+#include "llvm/ADT/ArrayRef.h"
+#include "llvm/ADT/STLExtras.h"
+#include "llvm/ADT/STLFunctionalExtras.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/Support/Error.h"
+#include <optional>
+
+namespace llvm {
+
+class raw_ostream;
+
+namespace cas {
+
+/// FileOffset is a wrapper around `int64_t` to represent the offset of data
+/// from the beginning of the file.
+class FileOffset {
+public:
+ int64_t get() const { return Offset; }
+
+ explicit operator bool() const { return Offset; }
+
+ FileOffset() = default;
+ explicit FileOffset(int64_t Offset) : Offset(Offset) { assert(Offset >= 0); }
+
+private:
+ int64_t Offset = 0;
+};
+
+/// OnDiskTrieRawHashMap is a persistent trie data structure used as hash maps.
+/// The keys are fixed length, and are expected to be binary hashes with a
+/// normal distribution.
+///
+/// - Thread-safety is achieved through the use of atomics within a shared
+/// memory mapping. Atomic access does not work on networked filesystems.
+/// - Filesystem locks are used, but only sparingly:
+/// - during initialization, for creating / opening an existing store;
+/// - for the lifetime of the instance, a shared/reader lock is held
+/// - during destruction, if there are no concurrent readers, to shrink the
+/// files to their minimum size.
+/// - Path is used as a directory:
+/// - "index" stores the root trie and subtries.
+/// - "data" stores (most of) the entries, like a bump-ptr-allocator.
+/// - Large entries are stored externally in a file named by the key.
+/// - Code is system-dependent (Windows not yet implemented), and binary format
+/// itself is not portable. These are not artifacts that can/should be moved
+/// between different systems; they are only appropriate for local storage.
+class OnDiskTrieRawHashMap {
+public:
+ LLVM_DUMP_METHOD void dump() const;
+ void
+ print(raw_ostream &OS,
+ function_ref<void(ArrayRef<char>)> PrintRecordData = nullptr) const;
+
+public:
+ /// Const value proxy to access the records stored in TrieRawHashMap.
+ struct ConstValueProxy {
+ ConstValueProxy() = default;
+ ConstValueProxy(ArrayRef<uint8_t> Hash, ArrayRef<char> Data)
+ : Hash(Hash), Data(Data) {}
+ ConstValueProxy(ArrayRef<uint8_t> Hash, StringRef Data)
+ : Hash(Hash), Data(Data.begin(), Data.size()) {}
+
+ ArrayRef<uint8_t> Hash;
+ ArrayRef<char> Data;
----------------
cachemeifyoucan wrote:
What suggestions you have? It is quite natural as Data can be get as `StringRef` which is `const char*`, but I guess you are asking why `Hash` and `Data` are different types? I don't really have preference either way.
https://github.com/llvm/llvm-project/pull/114100
More information about the llvm-commits
mailing list