[llvm] [CAS] Add LLVMCAS library with InMemoryCAS implementation (PR #114096)
Steven Wu via llvm-commits
llvm-commits at lists.llvm.org
Fri Aug 8 16:37:45 PDT 2025
https://github.com/cachemeifyoucan updated https://github.com/llvm/llvm-project/pull/114096
From 9bf0f3079c410eb096ad3c2cefb89679bd34282b Mon Sep 17 00:00:00 2001
From: Steven Wu <stevenwu at apple.com>
Date: Tue, 29 Oct 2024 10:36:55 -0700
Subject: [PATCH 1/5] [𝘀𝗽𝗿] initial version
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Created using spr 1.3.5
---
llvm/docs/ContentAddressableStorage.md | 120 +++++++
llvm/docs/Reference.rst | 4 +
llvm/include/llvm/CAS/BuiltinCASContext.h | 88 +++++
llvm/include/llvm/CAS/BuiltinObjectHasher.h | 81 +++++
llvm/include/llvm/CAS/CASID.h | 156 +++++++++
llvm/include/llvm/CAS/CASReference.h | 207 +++++++++++
llvm/include/llvm/CAS/ObjectStore.h | 302 ++++++++++++++++
llvm/include/module.modulemap | 6 +
llvm/lib/CAS/BuiltinCAS.cpp | 94 +++++
llvm/lib/CAS/BuiltinCAS.h | 74 ++++
llvm/lib/CAS/CMakeLists.txt | 8 +
llvm/lib/CAS/InMemoryCAS.cpp | 320 +++++++++++++++++
llvm/lib/CAS/ObjectStore.cpp | 168 +++++++++
llvm/lib/CMakeLists.txt | 1 +
llvm/unittests/CAS/CASTestConfig.cpp | 22 ++
llvm/unittests/CAS/CASTestConfig.h | 32 ++
llvm/unittests/CAS/CMakeLists.txt | 12 +
llvm/unittests/CAS/ObjectStoreTest.cpp | 360 ++++++++++++++++++++
llvm/unittests/CMakeLists.txt | 1 +
19 files changed, 2056 insertions(+)
create mode 100644 llvm/docs/ContentAddressableStorage.md
create mode 100644 llvm/include/llvm/CAS/BuiltinCASContext.h
create mode 100644 llvm/include/llvm/CAS/BuiltinObjectHasher.h
create mode 100644 llvm/include/llvm/CAS/CASID.h
create mode 100644 llvm/include/llvm/CAS/CASReference.h
create mode 100644 llvm/include/llvm/CAS/ObjectStore.h
create mode 100644 llvm/lib/CAS/BuiltinCAS.cpp
create mode 100644 llvm/lib/CAS/BuiltinCAS.h
create mode 100644 llvm/lib/CAS/CMakeLists.txt
create mode 100644 llvm/lib/CAS/InMemoryCAS.cpp
create mode 100644 llvm/lib/CAS/ObjectStore.cpp
create mode 100644 llvm/unittests/CAS/CASTestConfig.cpp
create mode 100644 llvm/unittests/CAS/CASTestConfig.h
create mode 100644 llvm/unittests/CAS/CMakeLists.txt
create mode 100644 llvm/unittests/CAS/ObjectStoreTest.cpp
diff --git a/llvm/docs/ContentAddressableStorage.md b/llvm/docs/ContentAddressableStorage.md
new file mode 100644
index 0000000000000..4f2d9a6a3a918
--- /dev/null
+++ b/llvm/docs/ContentAddressableStorage.md
@@ -0,0 +1,120 @@
+# Content Addressable Storage
+
+## Introduction to CAS
+
+Content Addressable Storage, or `CAS`, is a storage system that assigns a
+unique address to each piece of stored data, derived from its content. It is
+very useful for data deduplication and for creating unique identifiers.
+
+Unlike other kinds of storage systems, such as file systems, CAS is immutable.
+Representing the inputs and outputs of a computation as objects stored in CAS
+makes it more reliable to model that computation.
+
+The basic unit of the CAS library is a CASObject, which contains:
+
+* Data: arbitrary data
+* References: references to other CASObjects
+
+It can be conceptually modeled as something like:
+
+```
+struct CASObject {
+ ArrayRef<char> Data;
+ ArrayRef<CASObject*> Refs;
+};
+```
+
+This abstraction allows simple composition of CASObjects into a DAG that
+represents a complicated data structure while still allowing data
+deduplication. Note that you can compare two DAGs just by comparing the
+CASObject hashes of their two root nodes.
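
The conceptual model above can be sketched as a small standalone program. This is a toy illustration only, not the `llvm::cas` API: `ToyCAS`, `ToyObject`, and `ToyID` are hypothetical names, and the 64-bit FNV-1a hash stands in for a strong hash purely to keep the example short (a real CAS would use a cryptographic hash and handle collisions).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical toy model of a CAS; none of these names come from LLVM.
using ToyID = std::uint64_t;

struct ToyObject {
  std::string Data;        // arbitrary data
  std::vector<ToyID> Refs; // references to other objects, by ID
};

class ToyCAS {
public:
  // "Get-or-create": the returned ID depends only on the content (data plus
  // references), so storing the same content twice yields the same ID.
  ToyID store(const std::string &Data, const std::vector<ToyID> &Refs = {}) {
    ToyID ID = hashObject(Data, Refs);
    Objects.emplace(ID, ToyObject{Data, Refs}); // no-op if already present
    return ID;
  }

  std::size_t size() const { return Objects.size(); }

private:
  // Toy FNV-1a hash over: ref count, each ref ID, data size, data bytes.
  static ToyID hashObject(const std::string &Data,
                          const std::vector<ToyID> &Refs) {
    ToyID H = 1469598103934665603ULL;
    auto MixByte = [&H](std::uint8_t B) {
      H ^= B;
      H *= 1099511628211ULL;
    };
    auto MixU64 = [&MixByte](std::uint64_t V) {
      for (int I = 0; I < 8; ++I)
        MixByte(static_cast<std::uint8_t>(V >> (8 * I)));
    };
    MixU64(Refs.size());
    for (ToyID R : Refs)
      MixU64(R);
    MixU64(Data.size());
    for (unsigned char C : Data)
      MixByte(C);
    return H;
  }

  std::map<ToyID, ToyObject> Objects;
};
```

Because each ID covers the IDs of the referenced objects, two roots with equal IDs describe the same DAG in this model, and storing an identical leaf twice does not grow the store.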
+
+
+
+## LLVM CAS Library User Guide
+
+The CAS-like storage provided in LLVM is `llvm::cas::ObjectStore`.
+To reference a CASObject, there are a few different abstractions provided
+with different trade-offs:
+
+### ObjectRef
+
+`ObjectRef` is a lightweight reference to a CASObject stored in the CAS.
+This is the most commonly used abstraction and it is cheap to copy and pass
+along. It has the following properties:
+
+* `ObjectRef` is only meaningful within the `ObjectStore` that created the ref.
+`ObjectRef`s created by different `ObjectStore`s cannot be cross-referenced or
+compared.
+* `ObjectRef` doesn't guarantee the existence of the CASObject it points to. An
+explicit load is required before accessing the data stored in the CASObject.
+This load can also fail, for reasons including but not limited to: the object
+does not exist, the CAS storage is corrupted, the operation timed out, etc.
+* If two `ObjectRef`s are equal, it is guaranteed that the objects they point
+to (if they exist) are identical. If they are not equal, the underlying
+objects are guaranteed not to be the same.
+
+### ObjectProxy
+
+`ObjectProxy` represents a loaded CASObject. With an `ObjectProxy`, the
+underlying stored data and references can be accessed without the need
+for error handling. The class APIs also provide convenient methods to
+access underlying data. The lifetime of the underlying data is equal to
+the lifetime of the instance of `ObjectStore` unless explicitly copied.
+
+### CASID
+
+`CASID` is the hash identifier for CASObjects. It owns the underlying
+storage for the hash value, so it can be expensive to copy and compare,
+depending on the hash algorithm. `CASID` is generally only useful in rare
+situations like printing the raw hash value or exchanging hash values between
+different CAS instances with the same hashing schema.
+
+### ObjectStore
+
+`ObjectStore` is the CAS-like object storage. It provides APIs to save
+and load CASObjects, for example:
+
+```
+ObjectRef A, B, C;
+Expected<ObjectRef> Stored = ObjectStore.storeFromString({A, B}, "data");
+Expected<ObjectProxy> Loaded = ObjectStore.getProxy(C);
+```
+
+It also provides APIs to convert between `ObjectRef`, `ObjectProxy` and
+`CASID`.
+
+
+
+## CAS Library Implementation Guide
+
+The LLVM ObjectStore APIs are designed so that it is easy to add
+customized CAS implementations that are interchangeable with the builtin
+CAS implementations.
+
+To add your own implementation, you just need to add a subclass of
+`llvm::cas::ObjectStore` and implement all its pure virtual methods.
+To be interchangeable with LLVM ObjectStore, the new CAS implementation
+needs to conform to the following contracts:
+
+* Different CASObjects stored in the ObjectStore need to have different hashes
+and result in different `ObjectRef`s. Conversely, the same CASObject should
+have the same hash and the same `ObjectRef`. Note that two CASObjects with
+identical data but different references are considered different objects.
+* `ObjectRef`s are comparable within the same `ObjectStore` instance, and can
+be used to determine the equality of the underlying CASObjects.
+* The objects loaded from the ObjectStore need to have a lifetime at least as
+long as the ObjectStore itself.
+
+If not specified, the behavior can be implementation defined. For example,
+`ObjectRef` can be used to point to a loaded CASObject so
+`ObjectStore` never fails to load. It is also legal to use a stricter model
+than required. For example, an `ObjectRef` that can be used to compare
+objects between different `ObjectStore` instances is legal, but users
+of the ObjectStore should not depend on this behavior.
+
+For CAS library implementers, there is also an `ObjectHandle` class that
+is an internal representation of a loaded CASObject reference.
+`ObjectProxy` is just a pair of `ObjectHandle` and `ObjectStore`, because
+just like `ObjectRef`, `ObjectHandle` is only useful when paired with
+the ObjectStore that knows about the loaded CASObject.
diff --git a/llvm/docs/Reference.rst b/llvm/docs/Reference.rst
index df61628b06c7d..ae03a3a7bfa9a 100644
--- a/llvm/docs/Reference.rst
+++ b/llvm/docs/Reference.rst
@@ -15,6 +15,7 @@ LLVM and API reference documentation.
BranchWeightMetadata
Bugpoint
CommandGuide/index
+ ContentAddressableStorage
ConvergenceAndUniformity
ConvergentOperations
Coroutines
@@ -232,3 +233,6 @@ Additional Topics
:doc:`ConvergenceAndUniformity`
A description of uniformity analysis in the presence of irreducible
control flow, and its implementation.
+
+:doc:`ContentAddressableStorage`
+ A reference guide for using LLVM's CAS library.
diff --git a/llvm/include/llvm/CAS/BuiltinCASContext.h b/llvm/include/llvm/CAS/BuiltinCASContext.h
new file mode 100644
index 0000000000000..ebc4ca8bd1f2e
--- /dev/null
+++ b/llvm/include/llvm/CAS/BuiltinCASContext.h
@@ -0,0 +1,88 @@
+//===- BuiltinCASContext.h --------------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_CAS_BUILTINCASCONTEXT_H
+#define LLVM_CAS_BUILTINCASCONTEXT_H
+
+#include "llvm/CAS/CASID.h"
+#include "llvm/Support/BLAKE3.h"
+#include "llvm/Support/Error.h"
+
+namespace llvm::cas::builtin {
+
+/// Current hash type for the builtin CAS.
+///
+/// FIXME: This should be configurable via an enum to allow configuring the hash
+/// function. The enum should be sent into \a createInMemoryCAS() and \a
+/// createOnDiskCAS().
+///
+/// This is important (at least) for future-proofing, when we want to make new
+/// CAS instances use BLAKE7, but still know how to read/write BLAKE3.
+///
+/// Even just for BLAKE3, it would be useful to have these values:
+///
+/// BLAKE3 => 32B hash from BLAKE3
+/// BLAKE3_16B => 16B hash from BLAKE3 (truncated)
+///
+/// ... where BLAKE3_16B uses \a TruncatedBLAKE3<16>.
+///
+/// Motivation for a truncated hash is that it's cheaper to store. It's not
+/// clear if we always (or ever) need the full 32B, and for an ephemeral
+/// in-memory CAS, we almost certainly don't need it.
+///
+/// Note that the cost is linear in the number of objects for the builtin CAS,
+/// since we're using internal offsets and/or pointers as an optimization.
+///
+/// However, it's possible we'll want to hook up a local builtin CAS to, e.g.,
+/// a distributed generic hash map to use as an ActionCache. In that scenario,
+/// the transitive closure of the structured objects that are the results of
+/// the cached actions would need to be serialized into the map, something
+/// like:
+///
+/// "action:<schema>:<key>" -> "0123"
+/// "object:<schema>:0123" -> "3,4567,89AB,CDEF,9,some data"
+/// "object:<schema>:4567" -> ...
+/// "object:<schema>:89AB" -> ...
+/// "object:<schema>:CDEF" -> ...
+///
+/// These references would be full cost.
+using HasherT = BLAKE3;
+using HashType = decltype(HasherT::hash(std::declval<ArrayRef<uint8_t> &>()));
+
+class BuiltinCASContext : public CASContext {
+ void printIDImpl(raw_ostream &OS, const CASID &ID) const final;
+ void anchor() override;
+
+public:
+ /// Get the name of the hash for any table identifiers.
+ ///
+ /// FIXME: This should be configurable via an enum, with at least the
+ /// following values:
+ ///
+ /// "BLAKE3" => 32B hash from BLAKE3
+ /// "BLAKE3.16" => 16B hash from BLAKE3 (truncated)
+ ///
+ /// Enum can be sent into \a createInMemoryCAS() and \a createOnDiskCAS().
+ static StringRef getHashName() { return "BLAKE3"; }
+ StringRef getHashSchemaIdentifier() const final {
+ static const std::string ID =
+ ("llvm.cas.builtin.v2[" + getHashName() + "]").str();
+ return ID;
+ }
+
+ static const BuiltinCASContext &getDefaultContext();
+
+ BuiltinCASContext() = default;
+
+ static Expected<HashType> parseID(StringRef PrintedDigest);
+ static void printID(ArrayRef<uint8_t> Digest, raw_ostream &OS);
+};
+
+} // namespace llvm::cas::builtin
+
+#endif // LLVM_CAS_BUILTINCASCONTEXT_H
diff --git a/llvm/include/llvm/CAS/BuiltinObjectHasher.h b/llvm/include/llvm/CAS/BuiltinObjectHasher.h
new file mode 100644
index 0000000000000..22e556c5669b5
--- /dev/null
+++ b/llvm/include/llvm/CAS/BuiltinObjectHasher.h
@@ -0,0 +1,81 @@
+//===- BuiltinObjectHasher.h ------------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_CAS_BUILTINOBJECTHASHER_H
+#define LLVM_CAS_BUILTINOBJECTHASHER_H
+
+#include "llvm/CAS/ObjectStore.h"
+#include "llvm/Support/Endian.h"
+
+namespace llvm::cas {
+
+template <class HasherT> class BuiltinObjectHasher {
+public:
+ using HashT = decltype(HasherT::hash(std::declval<ArrayRef<uint8_t> &>()));
+
+ static HashT hashObject(const ObjectStore &CAS, ArrayRef<ObjectRef> Refs,
+ ArrayRef<char> Data) {
+ BuiltinObjectHasher H;
+ H.updateSize(Refs.size());
+ for (const ObjectRef &Ref : Refs)
+ H.updateRef(CAS, Ref);
+ H.updateArray(Data);
+ return H.finish();
+ }
+
+ static HashT hashObject(ArrayRef<ArrayRef<uint8_t>> Refs,
+ ArrayRef<char> Data) {
+ BuiltinObjectHasher H;
+ H.updateSize(Refs.size());
+ for (const ArrayRef<uint8_t> &Ref : Refs)
+ H.updateID(Ref);
+ H.updateArray(Data);
+ return H.finish();
+ }
+
+private:
+ HashT finish() { return Hasher.final(); }
+
+ void updateRef(const ObjectStore &CAS, ObjectRef Ref) {
+ updateID(CAS.getID(Ref));
+ }
+
+ void updateID(const CASID &ID) { updateID(ID.getHash()); }
+
+ void updateID(ArrayRef<uint8_t> Hash) {
+ // NOTE: Does not hash the size of the hash. That's a CAS implementation
+ // detail that shouldn't leak into the UUID for an object.
+ assert(Hash.size() == sizeof(HashT) &&
+ "Expected object ref to match the hash size");
+ Hasher.update(Hash);
+ }
+
+ void updateArray(ArrayRef<uint8_t> Bytes) {
+ updateSize(Bytes.size());
+ Hasher.update(Bytes);
+ }
+
+ void updateArray(ArrayRef<char> Bytes) {
+ updateArray(ArrayRef(reinterpret_cast<const uint8_t *>(Bytes.data()),
+ Bytes.size()));
+ }
+
+ void updateSize(uint64_t Size) {
+ Size = support::endian::byte_swap(Size, endianness::little);
+ Hasher.update(
+ ArrayRef(reinterpret_cast<const uint8_t *>(&Size), sizeof(Size)));
+ }
+
+ BuiltinObjectHasher() = default;
+ ~BuiltinObjectHasher() = default;
+ HasherT Hasher;
+};
+
+} // namespace llvm::cas
+
+#endif // LLVM_CAS_BUILTINOBJECTHASHER_H
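
Reading `hashObject()` above, the hasher is fed, in order: the reference count as a 64-bit little-endian integer, the raw hash bytes of each reference (without a per-hash length), the data size as a 64-bit little-endian integer, and finally the data bytes. The standalone sketch below only reconstructs that byte stream so the layout is easy to inspect; `hashInput` and `appendSizeLE` are hypothetical helper names, and no actual hashing (BLAKE3 or otherwise) is performed.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Append Size as 8 little-endian bytes, independent of host endianness.
static void appendSizeLE(Bytes &Out, std::uint64_t Size) {
  for (int I = 0; I < 8; ++I)
    Out.push_back(static_cast<std::uint8_t>(Size >> (8 * I)));
}

// Hypothetical helper: build the byte stream that a hasher following the
// update order of BuiltinObjectHasher::hashObject() would consume.
Bytes hashInput(const std::vector<Bytes> &RefHashes, const std::string &Data) {
  Bytes Out;
  appendSizeLE(Out, RefHashes.size());         // number of references
  for (const Bytes &H : RefHashes)             // each reference's hash bytes
    Out.insert(Out.end(), H.begin(), H.end());
  appendSizeLE(Out, Data.size());              // size of the data
  Out.insert(Out.end(), Data.begin(), Data.end()); // the data itself
  return Out;
}
```

Prefixing both the reference count and the data size keeps the encoding unambiguous: an object with one reference and empty data produces a different stream from an object with no references whose data happens to equal that reference's hash bytes.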
diff --git a/llvm/include/llvm/CAS/CASID.h b/llvm/include/llvm/CAS/CASID.h
new file mode 100644
index 0000000000000..5f9110a15819a
--- /dev/null
+++ b/llvm/include/llvm/CAS/CASID.h
@@ -0,0 +1,156 @@
+//===- llvm/CAS/CASID.h -----------------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_CAS_CASID_H
+#define LLVM_CAS_CASID_H
+
+#include "llvm/ADT/ArrayRef.h"
+#include "llvm/ADT/DenseMapInfo.h"
+#include "llvm/ADT/SmallString.h"
+#include "llvm/ADT/StringExtras.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/Support/Error.h"
+
+namespace llvm {
+
+class raw_ostream;
+
+namespace cas {
+
+class CASID;
+
+/// Context for CAS identifiers.
+class CASContext {
+ virtual void anchor();
+
+public:
+ virtual ~CASContext() = default;
+
+ /// Get an identifier for the schema used by this CAS context. Two CAS
+ /// instances should return the same identifier if and only if their
+ /// CASIDs are safe to compare by hash. This is used when comparing two
+ /// \a CASID values with \c operator==.
+ virtual StringRef getHashSchemaIdentifier() const = 0;
+
+protected:
+ /// Print \p ID to \p OS.
+ virtual void printIDImpl(raw_ostream &OS, const CASID &ID) const = 0;
+
+ friend class CASID;
+};
+
+/// Unique identifier for a CAS object.
+///
+/// Locally, stores an internal CAS identifier that's specific to a single CAS
+/// instance. It's guaranteed not to change across the view of that CAS, but
+/// might change between runs.
+///
+/// It also has a \a CASContext pointer to allow comparison of these
+/// identifiers. If two CASIDs are from the same CASContext, they can be
+/// compared directly. If they are not, then \a
+/// CASContext::getHashSchemaIdentifier() is compared to see if they can be
+/// compared by hash, in which case the result of \a getHash() is compared.
+class CASID {
+public:
+ void dump() const;
+ void print(raw_ostream &OS) const {
+ return getContext().printIDImpl(OS, *this);
+ }
+ friend raw_ostream &operator<<(raw_ostream &OS, const CASID &ID) {
+ ID.print(OS);
+ return OS;
+ }
+ std::string toString() const;
+
+ ArrayRef<uint8_t> getHash() const {
+ return arrayRefFromStringRef<uint8_t>(Hash);
+ }
+
+ friend bool operator==(const CASID &LHS, const CASID &RHS) {
+ if (LHS.Context == RHS.Context)
+ return LHS.Hash == RHS.Hash;
+
+ // EmptyKey or TombstoneKey.
+ if (!LHS.Context || !RHS.Context)
+ return false;
+
+ // CASIDs are equal when they have the same hash schema and same hash value.
+ return LHS.Context->getHashSchemaIdentifier() ==
+ RHS.Context->getHashSchemaIdentifier() &&
+ LHS.Hash == RHS.Hash;
+ }
+
+ friend bool operator!=(const CASID &LHS, const CASID &RHS) {
+ return !(LHS == RHS);
+ }
+
+ friend hash_code hash_value(const CASID &ID) {
+ ArrayRef<uint8_t> Hash = ID.getHash();
+ return hash_combine_range(Hash.begin(), Hash.end());
+ }
+
+ const CASContext &getContext() const {
+ assert(Context && "Tombstone or empty key for DenseMap?");
+ return *Context;
+ }
+
+ static CASID getDenseMapEmptyKey() {
+ return CASID(nullptr, DenseMapInfo<StringRef>::getEmptyKey());
+ }
+ static CASID getDenseMapTombstoneKey() {
+ return CASID(nullptr, DenseMapInfo<StringRef>::getTombstoneKey());
+ }
+
+ CASID() = delete;
+
+ static CASID create(const CASContext *Context, StringRef Hash) {
+ return CASID(Context, Hash);
+ }
+
+private:
+ CASID(const CASContext *Context, StringRef Hash)
+ : Context(Context), Hash(Hash) {}
+
+ const CASContext *Context;
+ SmallString<32> Hash;
+};
+
+/// This is used to work around the issue of MSVC needing default-constructible
+/// types for \c std::promise/future.
+template <typename T> struct AsyncValue {
+ Expected<std::optional<T>> take() { return std::move(Value); }
+
+ AsyncValue() : Value(std::nullopt) {}
+ AsyncValue(Error &&E) : Value(std::move(E)) {}
+ AsyncValue(T &&V) : Value(std::move(V)) {}
+ AsyncValue(std::nullopt_t) : Value(std::nullopt) {}
+ AsyncValue(Expected<std::optional<T>> &&Obj) : Value(std::move(Obj)) {}
+
+private:
+ Expected<std::optional<T>> Value;
+};
+
+} // namespace cas
+
+template <> struct DenseMapInfo<cas::CASID> {
+ static cas::CASID getEmptyKey() { return cas::CASID::getDenseMapEmptyKey(); }
+
+ static cas::CASID getTombstoneKey() {
+ return cas::CASID::getDenseMapTombstoneKey();
+ }
+
+ static unsigned getHashValue(cas::CASID ID) {
+ return (unsigned)hash_value(ID);
+ }
+
+ static bool isEqual(cas::CASID LHS, cas::CASID RHS) { return LHS == RHS; }
+};
+
+} // namespace llvm
+
+#endif // LLVM_CAS_CASID_H
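
The comparison rule in `CASID::operator==` above can be modeled in isolation. The sketch below is a simplified stand-in, not the LLVM classes: `ToyContext`, `ToyCASID`, and `toyEquals` are hypothetical names, and the schema identifier is reduced to a plain string member.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for CASContext and CASID.
struct ToyContext {
  std::string Schema; // plays the role of getHashSchemaIdentifier()
};

struct ToyCASID {
  const ToyContext *Context; // null models a DenseMap empty/tombstone key
  std::string Hash;
};

bool toyEquals(const ToyCASID &LHS, const ToyCASID &RHS) {
  // Same context object (including both null): compare hashes directly.
  if (LHS.Context == RHS.Context)
    return LHS.Hash == RHS.Hash;

  // Exactly one side is a sentinel key: never equal to a real ID.
  if (!LHS.Context || !RHS.Context)
    return false;

  // Different contexts: equal only if the schema and the hash both match.
  return LHS.Context->Schema == RHS.Context->Schema && LHS.Hash == RHS.Hash;
}
```

In this model, two IDs from distinct contexts compare equal only when the contexts advertise the same schema string, which mirrors the "exchanging hash values between different CAS instances with the same hashing schema" use case from the documentation.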
diff --git a/llvm/include/llvm/CAS/CASReference.h b/llvm/include/llvm/CAS/CASReference.h
new file mode 100644
index 0000000000000..1f435cf306c4c
--- /dev/null
+++ b/llvm/include/llvm/CAS/CASReference.h
@@ -0,0 +1,207 @@
+//===- llvm/CAS/CASReference.h ----------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_CAS_CASREFERENCE_H
+#define LLVM_CAS_CASREFERENCE_H
+
+#include "llvm/ADT/ArrayRef.h"
+#include "llvm/ADT/DenseMapInfo.h"
+#include "llvm/ADT/StringRef.h"
+
+namespace llvm {
+
+class raw_ostream;
+
+namespace cas {
+
+class ObjectStore;
+
+class ObjectHandle;
+class ObjectRef;
+
+/// Base class for references to things in \a ObjectStore.
+class ReferenceBase {
+protected:
+ struct DenseMapEmptyTag {};
+ struct DenseMapTombstoneTag {};
+ static constexpr uint64_t getDenseMapEmptyRef() { return -1ULL; }
+ static constexpr uint64_t getDenseMapTombstoneRef() { return -2ULL; }
+
+public:
+ /// Get an internal reference.
+ uint64_t getInternalRef(const ObjectStore &ExpectedCAS) const {
+#if LLVM_ENABLE_ABI_BREAKING_CHECKS
+ assert(CAS == &ExpectedCAS && "Extracting reference for the wrong CAS");
+#endif
+ return InternalRef;
+ }
+
+ unsigned getDenseMapHash() const {
+ return (unsigned)llvm::hash_value(InternalRef);
+ }
+ bool isDenseMapEmpty() const { return InternalRef == getDenseMapEmptyRef(); }
+ bool isDenseMapTombstone() const {
+ return InternalRef == getDenseMapTombstoneRef();
+ }
+ bool isDenseMapSentinel() const {
+ return isDenseMapEmpty() || isDenseMapTombstone();
+ }
+
+protected:
+ void print(raw_ostream &OS, const ObjectHandle &This) const;
+ void print(raw_ostream &OS, const ObjectRef &This) const;
+
+ bool hasSameInternalRef(const ReferenceBase &RHS) const {
+#if LLVM_ENABLE_ABI_BREAKING_CHECKS
+ assert(
+ (isDenseMapSentinel() || RHS.isDenseMapSentinel() || CAS == RHS.CAS) &&
+ "Cannot compare across CAS instances");
+#endif
+ return InternalRef == RHS.InternalRef;
+ }
+
+protected:
+ friend class ObjectStore;
+ ReferenceBase(const ObjectStore *CAS, uint64_t InternalRef, bool IsHandle)
+ : InternalRef(InternalRef) {
+#if LLVM_ENABLE_ABI_BREAKING_CHECKS
+ this->CAS = CAS;
+#endif
+ assert(InternalRef != getDenseMapEmptyRef() && "Reserved for DenseMapInfo");
+ assert(InternalRef != getDenseMapTombstoneRef() &&
+ "Reserved for DenseMapInfo");
+ }
+ explicit ReferenceBase(DenseMapEmptyTag)
+ : InternalRef(getDenseMapEmptyRef()) {}
+ explicit ReferenceBase(DenseMapTombstoneTag)
+ : InternalRef(getDenseMapTombstoneRef()) {}
+
+private:
+ uint64_t InternalRef;
+
+#if LLVM_ENABLE_ABI_BREAKING_CHECKS
+ const ObjectStore *CAS = nullptr;
+#endif
+};
+
+/// Reference to an object in a \a ObjectStore instance.
+///
+/// If you have an ObjectRef, you know the object exists, and you can point at
+/// it from new nodes with \a ObjectStore::store(), but you don't know anything
+/// about it. "Loading" the object is a separate step that may not have
+/// happened yet, and which can fail (due to filesystem corruption) or
+/// introduce latency (if downloading from a remote store).
+///
+/// \a ObjectStore::store() takes a list of these, and these are returned by \a
+/// ObjectStore::forEachRef() and \a ObjectStore::readRef(), which are accessors
+/// for nodes, and \a ObjectStore::getReference().
+///
+/// \a ObjectStore::load() will load the referenced object, and returns \a
+/// ObjectHandle, a variant that knows what kind of entity it is. \a
+/// ObjectStore::getReferenceKind() can expect the type of reference without
+/// asking for unloaded objects to be loaded.
+///
+/// This is a wrapper around a \c uint64_t (and a \a ObjectStore instance when
+/// assertions are on). If necessary, it can be deconstructed and reconstructed
+/// using \a ReferenceBase::getInternalRef() and \a
+/// ObjectRef::getFromInternalRef(), but clients aren't expected to need to do
+/// this. These both require the right \a ObjectStore instance.
+class ObjectRef : public ReferenceBase {
+ struct DenseMapTag {};
+
+public:
+ friend bool operator==(const ObjectRef &LHS, const ObjectRef &RHS) {
+ return LHS.hasSameInternalRef(RHS);
+ }
+ friend bool operator!=(const ObjectRef &LHS, const ObjectRef &RHS) {
+ return !(LHS == RHS);
+ }
+
+ /// Allow a reference to be recreated after it's deconstructed.
+ static ObjectRef getFromInternalRef(const ObjectStore &CAS,
+ uint64_t InternalRef) {
+ return ObjectRef(CAS, InternalRef);
+ }
+
+ static ObjectRef getDenseMapEmptyKey() {
+ return ObjectRef(DenseMapEmptyTag{});
+ }
+ static ObjectRef getDenseMapTombstoneKey() {
+ return ObjectRef(DenseMapTombstoneTag{});
+ }
+
+ /// Print internal ref and/or CASID. Only suitable for debugging.
+ void print(raw_ostream &OS) const { return ReferenceBase::print(OS, *this); }
+
+ LLVM_DUMP_METHOD void dump() const;
+
+private:
+ friend class ObjectStore;
+ friend class ReferenceBase;
+ using ReferenceBase::ReferenceBase;
+ ObjectRef(const ObjectStore &CAS, uint64_t InternalRef)
+ : ReferenceBase(&CAS, InternalRef, /*IsHandle=*/false) {
+ assert(InternalRef != -1ULL && "Reserved for DenseMapInfo");
+ assert(InternalRef != -2ULL && "Reserved for DenseMapInfo");
+ }
+ explicit ObjectRef(DenseMapEmptyTag T) : ReferenceBase(T) {}
+ explicit ObjectRef(DenseMapTombstoneTag T) : ReferenceBase(T) {}
+ explicit ObjectRef(ReferenceBase) = delete;
+};
+
+/// Handle to a loaded object in a \a ObjectStore instance.
+///
+/// ObjectHandle encapsulates a *loaded* object in the CAS. You need one
+/// of these to inspect the content of an object: to look at its stored
+/// data and references.
+class ObjectHandle : public ReferenceBase {
+public:
+ friend bool operator==(const ObjectHandle &LHS, const ObjectHandle &RHS) {
+ return LHS.hasSameInternalRef(RHS);
+ }
+ friend bool operator!=(const ObjectHandle &LHS, const ObjectHandle &RHS) {
+ return !(LHS == RHS);
+ }
+
+ /// Print internal ref and/or CASID. Only suitable for debugging.
+ void print(raw_ostream &OS) const { return ReferenceBase::print(OS, *this); }
+
+ LLVM_DUMP_METHOD void dump() const;
+
+private:
+ friend class ObjectStore;
+ friend class ReferenceBase;
+ using ReferenceBase::ReferenceBase;
+ explicit ObjectHandle(ReferenceBase) = delete;
+ ObjectHandle(const ObjectStore &CAS, uint64_t InternalRef)
+ : ReferenceBase(&CAS, InternalRef, /*IsHandle=*/true) {}
+};
+
+} // namespace cas
+
+template <> struct DenseMapInfo<cas::ObjectRef> {
+ static cas::ObjectRef getEmptyKey() {
+ return cas::ObjectRef::getDenseMapEmptyKey();
+ }
+
+ static cas::ObjectRef getTombstoneKey() {
+ return cas::ObjectRef::getDenseMapTombstoneKey();
+ }
+
+ static unsigned getHashValue(cas::ObjectRef Ref) {
+ return Ref.getDenseMapHash();
+ }
+
+ static bool isEqual(cas::ObjectRef LHS, cas::ObjectRef RHS) {
+ return LHS == RHS;
+ }
+};
+
+} // namespace llvm
+
+#endif // LLVM_CAS_CASREFERENCE_H
diff --git a/llvm/include/llvm/CAS/ObjectStore.h b/llvm/include/llvm/CAS/ObjectStore.h
new file mode 100644
index 0000000000000..b4720c7edc154
--- /dev/null
+++ b/llvm/include/llvm/CAS/ObjectStore.h
@@ -0,0 +1,302 @@
+//===- llvm/CAS/ObjectStore.h -----------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_CAS_OBJECTSTORE_H
+#define LLVM_CAS_OBJECTSTORE_H
+
+#include "llvm/ADT/StringRef.h"
+#include "llvm/CAS/CASID.h"
+#include "llvm/CAS/CASReference.h"
+#include "llvm/Support/Error.h"
+#include "llvm/Support/FileSystem.h"
+#include <cstddef>
+
+namespace llvm {
+
+class MemoryBuffer;
+template <typename T> class unique_function;
+
+namespace cas {
+
+class ObjectStore;
+class ObjectProxy;
+
+/// Content-addressable storage for objects.
+///
+/// Conceptually, objects are stored in a "unique set".
+///
+/// - Objects are immutable ("value objects") that are defined by their
+/// content. They are implicitly deduplicated by content.
+/// - Each object has a unique identifier (UID) that's derived from its content,
+/// called a \a CASID.
+/// - This UID is a fixed-size (strong) hash of the transitive content of a
+/// CAS object.
+/// - It's comparable between any two CAS instances that have the same \a
+/// CASContext::getHashSchemaIdentifier().
+/// - The UID can be printed (e.g., \a CASID::toString()) and it can be parsed
+/// by the same or a different CAS instance with \a
+/// ObjectStore::parseID().
+/// - An object can be looked up by content or by UID.
+/// - \a store() is a "get-or-create" method, writing an object if it
+/// doesn't exist yet, and returning a ref to it in any case.
+/// - \a loadObject(const CASID&) looks up an object by its UID.
+/// - Objects can reference other objects, forming an arbitrary DAG.
+///
+/// The \a ObjectStore interface has a few ways of referencing objects:
+///
+/// - \a ObjectRef encapsulates a reference to something in the CAS. It is an
+/// opaque type that references an object inside a specific CAS. It is
+/// implementation defined if the underlying object exists or not for an
+/// ObjectRef, and it can be used to speed up CAS lookup as an implementation
+/// detail. However, you don't know anything about the underlying objects.
+/// "Loading" the object is a separate step that may not have happened
+/// yet, and which can fail (e.g. due to filesystem corruption) or introduce
+/// latency (if downloading from a remote store).
+/// - \a ObjectHandle encapsulates a *loaded* object in the CAS. You need one
+/// of these to inspect the content of an object: to look at its stored
+/// data and references. This is internal to the CAS implementation and not
+/// available from the CAS public APIs.
+/// - \a CASID: the UID for an object in the CAS, obtained through \a
+/// ObjectStore::getID() or \a ObjectStore::parseID(). This is a valid CAS
+/// identifier, but may reference an object that is unknown to this CAS
+/// instance.
+/// - \a ObjectProxy pairs an ObjectHandle (subclass) with an ObjectStore, and
+/// wraps access APIs to avoid having to pass extra parameters. It is the
+/// object used for accessing underlying data and refs by CAS users.
+///
+/// Both ObjectRef and ObjectHandle are lightweight, wrapping a `uint64_t` and
+/// are only valid with the associated ObjectStore instance.
+///
+/// There are a few options for accessing content of objects, with different
+/// lifetime tradeoffs:
+///
+/// - \a getData() accesses data without exposing lifetime at all.
+/// - \a getMemoryBuffer() returns a \a MemoryBuffer whose lifetime
+/// is independent of the CAS (it can live longer).
+/// - \a getDataString() returns a StringRef whose lifetime is guaranteed to
+/// last as long as \a ObjectStore.
+/// - \a readRef() and \a forEachRef() iterate through the references in an
+/// object. There is no lifetime assumption.
+class ObjectStore {
+ friend class ObjectProxy;
+ void anchor();
+
+public:
+ /// Get a \p CASID from a \p ID, which should have been generated by \a
+ /// CASID::print(). This succeeds as long as \a validateID() would pass. The
+ /// object may be unknown to this CAS instance.
+ ///
+ /// TODO: Remove, and update callers to use \a validateID() or \a
+ /// extractHashFromID().
+ virtual Expected<CASID> parseID(StringRef ID) = 0;
+
+ /// Store object into ObjectStore.
+ virtual Expected<ObjectRef> store(ArrayRef<ObjectRef> Refs,
+ ArrayRef<char> Data) = 0;
+ /// Get an ID for \p Ref.
+ virtual CASID getID(ObjectRef Ref) const = 0;
+
+ /// Get an existing reference to the object called \p ID.
+ ///
+ /// Returns \c std::nullopt if the object is not stored in this CAS.
+ virtual std::optional<ObjectRef> getReference(const CASID &ID) const = 0;
+
+ /// \returns true if the object is directly available from the local CAS, for
+ /// implementations that have this kind of distinction.
+ virtual Expected<bool> isMaterialized(ObjectRef Ref) const = 0;
+
+ /// Validate the underlying object referred to by CASID.
+ virtual Error validate(const CASID &ID) = 0;
+
+protected:
+ /// Load the object referenced by \p Ref.
+ ///
+ /// Errors if the object cannot be loaded.
+ /// \returns \c std::nullopt if the object is missing from the CAS.
+ virtual Expected<std::optional<ObjectHandle>> loadIfExists(ObjectRef Ref) = 0;
+
+ /// Like \c loadIfExists but returns an error if the object is missing.
+ Expected<ObjectHandle> load(ObjectRef Ref);
+
+ /// Get the size of some data.
+ virtual uint64_t getDataSize(ObjectHandle Node) const = 0;
+
+ /// Methods for handling objects.
+ virtual Error forEachRef(ObjectHandle Node,
+ function_ref<Error(ObjectRef)> Callback) const = 0;
+ virtual ObjectRef readRef(ObjectHandle Node, size_t I) const = 0;
+ virtual size_t getNumRefs(ObjectHandle Node) const = 0;
+ virtual ArrayRef<char> getData(ObjectHandle Node,
+ bool RequiresNullTerminator = false) const = 0;
+
+ /// Get ObjectRef from open file.
+ virtual Expected<ObjectRef>
+ storeFromOpenFileImpl(sys::fs::file_t FD,
+ std::optional<sys::fs::file_status> Status);
+
+ /// Get a lifetime-extended StringRef pointing at the data of \p Node.
+ ///
+ /// Depending on the CAS implementation, this may involve in-memory storage
+ /// overhead.
+ StringRef getDataString(ObjectHandle Node) {
+ return toStringRef(getData(Node));
+ }
+
+ /// Get a lifetime-extended MemoryBuffer pointing at the data of \p Node.
+ ///
+ /// Depending on the CAS implementation, this may involve in-memory storage
+ /// overhead.
+ std::unique_ptr<MemoryBuffer>
+ getMemoryBuffer(ObjectHandle Node, StringRef Name = "",
+ bool RequiresNullTerminator = true);
+
+ /// Read all the refs from the object into a SmallVector.
+ virtual void readRefs(ObjectHandle Node,
+ SmallVectorImpl<ObjectRef> &Refs) const;
+
+ /// Allow ObjectStore implementations to create internal handles.
+#define MAKE_CAS_HANDLE_CONSTRUCTOR(HandleKind) \
+ HandleKind make##HandleKind(uint64_t InternalRef) const { \
+ return HandleKind(*this, InternalRef); \
+ }
+ MAKE_CAS_HANDLE_CONSTRUCTOR(ObjectHandle)
+ MAKE_CAS_HANDLE_CONSTRUCTOR(ObjectRef)
+#undef MAKE_CAS_HANDLE_CONSTRUCTOR
+
+public:
+ /// Helper function to store an object and return an ObjectProxy for it.
+ Expected<ObjectProxy> createProxy(ArrayRef<ObjectRef> Refs, StringRef Data);
+
+ /// Store object from StringRef.
+ Expected<ObjectRef> storeFromString(ArrayRef<ObjectRef> Refs,
+ StringRef String) {
+ return store(Refs, arrayRefFromStringRef<char>(String));
+ }
+
+ /// Default implementation reads \p FD and calls \a store(). Does not
+ /// take ownership of \p FD; the caller is responsible for closing it.
+ ///
+ /// If \p Status is passed in, it is treated as a hint. Implementations
+ /// must protect against the file size potentially growing after the status
+ /// was taken (i.e., they cannot assume that an mmap will be null-terminated
+ /// where \p Status implies).
+ ///
+ /// Returns an \a ObjectRef for the stored file contents.
+ Expected<ObjectRef>
+ storeFromOpenFile(sys::fs::file_t FD,
+ std::optional<sys::fs::file_status> Status = std::nullopt) {
+ return storeFromOpenFileImpl(FD, Status);
+ }
+
+ static Error createUnknownObjectError(const CASID &ID);
+
+ /// Create ObjectProxy from CASID. If the object doesn't exist, get an error.
+ Expected<ObjectProxy> getProxy(const CASID &ID);
+ /// Create ObjectProxy from ObjectRef. If the object can't be loaded, get an
+ /// error.
+ Expected<ObjectProxy> getProxy(ObjectRef Ref);
+
+ /// \returns \c std::nullopt if the object is missing from the CAS.
+ Expected<std::optional<ObjectProxy>> getProxyIfExists(ObjectRef Ref);
+
+ /// Read the data from \p Node into \p OS.
+ uint64_t readData(ObjectHandle Node, raw_ostream &OS, uint64_t Offset = 0,
+ uint64_t MaxBytes = -1ULL) const {
+ ArrayRef<char> Data = getData(Node);
+ assert(Offset < Data.size() && "Expected valid offset");
+ Data = Data.drop_front(Offset).take_front(MaxBytes);
+ OS << toStringRef(Data);
+ return Data.size();
+ }
+
+ /// Validate the whole node tree.
+ Error validateTree(ObjectRef Ref);
+
+ /// Print the ObjectStore internals for debugging purposes.
+ virtual void print(raw_ostream &) const {}
+ void dump() const;
+
+ /// Get CASContext
+ const CASContext &getContext() const { return Context; }
+
+ virtual ~ObjectStore() = default;
+
+protected:
+ ObjectStore(const CASContext &Context) : Context(Context) {}
+
+private:
+ const CASContext &Context;
+};
+
+/// Reference to an abstract hierarchical node, with data and references.
+/// Reference is passed by value and is expected to be valid as long as the \a
+/// ObjectStore is.
+class ObjectProxy {
+public:
+ const ObjectStore &getCAS() const { return *CAS; }
+ ObjectStore &getCAS() { return *CAS; }
+ CASID getID() const { return CAS->getID(Ref); }
+ ObjectRef getRef() const { return Ref; }
+ size_t getNumReferences() const { return CAS->getNumRefs(H); }
+ ObjectRef getReference(size_t I) const { return CAS->readRef(H, I); }
+
+ operator CASID() const { return getID(); }
+ CASID getReferenceID(size_t I) const {
+ std::optional<CASID> ID = getCAS().getID(getReference(I));
+ assert(ID && "Expected reference to be first-class object");
+ return *ID;
+ }
+
+ /// Visit each reference in order, returning an error from \p Callback to
+ /// stop early.
+ Error forEachReference(function_ref<Error(ObjectRef)> Callback) const {
+ return CAS->forEachRef(H, Callback);
+ }
+
+ std::unique_ptr<MemoryBuffer>
+ getMemoryBuffer(StringRef Name = "",
+ bool RequiresNullTerminator = true) const;
+
+ /// Get the content of the node. Valid as long as the CAS is valid.
+ StringRef getData() const { return CAS->getDataString(H); }
+
+ friend bool operator==(const ObjectProxy &Proxy, ObjectRef Ref) {
+ return Proxy.getRef() == Ref;
+ }
+ friend bool operator==(ObjectRef Ref, const ObjectProxy &Proxy) {
+ return Proxy.getRef() == Ref;
+ }
+ friend bool operator!=(const ObjectProxy &Proxy, ObjectRef Ref) {
+ return !(Proxy.getRef() == Ref);
+ }
+ friend bool operator!=(ObjectRef Ref, const ObjectProxy &Proxy) {
+ return !(Proxy.getRef() == Ref);
+ }
+
+public:
+ ObjectProxy() = delete;
+
+ static ObjectProxy load(ObjectStore &CAS, ObjectRef Ref, ObjectHandle Node) {
+ return ObjectProxy(CAS, Ref, Node);
+ }
+
+private:
+ ObjectProxy(ObjectStore &CAS, ObjectRef Ref, ObjectHandle H)
+ : CAS(&CAS), Ref(Ref), H(H) {}
+
+ ObjectStore *CAS;
+ ObjectRef Ref;
+ ObjectHandle H;
+};
+
+std::unique_ptr<ObjectStore> createInMemoryCAS();
+
+} // namespace cas
+} // namespace llvm
+
+#endif // LLVM_CAS_OBJECTSTORE_H
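Review note on `ObjectStore::readData()`: the slicing it performs (`drop_front(Offset)` followed by `take_front(MaxBytes)`) can be sketched in plain standard C++ as below. This is only an illustration, not the LLVM code: `readSlice` is a hypothetical name, `std::string_view` stands in for `ArrayRef<char>`, and a `std::string` stands in for `raw_ostream`. One deliberate difference: the patch asserts `Offset < Data.size()`, which would appear to reject an empty object even at `Offset == 0`; the sketch clamps the offset instead.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical stand-in for ObjectStore::readData(): appends at most MaxBytes
// of Data, starting at Offset, to Out and returns the number of bytes copied.
inline uint64_t readSlice(std::string_view Data, std::string &Out,
                          uint64_t Offset = 0,
                          uint64_t MaxBytes = UINT64_MAX) {
  // Clamp instead of asserting, so empty data with Offset == 0 is accepted.
  if (Offset > Data.size())
    Offset = Data.size();
  // Equivalent of drop_front(Offset).
  std::string_view Slice = Data.substr(Offset);
  // Equivalent of take_front(MaxBytes): never take more than what is left.
  if (Slice.size() > MaxBytes)
    Slice = Slice.substr(0, MaxBytes);
  Out.append(Slice);
  return Slice.size();
}
```

With the default arguments the whole payload is copied; passing an offset and a byte limit copies only that window.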
diff --git a/llvm/include/module.modulemap b/llvm/include/module.modulemap
index b00da6d7cd28c..d44d395fa8ef4 100644
--- a/llvm/include/module.modulemap
+++ b/llvm/include/module.modulemap
@@ -105,6 +105,12 @@ module LLVM_BinaryFormat {
textual header "llvm/BinaryFormat/MsgPack.def"
}
+module LLVM_CAS {
+ requires cplusplus
+ umbrella "llvm/CAS"
+ module * { export * }
+}
+
module LLVM_Config {
requires cplusplus
umbrella "llvm/Config"
diff --git a/llvm/lib/CAS/BuiltinCAS.cpp b/llvm/lib/CAS/BuiltinCAS.cpp
new file mode 100644
index 0000000000000..73646ad2c3528
--- /dev/null
+++ b/llvm/lib/CAS/BuiltinCAS.cpp
@@ -0,0 +1,94 @@
+//===- BuiltinCAS.cpp -------------------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "BuiltinCAS.h"
+#include "llvm/ADT/StringExtras.h"
+#include "llvm/CAS/BuiltinObjectHasher.h"
+#include "llvm/Support/Process.h"
+
+using namespace llvm;
+using namespace llvm::cas;
+using namespace llvm::cas::builtin;
+
+static StringRef getCASIDPrefix() { return "llvmcas://"; }
+void BuiltinCASContext::anchor() {}
+
+Expected<HashType> BuiltinCASContext::parseID(StringRef Reference) {
+ if (!Reference.consume_front(getCASIDPrefix()))
+ return createStringError(std::make_error_code(std::errc::invalid_argument),
+ "invalid cas-id '" + Reference + "'");
+
+ // FIXME: Allow shortened references?
+ if (Reference.size() != 2 * sizeof(HashType))
+ return createStringError(std::make_error_code(std::errc::invalid_argument),
+ "wrong size for cas-id hash '" + Reference + "'");
+
+ std::string Binary;
+ if (!tryGetFromHex(Reference, Binary))
+ return createStringError(std::make_error_code(std::errc::invalid_argument),
+ "invalid hash in cas-id '" + Reference + "'");
+
+ assert(Binary.size() == sizeof(HashType));
+ HashType Digest;
+ llvm::copy(Binary, Digest.data());
+ return Digest;
+}
+
+Expected<CASID> BuiltinCAS::parseID(StringRef Reference) {
+ Expected<HashType> Digest = BuiltinCASContext::parseID(Reference);
+ if (!Digest)
+ return Digest.takeError();
+
+ return CASID::create(&getContext(), toStringRef(*Digest));
+}
+
+void BuiltinCASContext::printID(ArrayRef<uint8_t> Digest, raw_ostream &OS) {
+ SmallString<64> Hash;
+ toHex(Digest, /*LowerCase=*/true, Hash);
+ OS << getCASIDPrefix() << Hash;
+}
+
+void BuiltinCASContext::printIDImpl(raw_ostream &OS, const CASID &ID) const {
+ BuiltinCASContext::printID(ID.getHash(), OS);
+}
+
+const BuiltinCASContext &BuiltinCASContext::getDefaultContext() {
+ static BuiltinCASContext DefaultContext;
+ return DefaultContext;
+}
+
+Expected<ObjectRef> BuiltinCAS::store(ArrayRef<ObjectRef> Refs,
+ ArrayRef<char> Data) {
+ return storeImpl(BuiltinObjectHasher<HasherT>::hashObject(*this, Refs, Data),
+ Refs, Data);
+}
+
+Error BuiltinCAS::validate(const CASID &ID) {
+ auto Ref = getReference(ID);
+ if (!Ref)
+ return createUnknownObjectError(ID);
+
+ auto Handle = load(*Ref);
+ if (!Handle)
+ return Handle.takeError();
+
+ auto Proxy = ObjectProxy::load(*this, *Ref, *Handle);
+ SmallVector<ObjectRef> Refs;
+ if (auto E = Proxy.forEachReference([&](ObjectRef Ref) -> Error {
+ Refs.push_back(Ref);
+ return Error::success();
+ }))
+ return E;
+
+ ArrayRef<char> Data(Proxy.getData().data(), Proxy.getData().size());
+ auto Hash = BuiltinObjectHasher<HasherT>::hashObject(*this, Refs, Data);
+ if (!ID.getHash().equals(Hash))
+ return createCorruptObjectError(ID);
+
+ return Error::success();
+}
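Review note on `BuiltinCASContext::parseID()` and `printID()`: the three checks (the `llvmcas://` prefix, exactly two hex digits per hash byte, valid hex) and the lowercase printing can be sketched without LLVM as follows. The names `SketchHash`, `parseSketchID`, and `printSketchID` are hypothetical, and the 4-byte hash size is an assumption chosen only to keep the example short; the real `HashType` is much larger.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Assumption: a 4-byte digest, purely for illustration.
constexpr std::size_t SketchHashSize = 4;
using SketchHash = std::array<uint8_t, SketchHashSize>;
constexpr std::string_view SketchPrefix = "llvmcas://";

// Returns the value of one hex digit, or -1 if C is not a hex digit.
inline int hexValue(char C) {
  if (C >= '0' && C <= '9') return C - '0';
  if (C >= 'a' && C <= 'f') return C - 'a' + 10;
  if (C >= 'A' && C <= 'F') return C - 'A' + 10;
  return -1;
}

// Parses "llvmcas://<hex>" into a digest; std::nullopt on any malformed input.
inline std::optional<SketchHash> parseSketchID(std::string_view ID) {
  if (ID.substr(0, SketchPrefix.size()) != SketchPrefix)
    return std::nullopt; // missing prefix
  ID.remove_prefix(SketchPrefix.size());
  if (ID.size() != 2 * SketchHashSize)
    return std::nullopt; // wrong number of hex digits
  SketchHash Digest{};
  for (std::size_t I = 0; I < SketchHashSize; ++I) {
    int Hi = hexValue(ID[2 * I]);
    int Lo = hexValue(ID[2 * I + 1]);
    if (Hi < 0 || Lo < 0)
      return std::nullopt; // not valid hex
    Digest[I] = static_cast<uint8_t>(Hi * 16 + Lo);
  }
  return Digest;
}

// Prints a digest as the prefix followed by lowercase hex.
inline std::string printSketchID(const SketchHash &Digest) {
  static constexpr char Hex[] = "0123456789abcdef";
  std::string Out(SketchPrefix);
  for (uint8_t B : Digest) {
    Out.push_back(Hex[B >> 4]);
    Out.push_back(Hex[B & 0xF]);
  }
  return Out;
}
```

Printing and then parsing an ID round-trips, which is the property the `PrintIDs` unit test in this patch checks for the real implementation.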
diff --git a/llvm/lib/CAS/BuiltinCAS.h b/llvm/lib/CAS/BuiltinCAS.h
new file mode 100644
index 0000000000000..1a4f640e4e2da
--- /dev/null
+++ b/llvm/lib/CAS/BuiltinCAS.h
@@ -0,0 +1,74 @@
+//===- BuiltinCAS.h ---------------------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_LIB_CAS_BUILTINCAS_H
+#define LLVM_LIB_CAS_BUILTINCAS_H
+
+#include "llvm/ADT/StringRef.h"
+#include "llvm/CAS/BuiltinCASContext.h"
+#include "llvm/CAS/ObjectStore.h"
+
+namespace llvm::cas {
+class ActionCache;
+namespace builtin {
+
+class BuiltinCAS : public ObjectStore {
+public:
+ BuiltinCAS() : ObjectStore(BuiltinCASContext::getDefaultContext()) {}
+
+ Expected<CASID> parseID(StringRef Reference) final;
+
+ Expected<ObjectRef> store(ArrayRef<ObjectRef> Refs,
+ ArrayRef<char> Data) final;
+ virtual Expected<ObjectRef> storeImpl(ArrayRef<uint8_t> ComputedHash,
+ ArrayRef<ObjectRef> Refs,
+ ArrayRef<char> Data) = 0;
+
+ virtual Expected<ObjectRef>
+ storeFromNullTerminatedRegion(ArrayRef<uint8_t> ComputedHash,
+ sys::fs::mapped_file_region Map) {
+ return storeImpl(ComputedHash, std::nullopt,
+ ArrayRef(Map.data(), Map.size()));
+ }
+
+ /// Both builtin CAS implementations provide lifetime for free, so this can
+ /// be const, and readData() and getDataSize() can be implemented on top of
+ /// it.
+ virtual ArrayRef<char> getDataConst(ObjectHandle Node) const = 0;
+
+ ArrayRef<char> getData(ObjectHandle Node,
+ bool RequiresNullTerminator) const final {
+ // BuiltinCAS Objects are always null terminated.
+ return getDataConst(Node);
+ }
+ uint64_t getDataSize(ObjectHandle Node) const final {
+ return getDataConst(Node).size();
+ }
+
+ Error createUnknownObjectError(const CASID &ID) const {
+ return createStringError(std::make_error_code(std::errc::invalid_argument),
+ "unknown object '" + ID.toString() + "'");
+ }
+
+ Error createCorruptObjectError(const CASID &ID) const {
+ return createStringError(std::make_error_code(std::errc::invalid_argument),
+ "corrupt object '" + ID.toString() + "'");
+ }
+
+ Error createCorruptStorageError() const {
+ return createStringError(std::make_error_code(std::errc::invalid_argument),
+ "corrupt storage");
+ }
+
+ Error validate(const CASID &ID) final;
+};
+
+} // end namespace builtin
+} // end namespace llvm::cas
+
+#endif // LLVM_LIB_CAS_BUILTINCAS_H
diff --git a/llvm/lib/CAS/CMakeLists.txt b/llvm/lib/CAS/CMakeLists.txt
new file mode 100644
index 0000000000000..a486ab66ae426
--- /dev/null
+++ b/llvm/lib/CAS/CMakeLists.txt
@@ -0,0 +1,8 @@
+add_llvm_component_library(LLVMCAS
+ BuiltinCAS.cpp
+ InMemoryCAS.cpp
+ ObjectStore.cpp
+
+ ADDITIONAL_HEADER_DIRS
+ ${LLVM_MAIN_INCLUDE_DIR}/llvm/CAS
+)
diff --git a/llvm/lib/CAS/InMemoryCAS.cpp b/llvm/lib/CAS/InMemoryCAS.cpp
new file mode 100644
index 0000000000000..abdd7ed3ef805
--- /dev/null
+++ b/llvm/lib/CAS/InMemoryCAS.cpp
@@ -0,0 +1,320 @@
+//===- InMemoryCAS.cpp ------------------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "BuiltinCAS.h"
+#include "llvm/ADT/LazyAtomicPointer.h"
+#include "llvm/ADT/PointerIntPair.h"
+#include "llvm/ADT/TrieRawHashMap.h"
+#include "llvm/Support/Allocator.h"
+#include "llvm/Support/Casting.h"
+#include "llvm/Support/ThreadSafeAllocator.h"
+
+using namespace llvm;
+using namespace llvm::cas;
+using namespace llvm::cas::builtin;
+
+namespace {
+
+class InMemoryObject;
+
+/// Index of referenced IDs (map: Hash -> InMemoryObject*). Uses
+/// LazyAtomicPointer to coordinate creation of objects.
+using InMemoryIndexT =
+ ThreadSafeTrieRawHashMap<LazyAtomicPointer<const InMemoryObject>,
+ sizeof(HashType)>;
+
+/// Values in \a InMemoryIndexT. Each \a InMemoryObject points at one of these
+/// to access its hash.
+using InMemoryIndexValueT = InMemoryIndexT::value_type;
+
+class InMemoryObject {
+public:
+ enum class Kind {
+ /// Node with refs and data.
+ RefNode,
+
+ /// Node with refs and data co-allocated.
+ InlineNode,
+
+ Max = InlineNode,
+ };
+
+ Kind getKind() const { return IndexAndKind.getInt(); }
+ const InMemoryIndexValueT &getIndex() const {
+ assert(IndexAndKind.getPointer());
+ return *IndexAndKind.getPointer();
+ }
+
+ ArrayRef<uint8_t> getHash() const { return getIndex().Hash; }
+
+ InMemoryObject() = delete;
+ InMemoryObject(InMemoryObject &&) = delete;
+ InMemoryObject(const InMemoryObject &) = delete;
+
+protected:
+ InMemoryObject(Kind K, const InMemoryIndexValueT &I) : IndexAndKind(&I, K) {}
+
+private:
+ enum Counts : int {
+ NumKindBits = 2,
+ };
+ PointerIntPair<const InMemoryIndexValueT *, NumKindBits, Kind> IndexAndKind;
+ static_assert((1U << NumKindBits) <= alignof(InMemoryIndexValueT),
+ "Kind will clobber pointer");
+ static_assert(((int)Kind::Max >> NumKindBits) == 0, "Kind will be truncated");
+
+public:
+ inline ArrayRef<char> getData() const;
+
+ inline ArrayRef<const InMemoryObject *> getRefs() const;
+};
+
+class InMemoryRefObject : public InMemoryObject {
+public:
+ static constexpr Kind KindValue = Kind::RefNode;
+ static bool classof(const InMemoryObject *O) {
+ return O->getKind() == KindValue;
+ }
+
+ ArrayRef<const InMemoryObject *> getRefsImpl() const { return Refs; }
+ ArrayRef<const InMemoryObject *> getRefs() const { return Refs; }
+ ArrayRef<char> getDataImpl() const { return Data; }
+ ArrayRef<char> getData() const { return Data; }
+
+ static InMemoryRefObject &create(function_ref<void *(size_t Size)> Allocate,
+ const InMemoryIndexValueT &I,
+ ArrayRef<const InMemoryObject *> Refs,
+ ArrayRef<char> Data) {
+ void *Mem = Allocate(sizeof(InMemoryRefObject));
+ return *new (Mem) InMemoryRefObject(I, Refs, Data);
+ }
+
+private:
+ InMemoryRefObject(const InMemoryIndexValueT &I,
+ ArrayRef<const InMemoryObject *> Refs, ArrayRef<char> Data)
+ : InMemoryObject(KindValue, I), Refs(Refs), Data(Data) {
+ assert(isAddrAligned(Align(8), this) && "Expected 8-byte alignment");
+ assert(isAddrAligned(Align(8), Data.data()) && "Expected 8-byte alignment");
+ assert(*Data.end() == 0 && "Expected null-termination");
+ }
+
+ ArrayRef<const InMemoryObject *> Refs;
+ ArrayRef<char> Data;
+};
+
+class InMemoryInlineObject : public InMemoryObject {
+public:
+ static constexpr Kind KindValue = Kind::InlineNode;
+ static bool classof(const InMemoryObject *O) {
+ return O->getKind() == KindValue;
+ }
+
+ ArrayRef<const InMemoryObject *> getRefs() const { return getRefsImpl(); }
+ ArrayRef<const InMemoryObject *> getRefsImpl() const {
+ return ArrayRef(reinterpret_cast<const InMemoryObject *const *>(this + 1),
+ NumRefs);
+ }
+
+ ArrayRef<char> getData() const { return getDataImpl(); }
+ ArrayRef<char> getDataImpl() const {
+ ArrayRef<const InMemoryObject *> Refs = getRefs();
+ return ArrayRef(reinterpret_cast<const char *>(Refs.data() + Refs.size()),
+ DataSize);
+ }
+
+ static InMemoryInlineObject &
+ create(function_ref<void *(size_t Size)> Allocate,
+ const InMemoryIndexValueT &I, ArrayRef<const InMemoryObject *> Refs,
+ ArrayRef<char> Data) {
+ void *Mem = Allocate(sizeof(InMemoryInlineObject) +
+ sizeof(uintptr_t) * Refs.size() + Data.size() + 1);
+ return *new (Mem) InMemoryInlineObject(I, Refs, Data);
+ }
+
+private:
+ InMemoryInlineObject(const InMemoryIndexValueT &I,
+ ArrayRef<const InMemoryObject *> Refs,
+ ArrayRef<char> Data)
+ : InMemoryObject(KindValue, I), NumRefs(Refs.size()),
+ DataSize(Data.size()) {
+ auto *BeginRefs = reinterpret_cast<const InMemoryObject **>(this + 1);
+ llvm::copy(Refs, BeginRefs);
+ auto *BeginData = reinterpret_cast<char *>(BeginRefs + NumRefs);
+ llvm::copy(Data, BeginData);
+ BeginData[Data.size()] = 0;
+ }
+ uint32_t NumRefs;
+ uint32_t DataSize;
+};
+
+/// In-memory CAS database and action cache (the latter should be separated).
+class InMemoryCAS : public BuiltinCAS {
+public:
+ Expected<ObjectRef> storeImpl(ArrayRef<uint8_t> ComputedHash,
+ ArrayRef<ObjectRef> Refs,
+ ArrayRef<char> Data) final;
+
+ Expected<ObjectRef>
+ storeFromNullTerminatedRegion(ArrayRef<uint8_t> ComputedHash,
+ sys::fs::mapped_file_region Map) override;
+
+ CASID getID(const InMemoryIndexValueT &I) const {
+ StringRef Hash = toStringRef(I.Hash);
+ return CASID::create(&getContext(), Hash);
+ }
+ CASID getID(const InMemoryObject &O) const { return getID(O.getIndex()); }
+
+ ObjectHandle getObjectHandle(const InMemoryObject &Node) const {
+ assert(!(reinterpret_cast<uintptr_t>(&Node) & 0x1ULL));
+ return makeObjectHandle(reinterpret_cast<uintptr_t>(&Node));
+ }
+
+ Expected<std::optional<ObjectHandle>> loadIfExists(ObjectRef Ref) override {
+ return getObjectHandle(asInMemoryObject(Ref));
+ }
+
+ InMemoryIndexValueT &indexHash(ArrayRef<uint8_t> Hash) {
+ return *Index.insertLazy(
+ Hash, [](auto ValueConstructor) { ValueConstructor.emplace(nullptr); });
+ }
+
+ /// TODO: Consider having callers actually do an insert and return a handle
+ /// to the slot in the trie.
+ const InMemoryObject *getInMemoryObject(CASID ID) const {
+ assert(ID.getContext().getHashSchemaIdentifier() ==
+ getContext().getHashSchemaIdentifier() &&
+ "Expected ID from same hash schema");
+ if (InMemoryIndexT::const_pointer P = Index.find(ID.getHash()))
+ return P->Data;
+ return nullptr;
+ }
+
+ const InMemoryObject &getInMemoryObject(ObjectHandle OH) const {
+ return *reinterpret_cast<const InMemoryObject *>(
+ (uintptr_t)OH.getInternalRef(*this));
+ }
+
+ const InMemoryObject &asInMemoryObject(ReferenceBase Ref) const {
+ uintptr_t P = Ref.getInternalRef(*this);
+ return *reinterpret_cast<const InMemoryObject *>(P);
+ }
+ ObjectRef toReference(const InMemoryObject &O) const {
+ return makeObjectRef(reinterpret_cast<uintptr_t>(&O));
+ }
+
+ CASID getID(ObjectRef Ref) const final { return getIDImpl(Ref); }
+ CASID getIDImpl(ReferenceBase Ref) const {
+ return getID(asInMemoryObject(Ref));
+ }
+
+ std::optional<ObjectRef> getReference(const CASID &ID) const final {
+ if (const InMemoryObject *Object = getInMemoryObject(ID))
+ return toReference(*Object);
+ return std::nullopt;
+ }
+
+ Expected<bool> isMaterialized(ObjectRef Ref) const final { return true; }
+
+ ArrayRef<char> getDataConst(ObjectHandle Node) const final {
+ return cast<InMemoryObject>(asInMemoryObject(Node)).getData();
+ }
+
+ InMemoryCAS() = default;
+
+private:
+ size_t getNumRefs(ObjectHandle Node) const final {
+ return getInMemoryObject(Node).getRefs().size();
+ }
+ ObjectRef readRef(ObjectHandle Node, size_t I) const final {
+ return toReference(*getInMemoryObject(Node).getRefs()[I]);
+ }
+ Error forEachRef(ObjectHandle Node,
+ function_ref<Error(ObjectRef)> Callback) const final;
+
+ /// Index of referenced IDs (map: Hash -> InMemoryObject*). Mapped to nullptr
+ /// as a convenient way to store hashes.
+ ///
+ /// - Insert nullptr on lookups.
+ /// - InMemoryObject points back to here.
+ InMemoryIndexT Index;
+
+ ThreadSafeAllocator<BumpPtrAllocator> Objects;
+ ThreadSafeAllocator<SpecificBumpPtrAllocator<sys::fs::mapped_file_region>>
+ MemoryMaps;
+};
+
+} // end anonymous namespace
+
+ArrayRef<char> InMemoryObject::getData() const {
+ if (auto *Derived = dyn_cast<InMemoryRefObject>(this))
+ return Derived->getDataImpl();
+ return cast<InMemoryInlineObject>(this)->getDataImpl();
+}
+
+ArrayRef<const InMemoryObject *> InMemoryObject::getRefs() const {
+ if (auto *Derived = dyn_cast<InMemoryRefObject>(this))
+ return Derived->getRefsImpl();
+ return cast<InMemoryInlineObject>(this)->getRefsImpl();
+}
+
+Expected<ObjectRef>
+InMemoryCAS::storeFromNullTerminatedRegion(ArrayRef<uint8_t> ComputedHash,
+ sys::fs::mapped_file_region Map) {
+ // Look up the hash in the index, initializing to nullptr if it's new.
+ ArrayRef<char> Data(Map.data(), Map.size());
+ auto &I = indexHash(ComputedHash);
+
+ // Load or generate.
+ auto Allocator = [&](size_t Size) -> void * {
+ return Objects.Allocate(Size, alignof(InMemoryObject));
+ };
+ auto Generator = [&]() -> const InMemoryObject * {
+ return &InMemoryRefObject::create(Allocator, I, std::nullopt, Data);
+ };
+ const InMemoryObject &Node =
+ cast<InMemoryObject>(I.Data.loadOrGenerate(Generator));
+
+ // Save Map if the winning node uses it.
+ if (auto *RefNode = dyn_cast<InMemoryRefObject>(&Node))
+ if (RefNode->getData().data() == Map.data())
+ new (MemoryMaps.Allocate(1)) sys::fs::mapped_file_region(std::move(Map));
+
+ return toReference(Node);
+}
+
+Expected<ObjectRef> InMemoryCAS::storeImpl(ArrayRef<uint8_t> ComputedHash,
+ ArrayRef<ObjectRef> Refs,
+ ArrayRef<char> Data) {
+ // Look up the hash in the index, initializing to nullptr if it's new.
+ auto &I = indexHash(ComputedHash);
+
+ // Create the node.
+ SmallVector<const InMemoryObject *> InternalRefs;
+ for (ObjectRef Ref : Refs)
+ InternalRefs.push_back(&asInMemoryObject(Ref));
+ auto Allocator = [&](size_t Size) -> void * {
+ return Objects.Allocate(Size, alignof(InMemoryObject));
+ };
+ auto Generator = [&]() -> const InMemoryObject * {
+ return &InMemoryInlineObject::create(Allocator, I, InternalRefs, Data);
+ };
+ return toReference(cast<InMemoryObject>(I.Data.loadOrGenerate(Generator)));
+}
+
+Error InMemoryCAS::forEachRef(ObjectHandle Handle,
+ function_ref<Error(ObjectRef)> Callback) const {
+ auto &Node = getInMemoryObject(Handle);
+ for (const InMemoryObject *Ref : Node.getRefs())
+ if (Error E = Callback(toReference(*Ref)))
+ return E;
+ return Error::success();
+}
+
+std::unique_ptr<ObjectStore> cas::createInMemoryCAS() {
+ return std::make_unique<InMemoryCAS>();
+}
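Review note on `InMemoryCAS::storeImpl()`: the key behavior is that storing the same refs and data twice yields the same object, because the hash is looked up in the index before anything is allocated. A single-threaded sketch of that idea follows. Everything in it is hypothetical: `SketchStore` and `SketchNode` are illustration names, FNV-1a is a non-cryptographic stand-in for the builtin hasher, hash collisions are not handled, and none of the thread-safety machinery of the patch (`ThreadSafeTrieRawHashMap`, `LazyAtomicPointer`) is modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Hypothetical node: its own hash, pointers to referenced nodes, and data.
struct SketchNode {
  uint64_t Hash;
  std::vector<const SketchNode *> Refs;
  std::string Data;
};

class SketchStore {
public:
  // Returns the existing node for this content, or creates one.
  const SketchNode &store(const std::vector<const SketchNode *> &Refs,
                          std::string_view Data) {
    uint64_t Hash = hashObject(Refs, Data);
    auto It = Index.find(Hash);
    if (It != Index.end())
      return *It->second; // already stored: deduplicate
    Nodes.push_back(SketchNode{Hash, Refs, std::string(Data)});
    Index.emplace(Hash, &Nodes.back());
    return Nodes.back();
  }

  std::size_t size() const { return Nodes.size(); }

private:
  // FNV-1a over a byte range, continuing from state H.
  static uint64_t mix(uint64_t H, const void *P, std::size_t N) {
    const auto *Bytes = static_cast<const unsigned char *>(P);
    for (std::size_t I = 0; I < N; ++I) {
      H ^= Bytes[I];
      H *= 1099511628211ULL;
    }
    return H;
  }

  // Hash covers the ref count, each referenced node's hash, and the data.
  static uint64_t hashObject(const std::vector<const SketchNode *> &Refs,
                             std::string_view Data) {
    uint64_t H = 14695981039346656037ULL;
    uint64_t NumRefs = Refs.size();
    H = mix(H, &NumRefs, sizeof(NumRefs));
    for (const SketchNode *Ref : Refs)
      H = mix(H, &Ref->Hash, sizeof(Ref->Hash));
    return mix(H, Data.data(), Data.size());
  }

  std::unordered_map<uint64_t, const SketchNode *> Index;
  std::deque<SketchNode> Nodes; // deque keeps element addresses stable
};
```

The `std::deque` plays the role of the bump allocator here: nodes are never moved or freed while the store is alive, so handing out references to them is safe.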
diff --git a/llvm/lib/CAS/ObjectStore.cpp b/llvm/lib/CAS/ObjectStore.cpp
new file mode 100644
index 0000000000000..a938c4e215382
--- /dev/null
+++ b/llvm/lib/CAS/ObjectStore.cpp
@@ -0,0 +1,168 @@
+//===- ObjectStore.cpp ------------------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "llvm/CAS/ObjectStore.h"
+#include "llvm/ADT/DenseSet.h"
+#include "llvm/Support/Debug.h"
+#include "llvm/Support/Errc.h"
+#include "llvm/Support/FileSystem.h"
+#include "llvm/Support/MemoryBuffer.h"
+
+using namespace llvm;
+using namespace llvm::cas;
+
+void CASContext::anchor() {}
+void ObjectStore::anchor() {}
+
+LLVM_DUMP_METHOD void CASID::dump() const { print(dbgs()); }
+LLVM_DUMP_METHOD void ObjectStore::dump() const { print(dbgs()); }
+LLVM_DUMP_METHOD void ObjectRef::dump() const { print(dbgs()); }
+LLVM_DUMP_METHOD void ObjectHandle::dump() const { print(dbgs()); }
+
+std::string CASID::toString() const {
+ std::string S;
+ raw_string_ostream(S) << *this;
+ return S;
+}
+
+static void printReferenceBase(raw_ostream &OS, StringRef Kind,
+ uint64_t InternalRef, std::optional<CASID> ID) {
+ OS << Kind << "=" << InternalRef;
+ if (ID)
+ OS << "[" << *ID << "]";
+}
+
+void ReferenceBase::print(raw_ostream &OS, const ObjectHandle &This) const {
+ assert(this == &This);
+ printReferenceBase(OS, "object-handle", InternalRef, std::nullopt);
+}
+
+void ReferenceBase::print(raw_ostream &OS, const ObjectRef &This) const {
+ assert(this == &This);
+
+ std::optional<CASID> ID;
+#if LLVM_ENABLE_ABI_BREAKING_CHECKS
+ if (CAS)
+ ID = CAS->getID(This);
+#endif
+ printReferenceBase(OS, "object-ref", InternalRef, ID);
+}
+
+Expected<ObjectHandle> ObjectStore::load(ObjectRef Ref) {
+ std::optional<ObjectHandle> Handle;
+ if (Error E = loadIfExists(Ref).moveInto(Handle))
+ return std::move(E);
+ if (!Handle)
+ return createStringError(errc::invalid_argument,
+ "missing object '" + getID(Ref).toString() + "'");
+ return *Handle;
+}
+
+std::unique_ptr<MemoryBuffer>
+ObjectStore::getMemoryBuffer(ObjectHandle Node, StringRef Name,
+ bool RequiresNullTerminator) {
+ return MemoryBuffer::getMemBuffer(
+ toStringRef(getData(Node, RequiresNullTerminator)), Name,
+ RequiresNullTerminator);
+}
+
+void ObjectStore::readRefs(ObjectHandle Node,
+ SmallVectorImpl<ObjectRef> &Refs) const {
+ consumeError(forEachRef(Node, [&Refs](ObjectRef Ref) -> Error {
+ Refs.push_back(Ref);
+ return Error::success();
+ }));
+}
+
+Expected<ObjectProxy> ObjectStore::getProxy(const CASID &ID) {
+ std::optional<ObjectRef> Ref = getReference(ID);
+ if (!Ref)
+ return createUnknownObjectError(ID);
+
+ return getProxy(*Ref);
+}
+
+Expected<ObjectProxy> ObjectStore::getProxy(ObjectRef Ref) {
+ std::optional<ObjectHandle> H;
+ if (Error E = load(Ref).moveInto(H))
+ return std::move(E);
+
+ return ObjectProxy::load(*this, Ref, *H);
+}
+
+Expected<std::optional<ObjectProxy>>
+ObjectStore::getProxyIfExists(ObjectRef Ref) {
+ std::optional<ObjectHandle> H;
+ if (Error E = loadIfExists(Ref).moveInto(H))
+ return std::move(E);
+ if (!H)
+ return std::nullopt;
+ return ObjectProxy::load(*this, Ref, *H);
+}
+
+Error ObjectStore::createUnknownObjectError(const CASID &ID) {
+ return createStringError(std::make_error_code(std::errc::invalid_argument),
+ "unknown object '" + ID.toString() + "'");
+}
+
+Expected<ObjectProxy> ObjectStore::createProxy(ArrayRef<ObjectRef> Refs,
+ StringRef Data) {
+ Expected<ObjectRef> Ref = store(Refs, arrayRefFromStringRef<char>(Data));
+ if (!Ref)
+ return Ref.takeError();
+ return getProxy(*Ref);
+}
+
+Expected<ObjectRef>
+ObjectStore::storeFromOpenFileImpl(sys::fs::file_t FD,
+ std::optional<sys::fs::file_status> Status) {
+ // Copy the file into an immutable memory buffer and call \c store on that.
+ // Using \c mmap would be unsafe because there's a race window between when we
+ // get the digest hash for the \c mmap contents and when we store the data; if
+ // the file changes in-between we will create an invalid object.
+
+ // FIXME: For the on-disk CAS implementation use cloning to store it as a
+ // standalone file if the file-system supports it and the file is large.
+
+ constexpr size_t ChunkSize = 4 * 4096;
+ SmallString<0> Data;
+ Data.reserve(ChunkSize * 2);
+ if (Error E = sys::fs::readNativeFileToEOF(FD, Data, ChunkSize))
+ return std::move(E);
+ return store(std::nullopt, ArrayRef(Data.data(), Data.size()));
+}
+
+Error ObjectStore::validateTree(ObjectRef Root) {
+ SmallDenseSet<ObjectRef> ValidatedRefs;
+ SmallVector<ObjectRef, 16> RefsToValidate;
+ RefsToValidate.push_back(Root);
+
+ while (!RefsToValidate.empty()) {
+ ObjectRef Ref = RefsToValidate.pop_back_val();
+ auto [I, Inserted] = ValidatedRefs.insert(Ref);
+ if (!Inserted)
+ continue; // already validated.
+ if (Error E = validate(getID(Ref)))
+ return E;
+ Expected<ObjectHandle> Obj = load(Ref);
+ if (!Obj)
+ return Obj.takeError();
+ if (Error E = forEachRef(*Obj, [&RefsToValidate](ObjectRef R) -> Error {
+ RefsToValidate.push_back(R);
+ return Error::success();
+ }))
+ return E;
+ }
+ return Error::success();
+}
+
+std::unique_ptr<MemoryBuffer>
+ObjectProxy::getMemoryBuffer(StringRef Name,
+ bool RequiresNullTerminator) const {
+ return CAS->getMemoryBuffer(H, Name, RequiresNullTerminator);
+}
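Review note on `ObjectStore::validateTree()`: it is a worklist traversal with a visited set, so a node that is reachable along several paths is validated only once and the first failure stops the walk. A standalone sketch of that pattern is below. The graph representation and the names `SketchGraph` and `validateReachable` are assumptions for illustration; node indices stand in for `ObjectRef`, and a `bool` result stands in for `llvm::Error`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_set>
#include <vector>

// Hypothetical graph: G[N] lists the nodes that node N references.
using SketchGraph = std::vector<std::vector<std::size_t>>;

// Validates every node reachable from Root exactly once. Returns false as
// soon as Validate rejects a node, true if all reachable nodes pass.
inline bool
validateReachable(const SketchGraph &G, std::size_t Root,
                  const std::function<bool(std::size_t)> &Validate) {
  std::unordered_set<std::size_t> Seen;
  std::vector<std::size_t> Worklist{Root};
  while (!Worklist.empty()) {
    std::size_t Node = Worklist.back();
    Worklist.pop_back();
    if (!Seen.insert(Node).second)
      continue; // already validated via another path
    if (!Validate(Node))
      return false;
    // Queue the references; duplicates are filtered when popped.
    for (std::size_t Ref : G[Node])
      Worklist.push_back(Ref);
  }
  return true;
}
```

On a diamond-shaped graph (a root with two children that share one grandchild) the shared node is pushed twice but validated once, which is what the `Inserted` check in the patch achieves.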
diff --git a/llvm/lib/CMakeLists.txt b/llvm/lib/CMakeLists.txt
index 503c77cb13bd0..b06f4ffd83ff5 100644
--- a/llvm/lib/CMakeLists.txt
+++ b/llvm/lib/CMakeLists.txt
@@ -9,6 +9,7 @@ add_subdirectory(FileCheck)
add_subdirectory(InterfaceStub)
add_subdirectory(IRPrinter)
add_subdirectory(IRReader)
+add_subdirectory(CAS)
add_subdirectory(CGData)
add_subdirectory(CodeGen)
add_subdirectory(CodeGenTypes)
diff --git a/llvm/unittests/CAS/CASTestConfig.cpp b/llvm/unittests/CAS/CASTestConfig.cpp
new file mode 100644
index 0000000000000..bb06ee5573134
--- /dev/null
+++ b/llvm/unittests/CAS/CASTestConfig.cpp
@@ -0,0 +1,22 @@
+//===- CASTestConfig.cpp --------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "CASTestConfig.h"
+#include "llvm/CAS/ObjectStore.h"
+#include "gtest/gtest.h"
+
+using namespace llvm;
+using namespace llvm::cas;
+
+CASTestingEnv createInMemory(int I) {
+ std::unique_ptr<ObjectStore> CAS = createInMemoryCAS();
+ return CASTestingEnv{std::move(CAS)};
+}
+
+INSTANTIATE_TEST_SUITE_P(InMemoryCAS, CASTest,
+ ::testing::Values(createInMemory));
diff --git a/llvm/unittests/CAS/CASTestConfig.h b/llvm/unittests/CAS/CASTestConfig.h
new file mode 100644
index 0000000000000..d9f9e52033c2d
--- /dev/null
+++ b/llvm/unittests/CAS/CASTestConfig.h
@@ -0,0 +1,32 @@
+//===- CASTestConfig.h ----------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "llvm/CAS/ObjectStore.h"
+#include "gtest/gtest.h"
+
+#ifndef LLVM_UNITTESTS_CASTESTCONFIG_H
+#define LLVM_UNITTESTS_CASTESTCONFIG_H
+
+struct CASTestingEnv {
+ std::unique_ptr<llvm::cas::ObjectStore> CAS;
+};
+
+class CASTest
+ : public testing::TestWithParam<std::function<CASTestingEnv(int)>> {
+protected:
+ std::optional<int> NextCASIndex;
+
+ std::unique_ptr<llvm::cas::ObjectStore> createObjectStore() {
+ auto TD = GetParam()(++(*NextCASIndex));
+ return std::move(TD.CAS);
+ }
+ void SetUp() { NextCASIndex = 0; }
+ void TearDown() { NextCASIndex = std::nullopt; }
+};
+
+#endif
diff --git a/llvm/unittests/CAS/CMakeLists.txt b/llvm/unittests/CAS/CMakeLists.txt
new file mode 100644
index 0000000000000..39a2100c4909e
--- /dev/null
+++ b/llvm/unittests/CAS/CMakeLists.txt
@@ -0,0 +1,12 @@
+set(LLVM_LINK_COMPONENTS
+ Support
+ CAS
+ TestingSupport
+ )
+
+add_llvm_unittest(CASTests
+ CASTestConfig.cpp
+ ObjectStoreTest.cpp
+ )
+
+target_link_libraries(CASTests PRIVATE LLVMTestingSupport)
diff --git a/llvm/unittests/CAS/ObjectStoreTest.cpp b/llvm/unittests/CAS/ObjectStoreTest.cpp
new file mode 100644
index 0000000000000..0d94731330b1d
--- /dev/null
+++ b/llvm/unittests/CAS/ObjectStoreTest.cpp
@@ -0,0 +1,360 @@
+//===- ObjectStoreTest.cpp ------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "llvm/CAS/ObjectStore.h"
+#include "llvm/Support/Process.h"
+#include "llvm/Support/ThreadPool.h"
+#include "llvm/Testing/Support/Error.h"
+#include "gtest/gtest.h"
+
+#include "CASTestConfig.h"
+
+using namespace llvm;
+using namespace llvm::cas;
+
+TEST_P(CASTest, PrintIDs) {
+ std::unique_ptr<ObjectStore> CAS = createObjectStore();
+
+ std::optional<CASID> ID1, ID2;
+ ASSERT_THAT_ERROR(CAS->createProxy(std::nullopt, "1").moveInto(ID1),
+ Succeeded());
+ ASSERT_THAT_ERROR(CAS->createProxy(std::nullopt, "2").moveInto(ID2),
+ Succeeded());
+ EXPECT_NE(ID1, ID2);
+ std::string PrintedID1 = ID1->toString();
+ std::string PrintedID2 = ID2->toString();
+ EXPECT_NE(PrintedID1, PrintedID2);
+
+ std::optional<CASID> ParsedID1, ParsedID2;
+ ASSERT_THAT_ERROR(CAS->parseID(PrintedID1).moveInto(ParsedID1), Succeeded());
+ ASSERT_THAT_ERROR(CAS->parseID(PrintedID2).moveInto(ParsedID2), Succeeded());
+ EXPECT_EQ(ID1, ParsedID1);
+ EXPECT_EQ(ID2, ParsedID2);
+}
+
+TEST_P(CASTest, Blobs) {
+ std::unique_ptr<ObjectStore> CAS1 = createObjectStore();
+ StringRef ContentStrings[] = {
+ "word",
+ "some longer text std::string's local memory",
+ R"(multiline text multiline text multiline text multiline text
+multiline text multiline text multiline text multiline text multiline text
+multiline text multiline text multiline text multiline text multiline text
+multiline text multiline text multiline text multiline text multiline text
+multiline text multiline text multiline text multiline text multiline text
+multiline text multiline text multiline text multiline text multiline text)",
+ };
+
+ SmallVector<CASID> IDs;
+ for (StringRef Content : ContentStrings) {
+ // Use StringRef::str() to create a temporary std::string. This could cause
+ // problems if the CAS is storing references to the input string instead of
+ // copying it.
+ std::optional<ObjectProxy> Blob;
+ ASSERT_THAT_ERROR(CAS1->createProxy(std::nullopt, Content).moveInto(Blob),
+ Succeeded());
+ IDs.push_back(Blob->getID());
+
+ // Check basic printing of IDs.
+ EXPECT_EQ(IDs.back().toString(), IDs.back().toString());
+ if (IDs.size() > 2)
+ EXPECT_NE(IDs.front().toString(), IDs.back().toString());
+ }
+
+ // Check that the blobs give the same IDs later.
+ for (int I = 0, E = IDs.size(); I != E; ++I) {
+ std::optional<ObjectProxy> Blob;
+ ASSERT_THAT_ERROR(
+ CAS1->createProxy(std::nullopt, ContentStrings[I]).moveInto(Blob),
+ Succeeded());
+ EXPECT_EQ(IDs[I], Blob->getID());
+ }
+
+ // Run validation on all CASIDs.
+ for (int I = 0, E = IDs.size(); I != E; ++I)
+ ASSERT_THAT_ERROR(CAS1->validate(IDs[I]), Succeeded());
+
+ // Check that the blobs can be retrieved multiple times.
+ for (int I = 0, E = IDs.size(); I != E; ++I) {
+ for (int J = 0, JE = 3; J != JE; ++J) {
+ std::optional<ObjectProxy> Buffer;
+ ASSERT_THAT_ERROR(CAS1->getProxy(IDs[I]).moveInto(Buffer), Succeeded());
+ EXPECT_EQ(ContentStrings[I], Buffer->getData());
+ }
+ }
+
+ // Confirm these blobs don't exist in a fresh CAS instance.
+ std::unique_ptr<ObjectStore> CAS2 = createObjectStore();
+ for (int I = 0, E = IDs.size(); I != E; ++I) {
+ std::optional<ObjectProxy> Proxy;
+ EXPECT_THAT_ERROR(CAS2->getProxy(IDs[I]).moveInto(Proxy), Failed());
+ }
+
+ // Insert into the second CAS and confirm the IDs are stable. Getting them
+ // should work now.
+ for (int I = IDs.size(), E = 0; I != E; --I) {
+ auto &ID = IDs[I - 1];
+ auto &Content = ContentStrings[I - 1];
+ std::optional<ObjectProxy> Blob;
+ ASSERT_THAT_ERROR(CAS2->createProxy(std::nullopt, Content).moveInto(Blob),
+ Succeeded());
+ EXPECT_EQ(ID, Blob->getID());
+
+ std::optional<ObjectProxy> Buffer;
+ ASSERT_THAT_ERROR(CAS2->getProxy(ID).moveInto(Buffer), Succeeded());
+ EXPECT_EQ(Content, Buffer->getData());
+ }
+}
+
+TEST_P(CASTest, BlobsBig) {
+ // A little bit of validation that bigger blobs are okay. Climb up to 1MB.
+ std::unique_ptr<ObjectStore> CAS = createObjectStore();
+ SmallString<256> String1 = StringRef("a few words");
+ SmallString<256> String2 = StringRef("others");
+ while (String1.size() < 1024U * 1024U) {
+ std::optional<CASID> ID1;
+ std::optional<CASID> ID2;
+ ASSERT_THAT_ERROR(CAS->createProxy(std::nullopt, String1).moveInto(ID1),
+ Succeeded());
+ ASSERT_THAT_ERROR(CAS->createProxy(std::nullopt, String1).moveInto(ID2),
+ Succeeded());
+ ASSERT_THAT_ERROR(CAS->validate(*ID1), Succeeded());
+ ASSERT_THAT_ERROR(CAS->validate(*ID2), Succeeded());
+ ASSERT_EQ(ID1, ID2);
+
+ String1.append(String2);
+ ASSERT_THAT_ERROR(CAS->createProxy(std::nullopt, String2).moveInto(ID1),
+ Succeeded());
+ ASSERT_THAT_ERROR(CAS->createProxy(std::nullopt, String2).moveInto(ID2),
+ Succeeded());
+ ASSERT_THAT_ERROR(CAS->validate(*ID1), Succeeded());
+ ASSERT_THAT_ERROR(CAS->validate(*ID2), Succeeded());
+ ASSERT_EQ(ID1, ID2);
+ String2.append(String1);
+ }
+
+ // Specifically check near 1MB for objects large enough they're likely to be
+ // stored externally in an on-disk CAS and will be near a page boundary.
+ SmallString<0> Storage;
+ const size_t InterestingSize = 1024U * 1024ULL;
+ const size_t SizeE = InterestingSize + 2;
+ if (Storage.size() < SizeE)
+ Storage.resize(SizeE, '\01');
+ for (size_t Size = InterestingSize - 2; Size != SizeE; ++Size) {
+ StringRef Data(Storage.data(), Size);
+ std::optional<ObjectProxy> Blob;
+ ASSERT_THAT_ERROR(CAS->createProxy(std::nullopt, Data).moveInto(Blob),
+ Succeeded());
+ ASSERT_EQ(Data, Blob->getData());
+ ASSERT_EQ(0, Blob->getData().end()[0]);
+ }
+}
+
+TEST_P(CASTest, LeafNodes) {
+ std::unique_ptr<ObjectStore> CAS1 = createObjectStore();
+ StringRef ContentStrings[] = {
+ "word",
+ "some longer text std::string's local memory",
+ R"(multiline text multiline text multiline text multiline text
+multiline text multiline text multiline text multiline text multiline text
+multiline text multiline text multiline text multiline text multiline text
+multiline text multiline text multiline text multiline text multiline text
+multiline text multiline text multiline text multiline text multiline text
+multiline text multiline text multiline text multiline text multiline text)",
+ };
+
+ SmallVector<ObjectRef> Nodes;
+ SmallVector<CASID> IDs;
+ for (StringRef Content : ContentStrings) {
+ // Use StringRef::str() to create a temporary std::string. This could cause
+ // problems if the CAS is storing references to the input string instead of
+ // copying it.
+ std::optional<ObjectRef> Node;
+ ASSERT_THAT_ERROR(
+ CAS1->store(std::nullopt, arrayRefFromStringRef<char>(Content))
+ .moveInto(Node),
+ Succeeded());
+ Nodes.push_back(*Node);
+
+ // Check basic printing of IDs.
+ IDs.push_back(CAS1->getID(*Node));
+ EXPECT_EQ(IDs.back().toString(), IDs.back().toString());
+ EXPECT_EQ(Nodes.front(), Nodes.front());
+ EXPECT_EQ(Nodes.back(), Nodes.back());
+ EXPECT_EQ(IDs.front(), IDs.front());
+ EXPECT_EQ(IDs.back(), IDs.back());
+ if (Nodes.size() <= 1)
+ continue;
+ EXPECT_NE(Nodes.front(), Nodes.back());
+ EXPECT_NE(IDs.front(), IDs.back());
+ }
+
+ // Check that the blobs give the same IDs later.
+ for (int I = 0, E = IDs.size(); I != E; ++I) {
+ std::optional<ObjectRef> Node;
+ ASSERT_THAT_ERROR(CAS1->store(std::nullopt, arrayRefFromStringRef<char>(
+ ContentStrings[I]))
+ .moveInto(Node),
+ Succeeded());
+ EXPECT_EQ(IDs[I], CAS1->getID(*Node));
+ }
+
+ // Check that the blobs can be retrieved multiple times.
+ for (int I = 0, E = IDs.size(); I != E; ++I) {
+ for (int J = 0, JE = 3; J != JE; ++J) {
+ std::optional<ObjectProxy> Object;
+ ASSERT_THAT_ERROR(CAS1->getProxy(IDs[I]).moveInto(Object), Succeeded());
+ ASSERT_TRUE(Object);
+ EXPECT_EQ(ContentStrings[I], Object->getData());
+ }
+ }
+
+ // Confirm these blobs don't exist in a fresh CAS instance.
+ std::unique_ptr<ObjectStore> CAS2 = createObjectStore();
+ for (int I = 0, E = IDs.size(); I != E; ++I) {
+ std::optional<ObjectProxy> Object;
+ EXPECT_THAT_ERROR(CAS2->getProxy(IDs[I]).moveInto(Object), Failed());
+ }
+
+ // Insert into the second CAS and confirm the IDs are stable. Getting them
+ // should work now.
+ for (int I = IDs.size(), E = 0; I != E; --I) {
+ auto &ID = IDs[I - 1];
+ auto &Content = ContentStrings[I - 1];
+ std::optional<ObjectRef> Node;
+ ASSERT_THAT_ERROR(
+ CAS2->store(std::nullopt, arrayRefFromStringRef<char>(Content))
+ .moveInto(Node),
+ Succeeded());
+ EXPECT_EQ(ID, CAS2->getID(*Node));
+
+ std::optional<ObjectProxy> Object;
+ ASSERT_THAT_ERROR(CAS2->getProxy(ID).moveInto(Object), Succeeded());
+ ASSERT_TRUE(Object);
+ EXPECT_EQ(Content, Object->getData());
+ }
+}
+
+TEST_P(CASTest, NodesBig) {
+ std::unique_ptr<ObjectStore> CAS = createObjectStore();
+
+ // Specifically check near 1MB for objects large enough they're likely to be
+ // stored externally in an on-disk CAS, and such that one of them will be
+ // near a page boundary.
+ SmallString<0> Storage;
+ constexpr size_t InterestingSize = 1024U * 1024ULL;
+ constexpr size_t WordSize = sizeof(void *);
+
+ // Start much smaller to account for headers.
+ constexpr size_t SizeB = InterestingSize - 8 * WordSize;
+ constexpr size_t SizeE = InterestingSize + 1;
+ if (Storage.size() < SizeE)
+ Storage.resize(SizeE, '\01');
+
+ SmallVector<ObjectRef, 4> CreatedNodes;
+ // Avoid checking every size because this is an expensive test. Just check
+ // for data that is word-aligned, and one byte less. Also pass the nodes
+ // created so far as the references of the next node to check that references
+ // are created correctly.
+ for (size_t Size = SizeB; Size < SizeE; Size += WordSize) {
+ for (bool IsAligned : {false, true}) {
+ StringRef Data(Storage.data(), Size - (IsAligned ? 0 : 1));
+ std::optional<ObjectProxy> Node;
+ ASSERT_THAT_ERROR(CAS->createProxy(CreatedNodes, Data).moveInto(Node),
+ Succeeded());
+ ASSERT_EQ(Data, Node->getData());
+ ASSERT_EQ(0, Node->getData().end()[0]);
+ ASSERT_EQ(Node->getNumReferences(), CreatedNodes.size());
+ CreatedNodes.emplace_back(Node->getRef());
+ }
+ }
+
+ for (auto ID : CreatedNodes)
+ ASSERT_THAT_ERROR(CAS->validate(CAS->getID(ID)), Succeeded());
+}
+
+/// Common test functionality for creating blobs in parallel. You can vary which
+/// cas instances are the same or different, and the size of the created blobs.
+static void testBlobsParallel(ObjectStore &Read1, ObjectStore &Read2,
+ ObjectStore &Write1, ObjectStore &Write2,
+ uint64_t BlobSize) {
+ SCOPED_TRACE("testBlobsParallel");
+ unsigned BlobCount = 100;
+ std::vector<std::string> Blobs;
+ Blobs.reserve(BlobCount);
+ for (unsigned I = 0; I < BlobCount; ++I) {
+ std::string Blob;
+ Blob.reserve(BlobSize);
+ while (Blob.size() < BlobSize) {
+ auto R = sys::Process::GetRandomNumber();
+ Blob.append((char *)&R, sizeof(R));
+ }
+ assert(Blob.size() >= BlobSize);
+ Blob.resize(BlobSize);
+ Blobs.push_back(std::move(Blob));
+ }
+
+ std::mutex NodesMtx;
+ std::vector<std::optional<CASID>> CreatedNodes(BlobCount);
+
+ auto Producer = [&](unsigned I, ObjectStore *CAS) {
+ std::optional<ObjectProxy> Node;
+ EXPECT_THAT_ERROR(CAS->createProxy({}, Blobs[I]).moveInto(Node),
+ Succeeded());
+ {
+ std::lock_guard<std::mutex> L(NodesMtx);
+ CreatedNodes[I] = Node ? Node->getID() : CASID::getDenseMapTombstoneKey();
+ }
+ };
+
+ auto Consumer = [&](unsigned I, ObjectStore *CAS) {
+ std::optional<CASID> ID;
+ while (!ID) {
+ // Busy wait.
+ std::lock_guard<std::mutex> L(NodesMtx);
+ ID = CreatedNodes[I];
+ }
+ if (ID == CASID::getDenseMapTombstoneKey())
+ // Producer failed; already reported.
+ return;
+
+ std::optional<ObjectProxy> Node;
+ ASSERT_THAT_ERROR(CAS->getProxy(*ID).moveInto(Node), Succeeded());
+ EXPECT_EQ(Node->getData(), Blobs[I]);
+ };
+
+ DefaultThreadPool Threads;
+ for (unsigned I = 0; I < BlobCount; ++I) {
+ Threads.async(Consumer, I, &Read1);
+ Threads.async(Consumer, I, &Read2);
+ Threads.async(Producer, I, &Write1);
+ Threads.async(Producer, I, &Write2);
+ }
+
+ Threads.wait();
+}
+
+static void testBlobsParallel1(ObjectStore &CAS, uint64_t BlobSize) {
+ SCOPED_TRACE("testBlobsParallel1");
+ testBlobsParallel(CAS, CAS, CAS, CAS, BlobSize);
+}
+
+TEST_P(CASTest, BlobsParallel) {
+ std::shared_ptr<ObjectStore> CAS = createObjectStore();
+ uint64_t Size = 1ULL * 1024;
+ ASSERT_NO_FATAL_FAILURE(testBlobsParallel1(*CAS, Size));
+}
+
+#ifdef EXPENSIVE_CHECKS
+TEST_P(CASTest, BlobsBigParallel) {
+ std::shared_ptr<ObjectStore> CAS = createObjectStore();
+ // 100k is large enough to be standalone files in our on-disk cas.
+ uint64_t Size = 100ULL * 1024;
+ ASSERT_NO_FATAL_FAILURE(testBlobsParallel1(*CAS, Size));
+}
+#endif
diff --git a/llvm/unittests/CMakeLists.txt b/llvm/unittests/CMakeLists.txt
index 8892f3e75729a..5ebdc3bb4cac1 100644
--- a/llvm/unittests/CMakeLists.txt
+++ b/llvm/unittests/CMakeLists.txt
@@ -34,6 +34,7 @@ add_subdirectory(AsmParser)
add_subdirectory(BinaryFormat)
add_subdirectory(Bitcode)
add_subdirectory(Bitstream)
+add_subdirectory(CAS)
add_subdirectory(CGData)
add_subdirectory(CodeGen)
add_subdirectory(DebugInfo)
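Note (illustrative only, not part of the patch): the Blobs and LeafNodes tests above check three properties of a content-addressed store. Storing the same content twice yields the same ID, an ID obtained from one store does not resolve in a fresh store, and inserting the same content into the second store reproduces the same ID. The sketch below models those properties with a hypothetical `ToyCAS` class. `std::hash` is an assumed stand-in for the BLAKE3 digest used by the builtin CAS and is not collision resistant, so this shows the shape of the contract rather than the real implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <optional>
#include <string>

// Hypothetical toy store, not the LLVM API. std::hash is only a stand-in
// for a cryptographic digest and may collide for unrelated inputs.
class ToyCAS {
public:
  using ID = std::uint64_t;

  // Storing is idempotent: identical content always maps to the same ID.
  ID store(const std::string &Data) {
    ID Key = std::hash<std::string>{}(Data);
    Objects.emplace(Key, Data); // keeps the existing entry if already present
    return Key;
  }

  // Loading can fail: an ID names content but does not guarantee that this
  // particular store holds it.
  std::optional<std::string> load(ID Key) const {
    auto It = Objects.find(Key);
    if (It == Objects.end())
      return std::nullopt;
    return It->second;
  }

private:
  std::map<ID, std::string> Objects;
};
```

With two `ToyCAS` instances, an ID returned by the first one only loads from the second after the same content has been stored there, which is the sequence the tests walk through with `CAS1` and `CAS2`.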
>From ee98c85d7f5274a7e0b86cc839cc9d0ad5a1e05f Mon Sep 17 00:00:00 2001
From: Steven Wu <stevenwu at apple.com>
Date: Wed, 30 Oct 2024 14:54:44 -0700
Subject: [PATCH 2/5] Address review feedback
Created using spr 1.3.5
---
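Note (illustrative only, not part of the commit): this revision switches `InMemoryInlineObject` from hand-written pointer arithmetic to `TrailingObjects`. The layout being managed is a header followed, in the same allocation, by the reference pointers and then the data bytes. A hand-written sketch of that layout is given below, using a hypothetical `ToyInlineNode` class. The trailing NUL byte after the data is an assumption modeled on the `getData().end()[0]` checks in the tests from patch 1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <new>
#include <string>
#include <vector>

// Hypothetical stand-in for an inline node; not the LLVM class.
// One allocation holds: [header][NumRefs pointers][DataSize bytes]['\0'].
class ToyInlineNode {
public:
  static ToyInlineNode *create(const std::vector<const ToyInlineNode *> &Refs,
                               const std::string &Data) {
    std::size_t Bytes = sizeof(ToyInlineNode) +
                        Refs.size() * sizeof(const ToyInlineNode *) +
                        Data.size() + 1; // +1 for the trailing NUL
    void *Mem = ::operator new(Bytes);
    return new (Mem) ToyInlineNode(Refs, Data);
  }

  static void destroy(ToyInlineNode *N) {
    N->~ToyInlineNode();
    ::operator delete(N);
  }

  std::size_t numRefs() const { return NumRefs; }
  std::size_t dataSize() const { return DataSize; }

  // The pointer array starts right after the header...
  const ToyInlineNode *const *refs() const {
    return reinterpret_cast<const ToyInlineNode *const *>(this + 1);
  }
  // ...and the data bytes start right after the pointer array.
  const char *data() const {
    return reinterpret_cast<const char *>(refs() + NumRefs);
  }

private:
  ToyInlineNode(const std::vector<const ToyInlineNode *> &Refs,
                const std::string &Data)
      : NumRefs(Refs.size()), DataSize(Data.size()) {
    auto *RefsOut = reinterpret_cast<const ToyInlineNode **>(this + 1);
    for (std::size_t I = 0; I != NumRefs; ++I)
      RefsOut[I] = Refs[I];
    char *DataOut = reinterpret_cast<char *>(RefsOut + NumRefs);
    std::memcpy(DataOut, Data.data(), DataSize);
    DataOut[DataSize] = '\0';
  }

  std::size_t NumRefs;
  std::size_t DataSize;
};
```

The offsets computed by `refs()` and `data()` are the part that is easy to get wrong by hand, which is presumably why the review asked for `getTrailingObjects<...>()` instead.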
llvm/docs/ContentAddressableStorage.md | 55 +++++++++++++-------------
llvm/include/llvm/CAS/CASReference.h | 14 +------
llvm/lib/CAS/InMemoryCAS.cpp | 23 ++++++-----
llvm/lib/CAS/ObjectStore.cpp | 20 ++++------
4 files changed, 50 insertions(+), 62 deletions(-)
diff --git a/llvm/docs/ContentAddressableStorage.md b/llvm/docs/ContentAddressableStorage.md
index 4f2d9a6a3a918..1cd788382c653 100644
--- a/llvm/docs/ContentAddressableStorage.md
+++ b/llvm/docs/ContentAddressableStorage.md
@@ -6,8 +6,8 @@ Content Addressable Storage, or `CAS`, is a storage system where it assigns
unique addresses to the data stored. It is very useful for data deduplicaton
and creating unique identifiers.
-Unlikely other kind of storage system like file system, CAS is immutable. It
-is more reliable to model a computation when representing the inputs and outputs
+Unlike other kinds of storage system like a file system, CAS is immutable. It
+is more reliable to model a computation by representing the inputs and outputs
of the computation using objects stored in CAS.
The basic unit of the CAS library is a CASObject, where it contains:
@@ -24,11 +24,10 @@ struct CASObject {
}
```
-Such abstraction can allow simple composition of CASObjects into a DAG to
-represent complicated data structure while still allowing data deduplication.
-Note you can compare two DAGs by just comparing the CASObject hash of two
-root nodes.
-
+With this abstraction, it is possible to compose CASObjects into a DAG that is
+capable of representing complicated data structures, while still allowing data
+deduplication. Note you can compare two DAGs by just comparing the CASObject
+hash of two root nodes.
## LLVM CAS Library User Guide
@@ -47,11 +46,11 @@ along. It has following properties:
`ObjectRef` created by different `ObjectStore` cannot be cross-referenced or
compared.
* `ObjectRef` doesn't guarantee the existence of the CASObject it points to. An
-explicitly load is required before accessing the data stored in CASObject.
-This load can also fail, for reasons like but not limited to: object does
+explicit load is required before accessing the data stored in CASObject.
+This load can also fail, for reasons like (but not limited to): object does
not exist, corrupted CAS storage, operation timeout, etc.
-* If two `ObjectRef` are equal, it is guarantee that the object they point to
-(if exists) are identical. If they are not equal, the underlying objects are
+* If two `ObjectRef`s are equal, it is guaranteed that the objects they point to
+are identical (if they exist). If they are not equal, the underlying objects are
guaranteed to be not the same.
### ObjectProxy
@@ -88,33 +87,33 @@ It also provides APIs to convert between `ObjectRef`, `ObjectProxy` and
## CAS Library Implementation Guide
-The LLVM ObjectStore APIs are designed so that it is easy to add
-customized CAS implementation that are interchangeable with builtin
-CAS implementations.
+The LLVM ObjectStore API was designed so that it is easy to add
+customized CAS implementations that are interchangeable with the builtin
+ones.
To add your own implementation, you just need to add a subclass to
`llvm::cas::ObjectStore` and implement all its pure virtual methods.
To be interchangeable with LLVM ObjectStore, the new CAS implementation
needs to conform to following contracts:
-* Different CASObject stored in the ObjectStore needs to have a different hash
-and result in a different `ObjectRef`. Vice versa, same CASObject should have
-same hash and same `ObjectRef`. Note two different CASObjects with identical
-data but different references are considered different objects.
-* `ObjectRef`s are comparable within the same `ObjectStore` instance, and can
-be used to determine the equality of the underlying CASObjects.
-* The loaded objects from the ObjectStore need to have the lifetime to be at
-least as long as the ObjectStore itself.
+* Different CASObjects stored in the ObjectStore need to have a different hash
+and result in a different `ObjectRef`. Similarly, the same CASObject should have
+the same hash and the same `ObjectRef`. Note: two different CASObjects with
+identical data but different references are considered different objects.
+* `ObjectRef`s are only comparable within the same `ObjectStore` instance, and
+can be used to determine the equality of the underlying CASObjects.
+* The loaded objects from the ObjectStore need to have a lifetime at least as
+long as the ObjectStore itself.
If not specified, the behavior can be implementation defined. For example,
`ObjectRef` can be used to point to a loaded CASObject so
`ObjectStore` never fails to load. It is also legal to use a stricter model
-than required. For example, an `ObjectRef` that can be used to compare
-objects between different `ObjectStore` instances is legal but user
-of the ObjectStore should not depend on this behavior.
+than required. For example, an `ObjectRef` can be an unique indentity of
+the objects across multiple `ObjectStore` instances but users of the LLVMCAS
+should not depend on this behavior.
-For CAS library implementer, there is also a `ObjectHandle` class that
+For CAS library implementers, there is also an `ObjectHandle` class that
is an internal representation of a loaded CASObject reference.
-`ObjectProxy` is just a pair of `ObjectHandle` and `ObjectStore`, because
+`ObjectProxy` is just a pair of `ObjectHandle` and `ObjectStore`, and
just like `ObjectRef`, `ObjectHandle` is only useful when paired with
-the ObjectStore that knows about the loaded CASObject.
+the `ObjectStore` that knows about the loaded CASObject.
diff --git a/llvm/include/llvm/CAS/CASReference.h b/llvm/include/llvm/CAS/CASReference.h
index 1f435cf306c4c..e41c04ca2655d 100644
--- a/llvm/include/llvm/CAS/CASReference.h
+++ b/llvm/include/llvm/CAS/CASReference.h
@@ -89,7 +89,7 @@ class ReferenceBase {
#endif
};
-/// Reference to an object in a \a ObjectStore instance.
+/// Reference to an object in an \a ObjectStore instance.
///
/// If you have an ObjectRef, you know the object exists, and you can point at
/// it from new nodes with \a ObjectStore::store(), but you don't know anything
@@ -105,12 +105,6 @@ class ReferenceBase {
/// ObjectHandle, a variant that knows what kind of entity it is. \a
/// ObjectStore::getReferenceKind() can expect the type of reference without
/// asking for unloaded objects to be loaded.
-///
-/// This is a wrapper around a \c uint64_t (and a \a ObjectStore instance when
-/// assertions are on). If necessary, it can be deconstructed and reconstructed
-/// using \a Reference::getInternalRef() and \a
-/// Reference::getFromInternalRef(), but clients aren't expected to need to do
-/// this. These both require the right \a ObjectStore instance.
class ObjectRef : public ReferenceBase {
struct DenseMapTag {};
@@ -122,12 +116,6 @@ class ObjectRef : public ReferenceBase {
return !(LHS == RHS);
}
- /// Allow a reference to be recreated after it's deconstructed.
- static ObjectRef getFromInternalRef(const ObjectStore &CAS,
- uint64_t InternalRef) {
- return ObjectRef(CAS, InternalRef);
- }
-
static ObjectRef getDenseMapEmptyKey() {
return ObjectRef(DenseMapEmptyTag{});
}
diff --git a/llvm/lib/CAS/InMemoryCAS.cpp b/llvm/lib/CAS/InMemoryCAS.cpp
index abdd7ed3ef805..f0305e0d4eafa 100644
--- a/llvm/lib/CAS/InMemoryCAS.cpp
+++ b/llvm/lib/CAS/InMemoryCAS.cpp
@@ -13,6 +13,7 @@
#include "llvm/Support/Allocator.h"
#include "llvm/Support/Casting.h"
#include "llvm/Support/ThreadSafeAllocator.h"
+#include "llvm/Support/TrailingObjects.h"
using namespace llvm;
using namespace llvm::cas;
@@ -69,12 +70,12 @@ class InMemoryObject {
static_assert(((int)Kind::Max >> NumKindBits) == 0, "Kind will be truncated");
public:
- inline ArrayRef<char> getData() const;
+ ArrayRef<char> getData() const;
- inline ArrayRef<const InMemoryObject *> getRefs() const;
+ ArrayRef<const InMemoryObject *> getRefs() const;
};
-class InMemoryRefObject : public InMemoryObject {
+class InMemoryRefObject final : public InMemoryObject {
public:
static constexpr Kind KindValue = Kind::RefNode;
static bool classof(const InMemoryObject *O) {
@@ -107,7 +108,10 @@ class InMemoryRefObject : public InMemoryObject {
ArrayRef<char> Data;
};
-class InMemoryInlineObject : public InMemoryObject {
+class InMemoryInlineObject final
+ : public InMemoryObject,
+ public TrailingObjects<InMemoryInlineObject, const InMemoryObject *,
+ char> {
public:
static constexpr Kind KindValue = Kind::InlineNode;
static bool classof(const InMemoryObject *O) {
@@ -116,15 +120,12 @@ class InMemoryInlineObject : public InMemoryObject {
ArrayRef<const InMemoryObject *> getRefs() const { return getRefsImpl(); }
ArrayRef<const InMemoryObject *> getRefsImpl() const {
- return ArrayRef(reinterpret_cast<const InMemoryObject *const *>(this + 1),
- NumRefs);
+ return ArrayRef(getTrailingObjects<const InMemoryObject *>(), NumRefs);
}
ArrayRef<char> getData() const { return getDataImpl(); }
ArrayRef<char> getDataImpl() const {
- ArrayRef<const InMemoryObject *> Refs = getRefs();
- return ArrayRef(reinterpret_cast<const char *>(Refs.data() + Refs.size()),
- DataSize);
+ return ArrayRef(getTrailingObjects<char>(), DataSize);
}
static InMemoryInlineObject &
@@ -136,6 +137,10 @@ class InMemoryInlineObject : public InMemoryObject {
return *new (Mem) InMemoryInlineObject(I, Refs, Data);
}
+ size_t numTrailingObjects(OverloadToken<const InMemoryObject *>) const {
+ return NumRefs;
+ }
+
private:
InMemoryInlineObject(const InMemoryIndexValueT &I,
ArrayRef<const InMemoryObject *> Refs,
diff --git a/llvm/lib/CAS/ObjectStore.cpp b/llvm/lib/CAS/ObjectStore.cpp
index a938c4e215382..179621cfa296c 100644
--- a/llvm/lib/CAS/ObjectStore.cpp
+++ b/llvm/lib/CAS/ObjectStore.cpp
@@ -12,6 +12,7 @@
#include "llvm/Support/Errc.h"
#include "llvm/Support/FileSystem.h"
#include "llvm/Support/MemoryBuffer.h"
+#include <optional>
using namespace llvm;
using namespace llvm::cas;
@@ -121,20 +122,15 @@ Expected<ObjectProxy> ObjectStore::createProxy(ArrayRef<ObjectRef> Refs,
Expected<ObjectRef>
ObjectStore::storeFromOpenFileImpl(sys::fs::file_t FD,
std::optional<sys::fs::file_status> Status) {
- // Copy the file into an immutable memory buffer and call \c store on that.
- // Using \c mmap would be unsafe because there's a race window between when we
- // get the digest hash for the \c mmap contents and when we store the data; if
- // the file changes in-between we will create an invalid object.
-
- // FIXME: For the on-disk CAS implementation use cloning to store it as a
+ // TODO: For the on-disk CAS implementation use cloning to store it as a
// standalone file if the file-system supports it and the file is large.
+ uint64_t Size = Status ? Status->getSize() : -1;
+ auto Buffer = MemoryBuffer::getOpenFile(FD, /*Filename=*/"", Size);
+ if (!Buffer)
+ return errorCodeToError(Buffer.getError());
- constexpr size_t ChunkSize = 4 * 4096;
- SmallString<0> Data;
- Data.reserve(ChunkSize * 2);
- if (Error E = sys::fs::readNativeFileToEOF(FD, Data, ChunkSize))
- return std::move(E);
- return store(std::nullopt, ArrayRef(Data.data(), Data.size()));
+ return store(std::nullopt,
+ arrayRefFromStringRef<char>((*Buffer)->getBuffer()));
}
Error ObjectStore::validateTree(ObjectRef Root) {
>From 31f6f78c9c4fc4d125395781d03b1053b1a571a8 Mon Sep 17 00:00:00 2001
From: Steven Wu <stevenwu at apple.com>
Date: Wed, 30 Oct 2024 15:32:47 -0700
Subject: [PATCH 3/5] More review feedback for document
Created using spr 1.3.5
---
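Note (illustrative only, not part of the commit): the documentation change below says that comparing `ObjectRef`s from different `ObjectStore` instances is illegal even when the underlying values happen to be globally unique. One way to picture the rule is a reference that remembers its owning store and refuses to answer across stores. The `ToyStore` and `ToyRef` names are hypothetical, and returning `std::nullopt` is an assumption made so the behavior can be observed. The real `ObjectRef` is not claimed to work this way.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical types for illustration; only the store's identity matters.
struct ToyStore {};

class ToyRef {
public:
  ToyRef(const ToyStore &Store, std::uint64_t Value)
      : Owner(&Store), Value(Value) {}

  // Equality is only defined for refs that came from the same store.
  // For refs from different stores there is no answer, even if the raw
  // values are equal.
  std::optional<bool> equals(const ToyRef &Other) const {
    if (Owner != Other.Owner)
      return std::nullopt;
    return Value == Other.Value;
  }

private:
  const ToyStore *Owner;
  std::uint64_t Value;
};
```

Callers that need to relate objects across stores would go through a store-independent identifier instead, which is the role `CASID` plays in the tests from patch 1.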
llvm/docs/ContentAddressableStorage.md | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/llvm/docs/ContentAddressableStorage.md b/llvm/docs/ContentAddressableStorage.md
index 1cd788382c653..deb11101460a6 100644
--- a/llvm/docs/ContentAddressableStorage.md
+++ b/llvm/docs/ContentAddressableStorage.md
@@ -103,14 +103,16 @@ identical data but different references are considered different objects.
* `ObjectRef`s are only comparable within the same `ObjectStore` instance, and
can be used to determine the equality of the underlying CASObjects.
* The loaded objects from the ObjectStore need to have a lifetime at least as
-long as the ObjectStore itself.
+long as the ObjectStore itself, so it is always legal to access the loaded data
+without holding on to the `ObjectProxy` until the `ObjectStore` is destroyed.
+
If not specified, the behavior can be implementation defined. For example,
`ObjectRef` can be used to point to a loaded CASObject so
`ObjectStore` never fails to load. It is also legal to use a stricter model
-than required. For example, an `ObjectRef` can be an unique indentity of
-the objects across multiple `ObjectStore` instances but users of the LLVMCAS
-should not depend on this behavior.
+than required. For example, the underlying value inside `ObjectRef` can be
+a unique identity of the object across multiple `ObjectStore` instances,
+but comparing such `ObjectRef`s from different `ObjectStore`s is still illegal.
For CAS library implementers, there is also an `ObjectHandle` class that
is an internal representation of a loaded CASObject reference.
>From 938db4a4f102f9d37c2ed0774e8293e38a5ca4df Mon Sep 17 00:00:00 2001
From: Steven Wu <stevenwu at apple.com>
Date: Fri, 1 Nov 2024 13:24:43 -0700
Subject: [PATCH 4/5] Address review feedback
Created using spr 1.3.5
---
llvm/include/llvm/CAS/ObjectStore.h | 4 ----
llvm/lib/CAS/ObjectStore.cpp | 1 -
llvm/unittests/CAS/ObjectStoreTest.cpp | 9 ++++-----
3 files changed, 4 insertions(+), 10 deletions(-)
diff --git a/llvm/include/llvm/CAS/ObjectStore.h b/llvm/include/llvm/CAS/ObjectStore.h
index b4720c7edc154..b562ea9815c34 100644
--- a/llvm/include/llvm/CAS/ObjectStore.h
+++ b/llvm/include/llvm/CAS/ObjectStore.h
@@ -217,10 +217,6 @@ class ObjectStore {
/// Validate the whole node tree.
Error validateTree(ObjectRef Ref);
- /// Print the ObjectStore internals for debugging purpose.
- virtual void print(raw_ostream &) const {}
- void dump() const;
-
/// Get CASContext
const CASContext &getContext() const { return Context; }
diff --git a/llvm/lib/CAS/ObjectStore.cpp b/llvm/lib/CAS/ObjectStore.cpp
index 179621cfa296c..8c7a969232f15 100644
--- a/llvm/lib/CAS/ObjectStore.cpp
+++ b/llvm/lib/CAS/ObjectStore.cpp
@@ -21,7 +21,6 @@ void CASContext::anchor() {}
void ObjectStore::anchor() {}
LLVM_DUMP_METHOD void CASID::dump() const { print(dbgs()); }
-LLVM_DUMP_METHOD void ObjectStore::dump() const { print(dbgs()); }
LLVM_DUMP_METHOD void ObjectRef::dump() const { print(dbgs()); }
LLVM_DUMP_METHOD void ObjectHandle::dump() const { print(dbgs()); }
diff --git a/llvm/unittests/CAS/ObjectStoreTest.cpp b/llvm/unittests/CAS/ObjectStoreTest.cpp
index 0d94731330b1d..1a7446d322c00 100644
--- a/llvm/unittests/CAS/ObjectStoreTest.cpp
+++ b/llvm/unittests/CAS/ObjectStoreTest.cpp
@@ -183,11 +183,10 @@ multiline text multiline text multiline text multiline text multiline text)",
// Check basic printing of IDs.
IDs.push_back(CAS1->getID(*Node));
- EXPECT_EQ(IDs.back().toString(), IDs.back().toString());
- EXPECT_EQ(Nodes.front(), Nodes.front());
- EXPECT_EQ(Nodes.back(), Nodes.back());
- EXPECT_EQ(IDs.front(), IDs.front());
- EXPECT_EQ(IDs.back(), IDs.back());
+ auto ID = CAS1->getID(Nodes.back());
+ EXPECT_EQ(ID.toString(), IDs.back().toString());
+ EXPECT_EQ(*Node, Nodes.back());
+ EXPECT_EQ(ID, IDs.back());
if (Nodes.size() <= 1)
continue;
EXPECT_NE(Nodes.front(), Nodes.back());
>From f3b0eecc1f892691cb917a54308104a99c235f28 Mon Sep 17 00:00:00 2001
From: Steven Wu <stevenwu at apple.com>
Date: Fri, 8 Aug 2025 09:03:12 -0700
Subject: [PATCH 5/5] address more review feedback
Created using spr 1.3.6
---
llvm/docs/ContentAddressableStorage.md | 6 +++---
llvm/include/llvm/CAS/BuiltinCASContext.h | 1 +
llvm/include/llvm/CAS/BuiltinObjectHasher.h | 1 +
llvm/lib/CAS/BuiltinCAS.h | 1 +
llvm/lib/CAS/InMemoryCAS.cpp | 1 +
5 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/llvm/docs/ContentAddressableStorage.md b/llvm/docs/ContentAddressableStorage.md
index deb11101460a6..d252f36491c51 100644
--- a/llvm/docs/ContentAddressableStorage.md
+++ b/llvm/docs/ContentAddressableStorage.md
@@ -2,11 +2,11 @@
## Introduction to CAS
-Content Addressable Storage, or `CAS`, is a storage system where it assigns
+Content Addressable Storage, or `CAS`, is a storage system that assigns
unique addresses to the data stored. It is very useful for data deduplicaton
and creating unique identifiers.
-Unlike other kinds of storage system like a file system, CAS is immutable. It
+Unlike other kinds of storage systems like file systems, CAS is immutable. It
is more reliable to model a computation by representing the inputs and outputs
of the computation using objects stored in CAS.
@@ -24,7 +24,7 @@ struct CASObject {
}
```
-With this abstraction, it is possible to compose CASObjects into a DAG that is
+With this abstraction, it is possible to compose `CASObject`s into a DAG that is
capable of representing complicated data structures, while still allowing data
deduplication. Note you can compare two DAGs by just comparing the CASObject
hash of two root nodes.
diff --git a/llvm/include/llvm/CAS/BuiltinCASContext.h b/llvm/include/llvm/CAS/BuiltinCASContext.h
index ebc4ca8bd1f2e..e9a226a423e5a 100644
--- a/llvm/include/llvm/CAS/BuiltinCASContext.h
+++ b/llvm/include/llvm/CAS/BuiltinCASContext.h
@@ -54,6 +54,7 @@ namespace llvm::cas::builtin {
using HasherT = BLAKE3;
using HashType = decltype(HasherT::hash(std::declval<ArrayRef<uint8_t> &>()));
+/// CASContext for LLVM builtin CAS using BLAKE3 hash type.
class BuiltinCASContext : public CASContext {
void printIDImpl(raw_ostream &OS, const CASID &ID) const final;
void anchor() override;
diff --git a/llvm/include/llvm/CAS/BuiltinObjectHasher.h b/llvm/include/llvm/CAS/BuiltinObjectHasher.h
index 22e556c5669b5..c9b004216f796 100644
--- a/llvm/include/llvm/CAS/BuiltinObjectHasher.h
+++ b/llvm/include/llvm/CAS/BuiltinObjectHasher.h
@@ -14,6 +14,7 @@
namespace llvm::cas {
+/// Hasher for stored objects in builtin CAS.
template <class HasherT> class BuiltinObjectHasher {
public:
using HashT = decltype(HasherT::hash(std::declval<ArrayRef<uint8_t> &>()));
diff --git a/llvm/lib/CAS/BuiltinCAS.h b/llvm/lib/CAS/BuiltinCAS.h
index 1a4f640e4e2da..19e45c86c5ff9 100644
--- a/llvm/lib/CAS/BuiltinCAS.h
+++ b/llvm/lib/CAS/BuiltinCAS.h
@@ -17,6 +17,7 @@ namespace llvm::cas {
class ActionCache;
namespace builtin {
+/// Common base class for builtin CAS implementations using the same CASContext.
class BuiltinCAS : public ObjectStore {
public:
BuiltinCAS() : ObjectStore(BuiltinCASContext::getDefaultContext()) {}
diff --git a/llvm/lib/CAS/InMemoryCAS.cpp b/llvm/lib/CAS/InMemoryCAS.cpp
index f0305e0d4eafa..6a586a1bd2a49 100644
--- a/llvm/lib/CAS/InMemoryCAS.cpp
+++ b/llvm/lib/CAS/InMemoryCAS.cpp
@@ -33,6 +33,7 @@ using InMemoryIndexT =
/// their hash.
using InMemoryIndexValueT = InMemoryIndexT::value_type;
+/// Builtin in-memory CAS that stores CAS objects in memory.
class InMemoryObject {
public:
enum class Kind {