[llvm] [CAS] Add LLVMCAS library with InMemoryCAS implementation (PR #114096)

Steven Wu via llvm-commits llvm-commits at lists.llvm.org
Fri Aug 8 16:27:57 PDT 2025


================
@@ -0,0 +1,298 @@
+//===- llvm/CAS/ObjectStore.h -----------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_CAS_OBJECTSTORE_H
+#define LLVM_CAS_OBJECTSTORE_H
+
+#include "llvm/ADT/StringRef.h"
+#include "llvm/CAS/CASID.h"
+#include "llvm/CAS/CASReference.h"
+#include "llvm/Support/Error.h"
+#include "llvm/Support/FileSystem.h"
+#include <cstddef>
+
+namespace llvm {
+
+class MemoryBuffer;
+template <typename T> class unique_function;
+
+namespace cas {
+
+class ObjectStore;
+class ObjectProxy;
+
+/// Content-addressable storage for objects.
+///
+/// Conceptually, objects are stored in a "unique set".
+///
+/// - Objects are immutable ("value objects") that are defined by their
+///   content. They are implicitly deduplicated by content.
+/// - Each object has a unique identifier (UID) that's derived from its content,
+///   called a \a CASID.
+///     - This UID is a fixed-size (strong) hash of the transitive content of a
+///       CAS object.
+///     - It's comparable between any two CAS instances that have the same \a
+///       CASIDContext::getHashSchemaIdentifier().
+///     - The UID can be printed (e.g., \a CASID::toString()) and it can parsed
+///       by the same or a different CAS instance with \a
+///       ObjectStore::parseID().
+/// - An object can be looked up by content or by UID.
+///     - \a store() is "get-or-create"  methods, writing an object if it
+///       doesn't exist yet, and return a ref to it in any case.
+///     - \a loadObject(const CASID&) looks up an object by its UID.
+/// - Objects can reference other objects, forming an arbitrary DAG.
+///
+/// The \a ObjectStore interface has a few ways of referencing objects:
+///
+/// - \a ObjectRef encapsulates a reference to something in the CAS. It is an
+///   opaque type that references an object inside a specific CAS. It is
+///   implementation defined if the underlying object exists or not for an
+///   ObjectRef, and it can used to speed up CAS lookup as an implementation
+///   detail. However, you don't know anything about the underlying objects.
+///   "Loading" the object is a separate step that may not have happened
+///   yet, and which can fail (e.g. due to filesystem corruption) or introduce
+///   latency (if downloading from a remote store).
+/// - \a ObjectHandle encapulates a *loaded* object in the CAS. You need one of
+///   these to inspect the content of an object: to look at its stored
+///   data and references. This is internal to CAS implementation and not
+///   availble from CAS public APIs.
+/// - \a CASID: the UID for an object in the CAS, obtained through \a
+///   ObjectStore::getID() or \a ObjectStore::parseID(). This is a valid CAS
+///   identifier, but may reference an object that is unknown to this CAS
+///   instance.
+/// - \a ObjectProxy pairs an ObjectHandle (subclass) with a ObjectStore, and
+///   wraps access APIs to avoid having to pass extra parameters. It is the
+///   object used for accessing underlying data and refs by CAS users.
+///
+/// Both ObjectRef and ObjectHandle are lightweight, wrapping a `uint64_t` and
+/// are only valid with the associated ObjectStore instance.
+///
+/// There are a few options for accessing content of objects, with different
+/// lifetime tradeoffs:
+///
+/// - \a getData() accesses data without exposing lifetime at all.
+/// - \a getMemoryBuffer() returns a \a MemoryBuffer whose lifetime
+///   is independent of the CAS (it can live longer).
+/// - \a getDataString() return StringRef with lifetime is guaranteed to last as
+///   long as \a ObjectStore.
+/// - \a readRef() and \a forEachRef() iterate through the references in an
+///   object. There is no lifetime assumption.
+class ObjectStore {
+  friend class ObjectProxy;
+  void anchor();
+
+public:
+  /// Get a \p CASID from a \p ID, which should have been generated by \a
+  /// CASID::print(). This succeeds as long as \a validateID() would pass. The
+  /// object may be unknown to this CAS instance.
+  ///
+  /// TODO: Remove, and update callers to use \a validateID() or \a
+  /// extractHashFromID().
+  virtual Expected<CASID> parseID(StringRef ID) = 0;
+
+  /// Store object into ObjectStore.
+  virtual Expected<ObjectRef> store(ArrayRef<ObjectRef> Refs,
+                                    ArrayRef<char> Data) = 0;
+  /// Get an ID for \p Ref.
+  virtual CASID getID(ObjectRef Ref) const = 0;
+
+  /// Get an existing reference to the object called \p ID.
+  ///
+  /// Returns \c None if the object is not stored in this CAS.
+  virtual std::optional<ObjectRef> getReference(const CASID &ID) const = 0;
+
+  /// \returns true if the object is directly available from the local CAS, for
+  /// implementations that have this kind of distinction.
+  virtual Expected<bool> isMaterialized(ObjectRef Ref) const = 0;
+
+  /// Validate the underlying object referred by CASID.
+  virtual Error validate(const CASID &ID) = 0;
+
+protected:
+  /// Load the object referenced by \p Ref.
+  ///
+  /// Errors if the object cannot be loaded.
+  /// \returns \c std::nullopt if the object is missing from the CAS.
+  virtual Expected<std::optional<ObjectHandle>> loadIfExists(ObjectRef Ref) = 0;
+
+  /// Like \c loadIfExists but returns an error if the object is missing.
+  Expected<ObjectHandle> load(ObjectRef Ref);
+
+  /// Get the size of some data.
+  virtual uint64_t getDataSize(ObjectHandle Node) const = 0;
+
+  /// Methods for handling objects.
+  virtual Error forEachRef(ObjectHandle Node,
+                           function_ref<Error(ObjectRef)> Callback) const = 0;
+  virtual ObjectRef readRef(ObjectHandle Node, size_t I) const = 0;
+  virtual size_t getNumRefs(ObjectHandle Node) const = 0;
+  virtual ArrayRef<char> getData(ObjectHandle Node,
+                                 bool RequiresNullTerminator = false) const = 0;
+
+  /// Get ObjectRef from open file.
+  virtual Expected<ObjectRef>
+  storeFromOpenFileImpl(sys::fs::file_t FD,
+                        std::optional<sys::fs::file_status> Status);
+
+  /// Get a lifetime-extended StringRef pointing at \p Data.
+  ///
+  /// Depending on the CAS implementation, this may involve in-memory storage
+  /// overhead.
+  StringRef getDataString(ObjectHandle Node) {
+    return toStringRef(getData(Node));
+  }
+
+  /// Get a lifetime-extended MemoryBuffer pointing at \p Data.
+  ///
+  /// Depending on the CAS implementation, this may involve in-memory storage
+  /// overhead.
+  std::unique_ptr<MemoryBuffer>
+  getMemoryBuffer(ObjectHandle Node, StringRef Name = "",
+                  bool RequiresNullTerminator = true);
+
+  /// Read all the refs from object in a SmallVector.
+  virtual void readRefs(ObjectHandle Node,
+                        SmallVectorImpl<ObjectRef> &Refs) const;
+
+  /// Allow ObjectStore implementations to create internal handles.
+#define MAKE_CAS_HANDLE_CONSTRUCTOR(HandleKind)                                \
+  HandleKind make##HandleKind(uint64_t InternalRef) const {                    \
+    return HandleKind(*this, InternalRef);                                     \
+  }
+  MAKE_CAS_HANDLE_CONSTRUCTOR(ObjectHandle)
+  MAKE_CAS_HANDLE_CONSTRUCTOR(ObjectRef)
+#undef MAKE_CAS_HANDLE_CONSTRUCTOR
+
+public:
+  /// Helper functions to store object and returns a ObjectProxy.
+  Expected<ObjectProxy> createProxy(ArrayRef<ObjectRef> Refs, StringRef Data);
+
+  /// Store object from StringRef.
+  Expected<ObjectRef> storeFromString(ArrayRef<ObjectRef> Refs,
+                                      StringRef String) {
+    return store(Refs, arrayRefFromStringRef<char>(String));
+  }
+
+  /// Default implementation reads \p FD and calls \a storeNode(). Does not
+  /// take ownership of \p FD; the caller is responsible for closing it.
+  ///
+  /// If \p Status is sent in it is to be treated as a hint. Implementations
+  /// must protect against the file size potentially growing after the status
+  /// was taken (i.e., they cannot assume that an mmap will be null-terminated
+  /// where \p Status implies).
+  ///
+  /// Returns the \a CASID and the size of the file.
+  Expected<ObjectRef>
+  storeFromOpenFile(sys::fs::file_t FD,
+                    std::optional<sys::fs::file_status> Status = std::nullopt) {
+    return storeFromOpenFileImpl(FD, Status);
+  }
+
+  static Error createUnknownObjectError(const CASID &ID);
+
+  /// Create ObjectProxy from CASID. If the object doesn't exist, get an error.
+  Expected<ObjectProxy> getProxy(const CASID &ID);
+  /// Create ObjectProxy from ObjectRef. If the object can't be loaded, get an
+  /// error.
+  Expected<ObjectProxy> getProxy(ObjectRef Ref);
+
+  /// \returns \c std::nullopt if the object is missing from the CAS.
+  Expected<std::optional<ObjectProxy>> getProxyIfExists(ObjectRef Ref);
+
+  /// Read the data from \p Data into \p OS.
+  uint64_t readData(ObjectHandle Node, raw_ostream &OS, uint64_t Offset = 0,
+                    uint64_t MaxBytes = -1ULL) const {
+    ArrayRef<char> Data = getData(Node);
+    assert(Offset < Data.size() && "Expected valid offset");
+    Data = Data.drop_front(Offset).take_front(MaxBytes);
+    OS << toStringRef(Data);
+    return Data.size();
+  }
+
+  /// Validate the whole node tree.
+  Error validateTree(ObjectRef Ref);
+
+  /// Get CASContext
+  const CASContext &getContext() const { return Context; }
+
+  virtual ~ObjectStore() = default;
+
+protected:
+  ObjectStore(const CASContext &Context) : Context(Context) {}
+
+private:
+  const CASContext &Context;
+};
+
+/// Reference to an abstract hierarchical node, with data and references.
+/// Reference is passed by value and is expected to be valid as long as the \a
+/// ObjectStore is.
+class ObjectProxy {
+public:
+  const ObjectStore &getCAS() const { return *CAS; }
+  ObjectStore &getCAS() { return *CAS; }
----------------
cachemeifyoucan wrote:

There is a overloading problem if I drop `const` but I guess returning a `const ObjectStore&` is not actually useful, so unify to one method `ObjectStore &getCAS() const;`

https://github.com/llvm/llvm-project/pull/114096


More information about the llvm-commits mailing list