[PATCH] D133716: [CAS] Add LLVMCAS library with InMemoryCAS implementation

Thu Dec 8 16:08:04 PST 2022

akyrtzi added inline comments.

================
Comment at: llvm/include/llvm/CAS/CASReference.h:162
+/// data and references.
+class ObjectHandle : public ReferenceBase {
+public:
----------------
dexonsmith wrote:
> akyrtzi wrote:
> > dexonsmith wrote:
> > > akyrtzi wrote:
> > > > dexonsmith wrote:
> > > > > steven_wu wrote:
> > > > > > akyrtzi wrote:
> > > > > > > benlangmuir wrote:
> > > > > > > > dexonsmith wrote:
> > > > > > > > > dexonsmith wrote:
> > > > > > > > > > akyrtzi wrote:
> > > > > > > > > > > steven_wu wrote:
> > > > > > > > > > > > dexonsmith wrote:
> > > > > > > > > > > > > dexonsmith wrote:
> > > > > > > > > > > > > > benlangmuir wrote:
> > > > > > > > > > > > > > > steven_wu wrote:
> > > > > > > > > > > > > > > > dblaikie wrote:
> > > > > > > > > > > > > > > > > steven_wu wrote:
> > > > > > > > > > > > > > > > > > dblaikie wrote:
> > > > > > > > > > > > > > > > > > > Since `Handle` and `Ref` aren't especially self documenting (while reading the docs for `Ref` I wondered if it should be called `Handle` - until I got to the bit where it explained that that exists too/separately) - is there something that'd be more explicit about the difference between these? (like, I wonder if `ObjectHandle` could be `Object`? - not sure that'd make `ObjectRef` more obvious, though... )
> > > > > > > > > > > > > > > > > > `ObjectRef` is just a reference, and you can't access anything in the object it points to, until you load the object using ObjectRef, which will turn into an `ObjectHandle`. The load can have latency or can fail.
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > Maybe the comments in ObjectStore.h is a better explanation with a bigger picture? Or should we put all the documentation into one place so it is easier to read?
> > > > > > > > > > > > > > > > > Still feels like the names could be more self-descriptive, even with/independent of documentation improvements. `Ref` and `Handle` both tend to connote a thing you can use to inspect the thing being referred to or handled. (if anything, I guess I'd actually expect a Handle to be the inaccessible version and the Ref to be the accessible version (thinking StringRef, ArrayRef, etc - where the data is directly accessible through that abstraction, whereas handles sometimes you have to pass back into some other object/API to then get the data (like a file handle/file descriptor)))
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Seems like ObjectRef is maybe more suitably called ObjectID?
> > > > > > > > > > > > > > > > I don't have strong opinion on names. Maybe @dexonsmith @benlangmuir @akyrtzi can also provide some feedbacks for names?
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > In my opinion, `ObjectRef` is named that way because it is like a reference type (just an opaque pointer) and you need to "dereference" it to access underlying data, just that the `dereference` might fail in CAS. The difference between `ObjectRef` and `ObjectHandle` is mostly providing the flexibility for a remote CAS so there is no need to traffic all the data if you only dealing with refs.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Also `ObjectRef` is the public type we encourage to use, just like other `Ref` types like `StringRef` and `ArrayRef`. It is cheap and allows quick fetch of the underlying data (at least for the builtin CAS we provided). It doesn't really contains ID, and you can't compare `Ref` from different ObjectStore.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > I thought about rename `CASID` to `ObjectID` to keep the name consistent, but there is no other more benefit. I am up for renaming types to make it more self-descriptive but I would like to reach an agreement before doing so because we have lots of downstream code needs to upstream and I would like to avoid repeated renames. 
> > > > > > > > > > > > > > > I agree with Steven; I think `ObjectRef` deserves the good name here, since that's the one that clients of an `ObjectStore` will work with.  `ObjectHandle` is only used by //implementors// of an `ObjectStore`.  I agree that having both `Ref` and `Handle` is confusing - maybe we should rename Handle to something more verbose and descriptive like `LoadedObjectRef`.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Seems like ObjectRef is maybe more suitably called ObjectID?
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > I think this is confusing with `CASID`, which really is an ID.  I don't have a strong opinion on whether `CASID` should be renamed to `ObjectID` or not for consistency.
> > > > > > > > > > > > > > In my mind, the name `ObjectID` was "taken", since I originally thought we'd do a `s/CASID/ObjectID/g` at some point (maybe this patch / before upstreaming is the right time?). Maybe it would be even better to do `s/CASID/std::string/`.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > (Historical note: an earlier design had neither ObjectRef nor ObjectHandle, just CASID. CASIDs were tied to a specific CAS instance, had odd lifetimes, and tried to serve all three purposes. We figured out that this was awkward, especially for distributed/remote CAS implementations.)
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > `ObjectRef` is a reference to an object (possibly an ID, possible some internal CAS pointer). As @steven_wu points out, having an `ObjectRef` doesn't necessarily mean you know anything about the content of an object.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > `ObjectHandle` promises that you can access the object.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > I think to explain the difference, it helps to have a specific example. If you have a few objects:
> > > > > > > > > > > > > > ```
> > > > > > > > > > > > > > object-id: 0
> > > > > > > > > > > > > > data: "some small data"
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > object-id: 1
> > > > > > > > > > > > > > data: [4GB of data]
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > object-id: 2
> > > > > > > > > > > > > > data: "a header"
> > > > > > > > > > > > > > refs: 0, 1
> > > > > > > > > > > > > > ```
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > If you have an `ObjectHandle` for object 2, then you can see that the data is "a header" and you can get the `ObjectRef`s for the objects it references: 0 and 1.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Those `ObjectRef`s allow you to talk about objects 0 and 1 without necessarily loading / downloading them, which is important because object 1 is 4GB big. You can create a new object that references them, and you can get a serialized reference (CASID/ObjectID/std::string) for referring to it in other contexts.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Or, you can "load" an `ObjectRef`, get an `ObjectHandle` for that object, and look at its content/refs. If you have a distributed CAS, or the object is big, or [etc.], this action might have significant latency. I imagine we'd eventually want to add an API to allow asynchronous loading.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > It's a good point that "ref" here bears no relation to the "ref" in `ArrayRef`. That makes it confusing. It is identifier-like, even if it doesn't know how to serialize itself into a string (you need the CAS instance to string-ify it for you).
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > At a high level, we have three concepts:
> > > > > > > > > > > > > > - a serialized, context-free identifier (currently CASID)
> > > > > > > > > > > > > > - an opaque, context-sensitive reference to an object known to "exist somewhere" (currently ObjectRef)
> > > > > > > > > > > > > > - a context-sensitive reference to an object known to "be local/loaded" (currently ObjectHandle)
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Maybe the following rename could work:
> > > > > > > > > > > > > > 1. Delete CASID and replace its usage with `std::string`
> > > > > > > > > > > > > > 2. Rename ObjectRef to ObjectID
> > > > > > > > > > > > > > 3. Maybe: Rename ObjectHandle to ObjectRef
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > I don't have a strong opinion about whether that results in a better world. The "current" names are kind of etched into my brain so it's hard to evaluate.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > @steven_wu and @benlangmuir, WDYT? Is there something I'm missing/forgetting that makes this a bad idea?
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > @dblaikie, WDYT? Maybe you have other name ideas after the above explanation?
> > > > > > > > > > > > > @benlangmuir, just seeing your reply. I'd have expected clients to work with both `ObjectRef` and `ObjectHandle`; but you have much more experience than I do. (I guess the `ObjectHandle` is only indirectly useful to clients. For any serious work, you want to create a proxy that wraps an `ObjectHandle` and navigates the content in a structured way.)
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Agreed that `ObjectRef` ought to have a good / easy-to-type name.
> > > > > > > > > > > > @dexonsmith Correct, we clean up the interface that no public methods from `ObjectStore` return `ObjectHandle` anymore so users access data in object through ObjectProxy. One less data type to deal with for CAS users and ObjectProxy has slightly better interfaces to type as well.
> > > > > > > > > > > > 
> > > > > > > > > > > > `ObjectHandle` is type for CAS implementor only now to represent loaded object.
> > > > > > > > > > > > I'd have expected clients to work with both `ObjectRef` and `ObjectHandle`
> > > > > > > > > > > 
> > > > > > > > > > > I don't think `ObjectHandle` has any value for clients, clients will be using `ObjectProxy`.
> > > > > > > > > > > 
> > > > > > > > > > > > an opaque, context-sensitive reference to an object known to "exist somewhere" (currently ObjectRef)
> > > > > > > > > > > 
> > > > > > > > > > > Something to highlight about this: For a "remote CAS" implementation you practically don't know that "an object is known to exist" until you try to load it. From that respect there is essentially no functional difference between `CASID` and `ObjectRef` in such a context.
> > > > > > > > > > > 
> > > > > > > > > > Ah, that sounds great (I apologize for not reading the patch itself). Agreed with @benlangmuir that LoadedObjectHandle seems pretty reasonable if clients never have to use it.
> > > > > > > > > > 
> > > > > > > > > > Does my understanding the roles still stand? (Maybe that example could/should be used somewhere in the docs to clarify.)
> > > > > > > > > > 
> > > > > > > > > > Also:
> > > > > > > > > > - Do we still need CASID? Why not just use `std::string`?
> > > > > > > > > > - If we don't need CASID and can delete it, WDYT of renaming ObjectRef to ObjectID? (It could be thought of as an "opaque" identifier.)
> > > > > > > > > > - Another idea: rename `ObjectRef` to `ObjectPtr`?
> > > > > > > > > > Something to highlight about this: For a "remote CAS" implementation you practically don't know that "an object is known to exist" until you try to load it. From that respect there is essentially no functional difference between CASID and ObjectRef in such a context.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Ah, is that because a remote CAS may have garbage-collected the object? So, then, `ObjectRef` is just an object we know how to point at?
> > > > > > > > > 
> > > > > > > > > (Or do you just mean that loading could fail because the remote CAS goes down?)
> > > > > > > > > Ah, is that because a remote CAS may have garbage-collected the object? So, then, ObjectRef is just an object we know how to point at?
> > > > > > > > > (Or do you just mean that loading could fail because the remote CAS goes down?)
> > > > > > > > 
> > > > > > > > We do not check whether an object exists at all when creating the ObjectRef for a remote CAS: it's a wasted round trip over the network since you can't guarantee the object still exists later, and don't get a performance win for knowing it existed in the past.  This is different with the local in-memory and on-disk CASes, because they can check for existence cheaply and provide a faster ref implementation once we know the object exists.
> > > > > > > > 
> > > > > > > > > - a serialized, context-free identifier (currently CASID)
> > > > > > > > 
> > > > > > > > There are two levels here:
> > > > > > > > - a serialized, context-free identifier (std::string, e.g. "llvmcas://<serialized hash>")
> > > > > > > > - a context-sensitive identifier hash containing raw hash bytes (CASID) -- you need a CASContext to use it
> > > > > > > > 
> > > > > > > > > an opaque, context-sensitive reference to an object known to "exist somewhere" (currently ObjectRef)
> > > > > > > > 
> > > > > > > > Not known to exist anywhere, but otherwise yes.
> > > > > > > > 
> > > > > > > > > a context-sensitive reference to an object known to "be local/loaded" (currently ObjectHandle)
> > > > > > > > 
> > > > > > > > Correct.
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Also:
> > > > > > > > > - Do we still need CASID? Why not just use `std::string`?
> > > > > > > > 
> > > > > > > > So the question here is whether CASID is pulling its weight as a client-visible type.  It should be more efficient than the serialized string, since you have the hash bytes immediately available rather than parsing them out; it can be smaller for the same reason.  I don't know if have measured this -- @steven_wu, @akyrtzi any thoughts?
> > > > > > > > 
> > > > > > > > > - If we don't need CASID and can delete it, WDYT of renaming ObjectRef to ObjectID? (It could be thought of as an "opaque" identifier.)
> > > > > > > > 
> > > > > > > > If we can drop CASID, I'm fine with this.
> > > > > > > > 
> > > > > > > > > - Another idea: rename `ObjectRef` to `ObjectPtr`?
> > > > > > > > 
> > > > > > > > I find Ref clearer than Ptr for what it does.
> > > > > > > I like the idea of dropping `CASID` entirely. I'm also fine with renaming `ObjectRef` to `ObjectID`.
> > > > > > > I would also suggest that we change
> > > > > > > ```
> > > > > > > virtual Expected<ObjectHandle> load(ObjectRef Ref)
> > > > > > > ```
> > > > > > > to
> > > > > > > ```
> > > > > > > virtual Expected<Optional<ObjectHandle>> load(ObjectRef Ref)
> > > > > > > ```
> > > > > > > 
> > > > > > > The optional indicates whether the object existed or not, and we reserve errors for catastrophic failures (e.g. "network went down"). The client could choose to turn the "object doesn't exist" `None` into an error (or not, it depends on the context) but this case should be distinguishable from `ObjectStore`'s API.
> > > > > > I tried to get rid of `CASID` but there are some desirable features that decide to keep it for now. For example, it distinguishes itself from a printable string.
> > > > > > 
> > > > > > For builtinCAS, you have a printable string that is something like: `llvmcas://abcdefg...`, while CASID is the raw hash value (0xabcdefg...) + the context ("llvm.builtin.cas.v1[BLAKE3]"). If we get rid of CASID, you will either have a std::string representation that is `llvmcas://` then you need to parse it every time you want access to raw hash, or you decide to have `std::string` to store the raw hash value, then: 1. you can't store a context anywhere. 2. It is confusing if a function is taking a printable string or a raw hash value, since both of them are std::string.
> > > > > What is "every time"? I.e., why do you need to get the hash from the CASID? Could/should some/many places that currently use CASID use an ObjectRef instead?
> > > > > 
> > > > > IOW, maybe some current uses of CASID should switch to ObjectRef or ObjectProxy, and the rest can then be converted to std::string. But I'm not sure. Only asking the question to ensure that you've considered it (vs. momentum of code written with CASID before ObjectRef existed).
> > > > > then you need to parse it every time you want access to raw hash
> > > > 
> > > > Why do I need to parse the string to get the raw hash, I should be able to get the raw hash from the `ObjectRef`/`ObjectID`, right?
> > > > Once I parse a "llvmcas://" string into an `ObjectRef`/`ObjectID` there should be no need to keep the string around anymore.
> > > Paging in old memories, I think where CASID remains useful is if you're talking about the same object in two different instances of a CAS (maybe one instance is in-memory, and the other is on-disk or remote), and both have the same CASContext. IIRC, you share a CASContext iff the std::strings mean the same thing. E.g., should be true for any CAS that has a std::string ID starting with `llvmcas://`, as long as both instances use the same hash function (currently always BLAKE3, right?). (Unless this changed...)
> > > 
> > > If a client is doing a lot of this, having an optimized representation of the hash/identifier could be useful; it avoids repeatedly serializing/parsing/validating std::strings just to communicate a hash from one CAS instance to another. (In this scenario, it's also maybe nice to avoid a malloc/dealloc, but the hash will always overflow std::string's small storage... that was the motivation to avoid std::string originally.)
> > > 
> > > I imagine we'll want to do that sort of thing at some point, but not sure how urgent it is (maybe it's already needed?), or whether the optimization is really necessary (maybe the overhead is hidden by other things that are orders of magnitude more expensive?)...
> > > 
> > > Outside of that multi-CAS-instance-same-CAScontext scenario, I'm not sure I see a reason to avoid using ObjectRef (or std::string).
> > > 
> > > Another thought: if we don't drop CASID, it could be renamed to ObjectHash.
> > > - ObjectHash: the raw hash of the object, shareable within a CASContext (né CASID)
> > > - ObjectID: an opaque identifier for a specific CAS instance (né ObjectRef)
> > > - LoadedObjectHandle: implementation detail, points at "loaded" object in a CAS instance (né ObjectHandle)
> > > - ObjectProxy: LoadedObjectHandle+ObjectStore+nice APIs.
> > > 
> > > it avoids repeatedly serializing/parsing/validating std::strings just to communicate a hash from one CAS instance to another.
> > 
> > Hmm, it's still unclear to me why you need to traffic "llvmcas://" strings for that; the `ObjectRef/ID` that you got from one instance should be able to provide the hash bytes for the other. You don't need to re-validate the hash because the instances are context-compatible.
> IIRC, CASID is the only container for the raw hash recognized by ObjectStore. If you take it away, and you want to traffic in raw hashes you need to invent a new container for them (or just let people use ArrayRef). 
> 
> If you think it’s useful to traffic in raw hashes, then I suggest keeping CASID/ObjectHash, since it has some error checking and dumping built-in. 
> or just let people use ArrayRef

I'm personally fine with making it that the way you get an `ObjectID` is either via parsing a string identifier or via passing the `ArrayRef` hash bytes (for example, say I stored the hash as raw bytes in a file and now I want to get an `ObjectID` out of these bytes).

Like a string identifier from a command-line option, the raw hash bytes will be a transient input that I will use to get an `ObjectID` in order to proceed with the rest of the work, I shouldn't need to be "trafficking" both hash bytes and ObjectIDs, at the same time.

In general, as a client I'd like to have only 3 concepts that I need to be concerned about:

1. A string identifier for a CAS object that can be parsed and printed (e.g. the "llvmcas://" strings)
2. A type I can get as a valid ID after parsing a string identifier or after passing the hash bytes for the object (e.g. `ObjectID`)
3. A type that contains the loaded data for the object. (e.g. `ObjectProxy`)

An `ObjectHash` type could be useful for the implementations but I find it an unnecessary concept for the users of `ObjectStore` API.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D133716/new/

https://reviews.llvm.org/D133716