[PATCH] D133716: [CAS] Add LLVMCAS library with InMemoryCAS implementation

Thu Dec 8 11:19:35 PST 2022

steven_wu added inline comments.

================
Comment at: llvm/include/llvm/CAS/CASReference.h:162
+/// data and references.
+class ObjectHandle : public ReferenceBase {
+public:
----------------
dexonsmith wrote:
> dexonsmith wrote:
> > benlangmuir wrote:
> > > steven_wu wrote:
> > > > dblaikie wrote:
> > > > > steven_wu wrote:
> > > > > > dblaikie wrote:
> > > > > > > Since `Handle` and `Ref` aren't especially self documenting (while reading the docs for `Ref` I wondered if it should be called `Handle` - until I got to the bit where it explained that that exists too/separately) - is there something that'd be more explicit about the difference between these? (like, I wonder if `ObjectHandle` could be `Object`? - not sure that'd make `ObjectRef` more obvious, though... )
> > > > > > `ObjectRef` is just a reference, and you can't access anything in the object it points to, until you load the object using ObjectRef, which will turn into an `ObjectHandle`. The load can have latency or can fail.
> > > > > > 
> > > > > > Maybe the comments in ObjectStore.h is a better explanation with a bigger picture? Or should we put all the documentation into one place so it is easier to read?
> > > > > Still feels like the names could be more self-descriptive, even with/independent of documentation improvements. `Ref` and `Handle` both tend to connote a thing you can use to inspect the thing being referred to or handled. (if anything, I guess I'd actually expect a Handle to be the inaccessible version and the Ref to be the accessible version (thinking StringRef, ArrayRef, etc - where the data is directly accessible through that abstraction, whereas handles sometimes you have to pass back into some other object/API to then get the data (like a file handle/file descriptor)))
> > > > > 
> > > > > Seems like ObjectRef is maybe more suitably called ObjectID?
> > > > I don't have strong opinion on names. Maybe @dexonsmith @benlangmuir @akyrtzi can also provide some feedbacks for names?
> > > > 
> > > > In my opinion, `ObjectRef` is named that way because it is like a reference type (just an opaque pointer) and you need to "dereference" it to access underlying data, just that the `dereference` might fail in CAS. The difference between `ObjectRef` and `ObjectHandle` is mostly providing the flexibility for a remote CAS so there is no need to traffic all the data if you only dealing with refs.
> > > > 
> > > > Also `ObjectRef` is the public type we encourage to use, just like other `Ref` types like `StringRef` and `ArrayRef`. It is cheap and allows quick fetch of the underlying data (at least for the builtin CAS we provided). It doesn't really contains ID, and you can't compare `Ref` from different ObjectStore.
> > > > 
> > > > I thought about rename `CASID` to `ObjectID` to keep the name consistent, but there is no other more benefit. I am up for renaming types to make it more self-descriptive but I would like to reach an agreement before doing so because we have lots of downstream code needs to upstream and I would like to avoid repeated renames. 
> > > I agree with Steven; I think `ObjectRef` deserves the good name here, since that's the one that clients of an `ObjectStore` will work with.  `ObjectHandle` is only used by //implementors// of an `ObjectStore`.  I agree that having both `Ref` and `Handle` is confusing - maybe we should rename Handle to something more verbose and descriptive like `LoadedObjectRef`.
> > > 
> > > > Seems like ObjectRef is maybe more suitably called ObjectID?
> > > 
> > > I think this is confusing with `CASID`, which really is an ID.  I don't have a strong opinion on whether `CASID` should be renamed to `ObjectID` or not for consistency.
> > In my mind, the name `ObjectID` was "taken", since I originally thought we'd do a `s/CASID/ObjectID/g` at some point (maybe this patch / before upstreaming is the right time?). Maybe it would be even better to do `s/CASID/std::string/`.
> > 
> > (Historical note: an earlier design had neither ObjectRef nor ObjectHandle, just CASID. CASIDs were tied to a specific CAS instance, had odd lifetimes, and tried to serve all three purposes. We figured out that this was awkward, especially for distributed/remote CAS implementations.)
> > 
> > `ObjectRef` is a reference to an object (possibly an ID, possible some internal CAS pointer). As @steven_wu points out, having an `ObjectRef` doesn't necessarily mean you know anything about the content of an object.
> > 
> > `ObjectHandle` promises that you can access the object.
> > 
> > I think to explain the difference, it helps to have a specific example. If you have a few objects:
> > ```
> > object-id: 0
> > data: "some small data"
> > 
> > object-id: 1
> > data: [4GB of data]
> > 
> > object-id: 2
> > data: "a header"
> > refs: 0, 1
> > ```
> > 
> > If you have an `ObjectHandle` for object 2, then you can see that the data is "a header" and you can get the `ObjectRef`s for the objects it references: 0 and 1.
> > 
> > Those `ObjectRef`s allow you to talk about objects 0 and 1 without necessarily loading / downloading them, which is important because object 1 is 4GB big. You can create a new object that references them, and you can get a serialized reference (CASID/ObjectID/std::string) for referring to it in other contexts.
> > 
> > Or, you can "load" an `ObjectRef`, get an `ObjectHandle` for that object, and look at its content/refs. If you have a distributed CAS, or the object is big, or [etc.], this action might have significant latency. I imagine we'd eventually want to add an API to allow asynchronous loading.
> > 
> > It's a good point that "ref" here bears no relation to the "ref" in `ArrayRef`. That makes it confusing. It is identifier-like, even if it doesn't know how to serialize itself into a string (you need the CAS instance to string-ify it for you).
> > 
> > At a high level, we have three concepts:
> > - a serialized, context-free identifier (currently CASID)
> > - an opaque, context-sensitive reference to an object known to "exist somewhere" (currently ObjectRef)
> > - a context-sensitive reference to an object known to "be local/loaded" (currently ObjectHandle)
> > 
> > Maybe the following rename could work:
> > 1. Delete CASID and replace its usage with `std::string`
> > 2. Rename ObjectRef to ObjectID
> > 3. Maybe: Rename ObjectHandle to ObjectRef
> > 
> > I don't have a strong opinion about whether that results in a better world. The "current" names are kind of etched into my brain so it's hard to evaluate.
> > 
> > @steven_wu and @benlangmuir, WDYT? Is there something I'm missing/forgetting that makes this a bad idea?
> > 
> > @dblaikie, WDYT? Maybe you have other name ideas after the above explanation?
> @benlangmuir, just seeing your reply. I'd have expected clients to work with both `ObjectRef` and `ObjectHandle`; but you have much more experience than I do. (I guess the `ObjectHandle` is only indirectly useful to clients. For any serious work, you want to create a proxy that wraps an `ObjectHandle` and navigates the content in a structured way.)
> 
> Agreed that `ObjectRef` ought to have a good / easy-to-type name.
@dexonsmith Correct, we clean up the interface that no public methods from `ObjectStore` return `ObjectHandle` anymore so users access data in object through ObjectProxy. One less data type to deal with for CAS users and ObjectProxy has slightly better interfaces to type as well.

`ObjectHandle` is type for CAS implementor only now to represent loaded object.

================
Comment at: llvm/include/llvm/CAS/ObjectStore.h:59
+///   "Loading" the object is a separate step that may not have happened
+///   yet, and which can fail (e.g. due to filesystem corruption) or introduce
+///   latency (if downloading from a remote store).
----------------
benlangmuir wrote:
> steven_wu wrote:
> > dblaikie wrote:
> > > Presumably it can also fail if the object isn't in the given CAS? (maybe that's the more obvious/simpler to document example?)
> > This question is complicated. Maybe we should write down a spec for the CAS APIs? To this question, the current spec for when to lookup for a CAS object to make sure it exists is totally implementation defined.
> > 
> > For builtinCAS here, ObjectRef always points to existing object, unless the integrity of the builtinCAS is broken. For a remote CAS, you might actually want to have ObjectRef to be unverified to avoid a roundtrip to remote to validate a ObjectRef. 
> > 
> > Also to you previous point of split up public/private interface document, is it better if I create a separate docs in `llvm/docs` to explain different concepts from the views of: 1. CAS users 2. CAS implementors?
> > This question is complicated. Maybe we should write down a spec for the CAS APIs? To this question, the current spec for when to lookup for a CAS object to make sure it exists is totally implementation defined
> 
> This complication is only for the implementation though, right? Clients of the object store shouldn't assume the object exists in the CAS, and like @dblaikie said, loading can fail if it is not.  I don't think we want to promise that builtin CAS will always succeed when loading, since we might want to change that kind of detail later.
True. I will update the wording.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D133716/new/

https://reviews.llvm.org/D133716