[PATCH] D133716: [CAS] Add LLVMCAS library with InMemoryCAS implementation

Thu Dec 8 11:25:46 PST 2022

dexonsmith added inline comments.

================
Comment at: llvm/include/llvm/CAS/CASReference.h:162
+/// data and references.
+class ObjectHandle : public ReferenceBase {
+public:
----------------
akyrtzi wrote:
> steven_wu wrote:
> > dexonsmith wrote:
> > > dexonsmith wrote:
> > > > benlangmuir wrote:
> > > > > steven_wu wrote:
> > > > > > dblaikie wrote:
> > > > > > > steven_wu wrote:
> > > > > > > > dblaikie wrote:
> > > > > > > > > Since `Handle` and `Ref` aren't especially self documenting (while reading the docs for `Ref` I wondered if it should be called `Handle` - until I got to the bit where it explained that that exists too/separately) - is there something that'd be more explicit about the difference between these? (like, I wonder if `ObjectHandle` could be `Object`? - not sure that'd make `ObjectRef` more obvious, though... )
> > > > > > > > `ObjectRef` is just a reference, and you can't access anything in the object it points to, until you load the object using ObjectRef, which will turn into an `ObjectHandle`. The load can have latency or can fail.
> > > > > > > > 
> > > > > > > > Maybe the comments in ObjectStore.h is a better explanation with a bigger picture? Or should we put all the documentation into one place so it is easier to read?
> > > > > > > Still feels like the names could be more self-descriptive, even with/independent of documentation improvements. `Ref` and `Handle` both tend to connote a thing you can use to inspect the thing being referred to or handled. (if anything, I guess I'd actually expect a Handle to be the inaccessible version and the Ref to be the accessible version (thinking StringRef, ArrayRef, etc - where the data is directly accessible through that abstraction, whereas handles sometimes you have to pass back into some other object/API to then get the data (like a file handle/file descriptor)))
> > > > > > > 
> > > > > > > Seems like ObjectRef is maybe more suitably called ObjectID?
> > > > > > I don't have strong opinion on names. Maybe @dexonsmith @benlangmuir @akyrtzi can also provide some feedbacks for names?
> > > > > > 
> > > > > > In my opinion, `ObjectRef` is named that way because it is like a reference type (just an opaque pointer) and you need to "dereference" it to access underlying data, just that the `dereference` might fail in CAS. The difference between `ObjectRef` and `ObjectHandle` is mostly providing the flexibility for a remote CAS so there is no need to traffic all the data if you only dealing with refs.
> > > > > > 
> > > > > > Also `ObjectRef` is the public type we encourage to use, just like other `Ref` types like `StringRef` and `ArrayRef`. It is cheap and allows quick fetch of the underlying data (at least for the builtin CAS we provided). It doesn't really contains ID, and you can't compare `Ref` from different ObjectStore.
> > > > > > 
> > > > > > I thought about rename `CASID` to `ObjectID` to keep the name consistent, but there is no other more benefit. I am up for renaming types to make it more self-descriptive but I would like to reach an agreement before doing so because we have lots of downstream code needs to upstream and I would like to avoid repeated renames. 
> > > > > I agree with Steven; I think `ObjectRef` deserves the good name here, since that's the one that clients of an `ObjectStore` will work with.  `ObjectHandle` is only used by //implementors// of an `ObjectStore`.  I agree that having both `Ref` and `Handle` is confusing - maybe we should rename Handle to something more verbose and descriptive like `LoadedObjectRef`.
> > > > > 
> > > > > > Seems like ObjectRef is maybe more suitably called ObjectID?
> > > > > 
> > > > > I think this is confusing with `CASID`, which really is an ID.  I don't have a strong opinion on whether `CASID` should be renamed to `ObjectID` or not for consistency.
> > > > In my mind, the name `ObjectID` was "taken", since I originally thought we'd do a `s/CASID/ObjectID/g` at some point (maybe this patch / before upstreaming is the right time?). Maybe it would be even better to do `s/CASID/std::string/`.
> > > > 
> > > > (Historical note: an earlier design had neither ObjectRef nor ObjectHandle, just CASID. CASIDs were tied to a specific CAS instance, had odd lifetimes, and tried to serve all three purposes. We figured out that this was awkward, especially for distributed/remote CAS implementations.)
> > > > 
> > > > `ObjectRef` is a reference to an object (possibly an ID, possible some internal CAS pointer). As @steven_wu points out, having an `ObjectRef` doesn't necessarily mean you know anything about the content of an object.
> > > > 
> > > > `ObjectHandle` promises that you can access the object.
> > > > 
> > > > I think to explain the difference, it helps to have a specific example. If you have a few objects:
> > > > ```
> > > > object-id: 0
> > > > data: "some small data"
> > > > 
> > > > object-id: 1
> > > > data: [4GB of data]
> > > > 
> > > > object-id: 2
> > > > data: "a header"
> > > > refs: 0, 1
> > > > ```
> > > > 
> > > > If you have an `ObjectHandle` for object 2, then you can see that the data is "a header" and you can get the `ObjectRef`s for the objects it references: 0 and 1.
> > > > 
> > > > Those `ObjectRef`s allow you to talk about objects 0 and 1 without necessarily loading / downloading them, which is important because object 1 is 4GB big. You can create a new object that references them, and you can get a serialized reference (CASID/ObjectID/std::string) for referring to it in other contexts.
> > > > 
> > > > Or, you can "load" an `ObjectRef`, get an `ObjectHandle` for that object, and look at its content/refs. If you have a distributed CAS, or the object is big, or [etc.], this action might have significant latency. I imagine we'd eventually want to add an API to allow asynchronous loading.
> > > > 
> > > > It's a good point that "ref" here bears no relation to the "ref" in `ArrayRef`. That makes it confusing. It is identifier-like, even if it doesn't know how to serialize itself into a string (you need the CAS instance to string-ify it for you).
> > > > 
> > > > At a high level, we have three concepts:
> > > > - a serialized, context-free identifier (currently CASID)
> > > > - an opaque, context-sensitive reference to an object known to "exist somewhere" (currently ObjectRef)
> > > > - a context-sensitive reference to an object known to "be local/loaded" (currently ObjectHandle)
> > > > 
> > > > Maybe the following rename could work:
> > > > 1. Delete CASID and replace its usage with `std::string`
> > > > 2. Rename ObjectRef to ObjectID
> > > > 3. Maybe: Rename ObjectHandle to ObjectRef
> > > > 
> > > > I don't have a strong opinion about whether that results in a better world. The "current" names are kind of etched into my brain so it's hard to evaluate.
> > > > 
> > > > @steven_wu and @benlangmuir, WDYT? Is there something I'm missing/forgetting that makes this a bad idea?
> > > > 
> > > > @dblaikie, WDYT? Maybe you have other name ideas after the above explanation?
> > > @benlangmuir, just seeing your reply. I'd have expected clients to work with both `ObjectRef` and `ObjectHandle`; but you have much more experience than I do. (I guess the `ObjectHandle` is only indirectly useful to clients. For any serious work, you want to create a proxy that wraps an `ObjectHandle` and navigates the content in a structured way.)
> > > 
> > > Agreed that `ObjectRef` ought to have a good / easy-to-type name.
> > @dexonsmith Correct, we clean up the interface that no public methods from `ObjectStore` return `ObjectHandle` anymore so users access data in object through ObjectProxy. One less data type to deal with for CAS users and ObjectProxy has slightly better interfaces to type as well.
> > 
> > `ObjectHandle` is type for CAS implementor only now to represent loaded object.
> > I'd have expected clients to work with both `ObjectRef` and `ObjectHandle`
> 
> I don't think `ObjectHandle` has any value for clients, clients will be using `ObjectProxy`.
> 
> > an opaque, context-sensitive reference to an object known to "exist somewhere" (currently ObjectRef)
> 
> Something to highlight about this: For a "remote CAS" implementation you practically don't know that "an object is known to exist" until you try to load it. From that respect there is essentially no functional difference between `CASID` and `ObjectRef` in such a context.
> 
Ah, that sounds great (I apologize for not reading the patch itself). Agreed with @benlangmuir that LoadedObjectHandle seems pretty reasonable if clients never have to use it.

Does my understanding the roles still stand? (Maybe that example could/should be used somewhere in the docs to clarify.)

Also:
- Do we still need CASID? Why not just use `std::string`?
- If we don't need CASID and can delete it, WDYT of renaming ObjectRef to ObjectID? (It could be thought of as an "opaque" identifier.)
- Another idea: rename `ObjectRef` to `ObjectPtr`?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D133716/new/

https://reviews.llvm.org/D133716