[PATCH] D133716: [CAS] Add LLVMCAS library with InMemoryCAS implementation
Argyrios Kyrtzidis via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Thu Dec 8 13:00:16 PST 2022
akyrtzi added inline comments.
================
Comment at: llvm/include/llvm/CAS/CASReference.h:162
+/// data and references.
+class ObjectHandle : public ReferenceBase {
+public:
----------------
benlangmuir wrote:
> dexonsmith wrote:
> > dexonsmith wrote:
> > > akyrtzi wrote:
> > > > steven_wu wrote:
> > > > > dexonsmith wrote:
> > > > > > dexonsmith wrote:
> > > > > > > benlangmuir wrote:
> > > > > > > > steven_wu wrote:
> > > > > > > > > dblaikie wrote:
> > > > > > > > > > steven_wu wrote:
> > > > > > > > > > > dblaikie wrote:
> > > > > > > > > > > > Since `Handle` and `Ref` aren't especially self documenting (while reading the docs for `Ref` I wondered if it should be called `Handle` - until I got to the bit where it explained that that exists too/separately) - is there something that'd be more explicit about the difference between these? (like, I wonder if `ObjectHandle` could be `Object`? - not sure that'd make `ObjectRef` more obvious, though... )
> > > > > > > > > > > `ObjectRef` is just a reference, and you can't access anything in the object it points to, until you load the object using ObjectRef, which will turn into an `ObjectHandle`. The load can have latency or can fail.
> > > > > > > > > > >
> > > > > > > > > > > Maybe the comments in ObjectStore.h is a better explanation with a bigger picture? Or should we put all the documentation into one place so it is easier to read?
> > > > > > > > > > Still feels like the names could be more self-descriptive, even with/independent of documentation improvements. `Ref` and `Handle` both tend to connote a thing you can use to inspect the thing being referred to or handled. (if anything, I guess I'd actually expect a Handle to be the inaccessible version and the Ref to be the accessible version (thinking StringRef, ArrayRef, etc - where the data is directly accessible through that abstraction, whereas handles sometimes you have to pass back into some other object/API to then get the data (like a file handle/file descriptor)))
> > > > > > > > > >
> > > > > > > > > > Seems like ObjectRef is maybe more suitably called ObjectID?
> > > > > > > > > I don't have strong opinion on names. Maybe @dexonsmith @benlangmuir @akyrtzi can also provide some feedbacks for names?
> > > > > > > > >
> > > > > > > > > In my opinion, `ObjectRef` is named that way because it is like a reference type (just an opaque pointer) and you need to "dereference" it to access underlying data, just that the `dereference` might fail in CAS. The difference between `ObjectRef` and `ObjectHandle` is mostly providing the flexibility for a remote CAS so there is no need to traffic all the data if you only dealing with refs.
> > > > > > > > >
> > > > > > > > > Also `ObjectRef` is the public type we encourage to use, just like other `Ref` types like `StringRef` and `ArrayRef`. It is cheap and allows quick fetch of the underlying data (at least for the builtin CAS we provided). It doesn't really contains ID, and you can't compare `Ref` from different ObjectStore.
> > > > > > > > >
> > > > > > > > > I thought about rename `CASID` to `ObjectID` to keep the name consistent, but there is no other more benefit. I am up for renaming types to make it more self-descriptive but I would like to reach an agreement before doing so because we have lots of downstream code needs to upstream and I would like to avoid repeated renames.
> > > > > > > > I agree with Steven; I think `ObjectRef` deserves the good name here, since that's the one that clients of an `ObjectStore` will work with. `ObjectHandle` is only used by //implementors// of an `ObjectStore`. I agree that having both `Ref` and `Handle` is confusing - maybe we should rename Handle to something more verbose and descriptive like `LoadedObjectRef`.
> > > > > > > >
> > > > > > > > > Seems like ObjectRef is maybe more suitably called ObjectID?
> > > > > > > >
> > > > > > > > I think this is confusing with `CASID`, which really is an ID. I don't have a strong opinion on whether `CASID` should be renamed to `ObjectID` or not for consistency.
> > > > > > > In my mind, the name `ObjectID` was "taken", since I originally thought we'd do a `s/CASID/ObjectID/g` at some point (maybe this patch / before upstreaming is the right time?). Maybe it would be even better to do `s/CASID/std::string/`.
> > > > > > >
> > > > > > > (Historical note: an earlier design had neither ObjectRef nor ObjectHandle, just CASID. CASIDs were tied to a specific CAS instance, had odd lifetimes, and tried to serve all three purposes. We figured out that this was awkward, especially for distributed/remote CAS implementations.)
> > > > > > >
> > > > > > > `ObjectRef` is a reference to an object (possibly an ID, possible some internal CAS pointer). As @steven_wu points out, having an `ObjectRef` doesn't necessarily mean you know anything about the content of an object.
> > > > > > >
> > > > > > > `ObjectHandle` promises that you can access the object.
> > > > > > >
> > > > > > > I think to explain the difference, it helps to have a specific example. If you have a few objects:
> > > > > > > ```
> > > > > > > object-id: 0
> > > > > > > data: "some small data"
> > > > > > >
> > > > > > > object-id: 1
> > > > > > > data: [4GB of data]
> > > > > > >
> > > > > > > object-id: 2
> > > > > > > data: "a header"
> > > > > > > refs: 0, 1
> > > > > > > ```
> > > > > > >
> > > > > > > If you have an `ObjectHandle` for object 2, then you can see that the data is "a header" and you can get the `ObjectRef`s for the objects it references: 0 and 1.
> > > > > > >
> > > > > > > Those `ObjectRef`s allow you to talk about objects 0 and 1 without necessarily loading / downloading them, which is important because object 1 is 4GB big. You can create a new object that references them, and you can get a serialized reference (CASID/ObjectID/std::string) for referring to it in other contexts.
> > > > > > >
> > > > > > > Or, you can "load" an `ObjectRef`, get an `ObjectHandle` for that object, and look at its content/refs. If you have a distributed CAS, or the object is big, or [etc.], this action might have significant latency. I imagine we'd eventually want to add an API to allow asynchronous loading.
> > > > > > >
> > > > > > > It's a good point that "ref" here bears no relation to the "ref" in `ArrayRef`. That makes it confusing. It is identifier-like, even if it doesn't know how to serialize itself into a string (you need the CAS instance to string-ify it for you).
> > > > > > >
> > > > > > > At a high level, we have three concepts:
> > > > > > > - a serialized, context-free identifier (currently CASID)
> > > > > > > - an opaque, context-sensitive reference to an object known to "exist somewhere" (currently ObjectRef)
> > > > > > > - a context-sensitive reference to an object known to "be local/loaded" (currently ObjectHandle)
> > > > > > >
> > > > > > > Maybe the following rename could work:
> > > > > > > 1. Delete CASID and replace its usage with `std::string`
> > > > > > > 2. Rename ObjectRef to ObjectID
> > > > > > > 3. Maybe: Rename ObjectHandle to ObjectRef
> > > > > > >
> > > > > > > I don't have a strong opinion about whether that results in a better world. The "current" names are kind of etched into my brain so it's hard to evaluate.
> > > > > > >
> > > > > > > @steven_wu and @benlangmuir, WDYT? Is there something I'm missing/forgetting that makes this a bad idea?
> > > > > > >
> > > > > > > @dblaikie, WDYT? Maybe you have other name ideas after the above explanation?
> > > > > > @benlangmuir, just seeing your reply. I'd have expected clients to work with both `ObjectRef` and `ObjectHandle`; but you have much more experience than I do. (I guess the `ObjectHandle` is only indirectly useful to clients. For any serious work, you want to create a proxy that wraps an `ObjectHandle` and navigates the content in a structured way.)
> > > > > >
> > > > > > Agreed that `ObjectRef` ought to have a good / easy-to-type name.
> > > > > @dexonsmith Correct, we clean up the interface that no public methods from `ObjectStore` return `ObjectHandle` anymore so users access data in object through ObjectProxy. One less data type to deal with for CAS users and ObjectProxy has slightly better interfaces to type as well.
> > > > >
> > > > > `ObjectHandle` is type for CAS implementor only now to represent loaded object.
> > > > > I'd have expected clients to work with both `ObjectRef` and `ObjectHandle`
> > > >
> > > > I don't think `ObjectHandle` has any value for clients, clients will be using `ObjectProxy`.
> > > >
> > > > > an opaque, context-sensitive reference to an object known to "exist somewhere" (currently ObjectRef)
> > > >
> > > > Something to highlight about this: For a "remote CAS" implementation you practically don't know that "an object is known to exist" until you try to load it. From that respect there is essentially no functional difference between `CASID` and `ObjectRef` in such a context.
> > > >
> > > Ah, that sounds great (I apologize for not reading the patch itself). Agreed with @benlangmuir that LoadedObjectHandle seems pretty reasonable if clients never have to use it.
> > >
> > > Does my understanding the roles still stand? (Maybe that example could/should be used somewhere in the docs to clarify.)
> > >
> > > Also:
> > > - Do we still need CASID? Why not just use `std::string`?
> > > - If we don't need CASID and can delete it, WDYT of renaming ObjectRef to ObjectID? (It could be thought of as an "opaque" identifier.)
> > > - Another idea: rename `ObjectRef` to `ObjectPtr`?
> > > Something to highlight about this: For a "remote CAS" implementation you practically don't know that "an object is known to exist" until you try to load it. From that respect there is essentially no functional difference between CASID and ObjectRef in such a context.
> > >
> >
> > Ah, is that because a remote CAS may have garbage-collected the object? So, then, `ObjectRef` is just an object we know how to point at?
> >
> > (Or do you just mean that loading could fail because the remote CAS goes down?)
> > Ah, is that because a remote CAS may have garbage-collected the object? So, then, ObjectRef is just an object we know how to point at?
> > (Or do you just mean that loading could fail because the remote CAS goes down?)
>
> We do not check whether an object exists at all when creating the ObjectRef for a remote CAS: it's a wasted round trip over the network since you can't guarantee the object still exists later, and don't get a performance win for knowing it existed in the past. This is different with the local in-memory and on-disk CASes, because they can check for existence cheaply and provide a faster ref implementation once we know the object exists.
>
> > - a serialized, context-free identifier (currently CASID)
>
> There are two levels here:
> - a serialized, context-free identifier (std::string, e.g. "llvmcas://<serialized hash>")
> - a context-sensitive identifier hash containing raw hash bytes (CASID) -- you need a CASContext to use it
>
> > an opaque, context-sensitive reference to an object known to "exist somewhere" (currently ObjectRef)
>
> Not known to exist anywhere, but otherwise yes.
>
> > a context-sensitive reference to an object known to "be local/loaded" (currently ObjectHandle)
>
> Correct.
>
> >
> > Also:
> > - Do we still need CASID? Why not just use `std::string`?
>
> So the question here is whether CASID is pulling its weight as a client-visible type. It should be more efficient than the serialized string, since you have the hash bytes immediately available rather than parsing them out; it can be smaller for the same reason. I don't know if have measured this -- @steven_wu, @akyrtzi any thoughts?
>
> > - If we don't need CASID and can delete it, WDYT of renaming ObjectRef to ObjectID? (It could be thought of as an "opaque" identifier.)
>
> If we can drop CASID, I'm fine with this.
>
> > - Another idea: rename `ObjectRef` to `ObjectPtr`?
>
> I find Ref clearer than Ptr for what it does.
I like the idea of dropping `CASID` entirely. I'm also fine with renaming `ObjectRef` to `ObjectID`.
I would also suggest that we change
```
virtual Expected<ObjectHandle> load(ObjectRef Ref)
```
to
```
virtual Expected<Optional<ObjectHandle>> load(ObjectRef Ref)
```
The optional indicates whether the object existed or not, and we reserve errors for catastrophic failures (e.g. "network went down"). The client could choose to turn the "object doesn't exist" `None` into an error (or not, it depends on the context) but this case should be distinguishable from `ObjectStore`'s API.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D133716/new/
https://reviews.llvm.org/D133716
More information about the llvm-commits
mailing list