[Parallel_libs-commits] [PATCH] D25701: Initial check-in of Acxxel (StreamExecutor renamed)
Jason Henline via Parallel_libs-commits
parallel_libs-commits at lists.llvm.org
Tue Oct 18 11:19:10 PDT 2016
jhen added a comment.
In https://reviews.llvm.org/D25701#572395, @jlebar wrote:
> The one thing I really miss from the old SE is not checking the error at every line. I wonder if we could say that errors carry forward just like they used to? Or maybe they do actually carry forward and I don't need to have an error check on every line -- I haven't gotten to the implementation yet. :)
You were right that it did require checking the error at every line. To address this, I've added in a thread_local variable to keep track of the first error status. With this, users can make as many calls as they want without checking the error, then they can do a final check that nothing went wrong.
What do you think?
================
Comment at: acxxel/acxxel.h:57
+///
+/// Acxxel functions as a drop-in replacement for the standard CUDA runtime
+/// library and interoperates seamlessly with kernel calls.
----------------
jlebar wrote:
> Is this true? Like Acxxel exposes a different interface from the standard CUDA runtime library, so maybe it's not a "drop-in" replacement. But maybe it's a "replacement", and maybe we should say it doesn't require libcuda.
I changed it to "modern replacement". Unfortunately it does require libcuda and it also requires libcudart right now to access pointers for __device__ variables. We can maybe get rid of the libcudart dependency for clang in the future by including a real drop-in replacement for libcudart, but it's not at that point now.
================
Comment at: acxxel/acxxel.h:114
+ bool DeviceMap = false;
+ bool IOMemory = false;
+};
----------------
jlebar wrote:
> It's not obvious to me what these three mean; maybe they should have comments?
I removed these and the config structs below because they were too CUDA-specific, and I don't think they were very helpful anyway.
================
Comment at: acxxel/acxxel.h:193
+ /// There are no guarantees about ordering for work that is enqueued onto the
+ /// stream while this function is executing. That work may run at the same
+ /// time as the callback.
----------------
jlebar wrote:
> Which function -- the callback, or addCallback()? (I am legitimately unsure.)
Now explicitly says `addCallback` to clarify.
================
Comment at: acxxel/acxxel.h:269
+ /// Enqueues the event on a stream.
+ Status record(Stream &Stream);
+
----------------
jlebar wrote:
> This strikes me kind of as a weird name; the stream "records" the event? But I don't have a better suggestion offhand, and if it's consistent with prior art, maybe it's fine.
It's CUDA's name for the operation, but I don't like it either. It's misleading, so I'm renaming it to "enqueue".
================
Comment at: acxxel/acxxel.h:395
+ using DstElementTy =
+ typename std::remove_reference<DeviceDstTy>::type::value_type;
+ static_assert(std::is_same<SrcElementTy, DstElementTy>::value,
----------------
jlebar wrote:
> You only care about the type of `*DeviceSrc.data()` and `*DeviceDst.data()`, right? Maybe we could just use `decltype` instead of ::value_type, so we'll be compatible with more types.
>
> (If you don't want to type out the decltype expression over and over, I suppose we could even have our own traits class. But that may be overkill.)
Actually, I only want `DeviceSrcTy` and `DeviceDstTy` to be either `DeviceMemory<T>` or `DeviceMemorySpan<T>`, neither of which has a `data` member function, so I think this is doing the right thing.
================
Comment at: acxxel/acxxel.h:478
+
+ template <typename T, typename DeviceSrcTy>
+ Status copyDtoPtrH(DeviceSrcTy &&DeviceSrc, T *HostDst,
----------------
jlebar wrote:
> Probably worth adding a comment indicating that if DeviceSrc's size is less than ElementCount, we copy only the first ElementCount elems -- I wasn't sure if that would be an error or what.
Function deleted, as suggested below.
================
Comment at: acxxel/acxxel.h:480
+ Status copyDtoPtrH(DeviceSrcTy &&DeviceSrc, T *HostDst,
+ ptrdiff_t ElementCount) {
+ using SrcElementTy =
----------------
jlebar wrote:
> Is this better than requiring us to go through a span? That would shrink the API a bit, and (I would hope) make it more obvious what happens when ElementCount doesn't match DeviceSrc's size.
We can get rid of this.
================
Comment at: acxxel/acxxel.h:538
+ Status copyPtrHtoD(const T *HostSrc, DeviceDstTy &DeviceDst,
+ ptrdiff_t ElementCount) {
+ using DstElementTy =
----------------
jlebar wrote:
> Similar comment here.
Deleted this function.
================
Comment at: acxxel/acxxel.h:598
+ typename std::remove_reference<decltype(Cont[0])>::type>> {
+ using ValueType = typename std::remove_reference<decltype(Cont[0])>::type;
+ Span<ValueType> Span(Cont);
----------------
jlebar wrote:
> Do we actually care about operator[] as opposed to the `data` function? That's surprising if so... Perhaps we should explicitly write down our "concepts" somewhere.
>
> At least let's add a comment to the header indicating what "concept" Container needs to fulfill.
I changed this to checking `data` instead, and I added a comment to try to describe what is expected of `Container`.
It turns out I was using this for arrays as well as other containers. That doesn't work with `data` member function, so I added an overload for arrays as well.
================
Comment at: acxxel/acxxel.h:643
+ Span<void *> Arguments, Span<size_t> ArgumentSizes,
+ size_t SharedMemoryBytes = 0) {
+ return rawEnqueueKernelLaunch(TheStream.TheHandle.get(),
----------------
jlebar wrote:
> I am unclear how this kernel-launching stuff relates to the comment at the top of file saying we don't do this... Is this for opencl only?
I can't seem to find the comment at the top of the file that you mention here. Can you be a bit more specific about which comment you mean?
================
Comment at: acxxel/acxxel.h:713
+ virtual Expected<void *> rawMallocH(ptrdiff_t ByteCount,
+ AsyncHostMemoryConfig Config) = 0;
+ virtual HandleDestructor getFreeHostMemoryHandleDestructor() = 0;
----------------
jlebar wrote:
> Not sure this is a good name, because it makes it sound like it's just calling malloc, when in fact it's also registering the memory.
Changed the name to `rawMallocRegisteredH`.
================
Comment at: acxxel/acxxel.h:757
+
+template <typename DeviceSrcTy, typename DeviceDstTy>
+Status Stream::asyncCopyDToD(DeviceSrcTy &&DeviceSrc, DeviceDstTy &&DeviceDst) {
----------------
jlebar wrote:
> Nit, it seems kind of inconsistent that we declare these out-of-line but the sync versions inline.
I moved the sync copies out of line as well.
================
Comment at: acxxel/acxxel.h:963
+template <typename T>
+DeviceMemory<T> &DeviceMemory<T>::operator=(DeviceMemory &&) noexcept = default;
+
----------------
jlebar wrote:
> Remind me why we can't do these inside the class definition?
clang-tidy doesn't like it. clang-tidy wants them to be noexcept, but it doesn't accept the noexcept annotation if they are declared inline. I think you had some ideas about how this might make sense because default inline means that it might not be created, but default out of line was more strict.
================
Comment at: acxxel/acxxel.h:1011
+ auto Destructor = ThePlatform->getDeviceMemorySpanHandleDestructor();
+ Destructor(const_cast<value_type *>(TheSpanHandle));
+ }
----------------
jlebar wrote:
> Hm, this is two indirect calls, one to get the function pointer, and the other to call it.
>
> We could make this just one virtual call, but that would be inconsistent with how you do literally everything else here. So I guess this is fine as-is. If it becomes a performance issue, we can address it then.
I don't think it's a big deal that it's a bit inconsistent with the other destructors because `DeviceMemorySpan` doesn't own its memory like the others. I switched it to using just a single virtual call.
================
Comment at: acxxel/acxxel.h:1029
+
+ void *baseHandle() const noexcept {
+ return static_cast<void *>(const_cast<value_type *>(TheHandle));
----------------
jlebar wrote:
> This needs noexcept? That's...interesting. Maybe deserves a comment.
I think that got put there by accident. It's gone now.
================
Comment at: acxxel/acxxel.h:1045
+ if (!Valid) {
+ std::terminate();
+ }
----------------
jlebar wrote:
> Interesting -- I presume you're doing this because that's what std::span does? Does std::span not have a way to slice that doesn't do a bounds check?
>
> Anyway I think this is very likely fine for our use-case.
Yes, I just copied the span API here. I didn't see any way for it to go without bounds checking.
================
Comment at: acxxel/acxxel.h:1077
+
+ explicit DeviceMemorySpan(Platform *ThePlatform, pointer AHandle,
+ index_type Size, index_type Offset)
----------------
jlebar wrote:
> Do we need "explicit" here? It makes a subtle difference, but unless you're going for it, maybe we just leave it off?
It was just a leftover from the development process. It is gone now.
================
Comment at: acxxel/acxxel.h:1160
+ static_cast<ElementType *>(ThePointer.get())[I].~ElementType();
+ }
+ }
----------------
jlebar wrote:
> T *Memory = new (MaybeMemory.getValue()) T[ElementCount];
>
> Oh, hm. Are you sure this is the right way to undo your allocation above? Like, we don't have to somehow delete the placement new'ed array itself? (I don't see a way to do it on cppreference.)
My understanding is that the destructor of the `ThePointer` member will be called when this function ends. The `ThePointer` member is a `std::unique_ptr` with a custom deleter, and the custom deleter will call out to the platform to "free" the memory owned by `ThePointer`. So I don't think there is any memory leak here.
https://reviews.llvm.org/D25701
More information about the Parallel_libs-commits
mailing list