[Parallel_libs-commits] [PATCH] D25701: Initial check-in of Acxxel (StreamExecutor renamed)

Tue Oct 18 11:19:10 PDT 2016

jhen added a comment.

In https://reviews.llvm.org/D25701#572395, @jlebar wrote:

> The one thing I really miss from the old SE is not checking the error at every line.  I wonder if we could say that errors carry forward just like they used to?  Or maybe they do actually carry forward and I don't need to have an error check on every line -- I haven't gotten to the implementation yet.  :)

You were right that it did require checking the error at every line. To address this, I've added in a thread_local variable to keep track of the first error status. With this, users can make as many calls as they want without checking the error, then they can do a final check that nothing went wrong.

What do you think?

================
Comment at: acxxel/acxxel.h:57
+///
+/// Acxxel functions as a drop-in replacement for the standard CUDA runtime
+/// library and interoperates seamlessly with kernel calls.
----------------
jlebar wrote:
> Is this true?  Like Acxxel exposes a different interface from the standard CUDA runtime library, so maybe it's not a "drop-in" replacement.  But maybe it's a "replacement", and maybe we should say it doesn't require libcuda.
I changed it to "modern replacement". Unfortunately it does require libcuda and it also requires libcudart right now to access pointers for __device__ variables. We can maybe get rid of the libcudart dependency for clang in the future by including a real drop-in replacement for libcudart, but it's not at that point now.

================
Comment at: acxxel/acxxel.h:114
+  bool DeviceMap = false;
+  bool IOMemory = false;
+};
----------------
jlebar wrote:
> It's not obvious to me what these three mean; maybe they should have comments?
I removed these and the config structs below because they were too CUDA-specific, and I don't think they were very helpful anyway.

================
Comment at: acxxel/acxxel.h:193
+  /// There are no guarantees about ordering for work that is enqueued onto the
+  /// stream while this function is executing. That work may run at the same
+  /// time as the callback.
----------------
jlebar wrote:
> Which function -- the callback, or addCallback()?  (I am legitimately unsure.)
Now explicitly says `addCallback` to clarify.

================
Comment at: acxxel/acxxel.h:269
+  /// Enqueues the event on a stream.
+  Status record(Stream &Stream);
+
----------------
jlebar wrote:
> This strikes me kind of as a weird name; the stream "records" the event?  But I don't have a better suggestion offhand, and if it's consistent with prior art, maybe it's fine.
It's CUDA's name for the operation, but I don't like it either. It's misleading, so I'm renaming it to "enqueue".

================
Comment at: acxxel/acxxel.h:395
+    using DstElementTy =
+        typename std::remove_reference<DeviceDstTy>::type::value_type;
+    static_assert(std::is_same<SrcElementTy, DstElementTy>::value,
----------------
jlebar wrote:
> You only care about the type of `*DeviceSrc.data()` and `*DeviceDst.data()`, right?  Maybe we could just use `decltype` instead of ::value_type, so we'll be compatible with more types.
> 
> (If you don't want to type out the decltype expression over and over, I suppose we could even have our own traits class.  But that may be overkill.)
Actually, I only want `DeviceSrcTy` and `DeviceDstTy` to be either `DeviceMemory<T>` or `DeviceMemorySpan<T>`, neither of which has a `data` member function, so I think this is doing the right thing.

================
Comment at: acxxel/acxxel.h:478
+
+  template <typename T, typename DeviceSrcTy>
+  Status copyDtoPtrH(DeviceSrcTy &&DeviceSrc, T *HostDst,
----------------
jlebar wrote:
> Probably worth adding a comment indicating that if DeviceSrc's size is less than ElementCount, we copy only the first ElementCount elems -- I wasn't sure if that would be an error or what.
Function deleted, as suggested below.

================
Comment at: acxxel/acxxel.h:480
+  Status copyDtoPtrH(DeviceSrcTy &&DeviceSrc, T *HostDst,
+                     ptrdiff_t ElementCount) {
+    using SrcElementTy =
----------------
jlebar wrote:
> Is this better than requiring us to go through a span?  That would shrink the API a bit, and (I would hope) make it more obvious what happens when ElementCount doesn't match DeviceSrc's size.
We can get rid of this.

================
Comment at: acxxel/acxxel.h:538
+  Status copyPtrHtoD(const T *HostSrc, DeviceDstTy &DeviceDst,
+                     ptrdiff_t ElementCount) {
+    using DstElementTy =
----------------
jlebar wrote:
> Similar comment here.
Deleted this function.

================
Comment at: acxxel/acxxel.h:598
+          typename std::remove_reference<decltype(Cont[0])>::type>> {
+    using ValueType = typename std::remove_reference<decltype(Cont[0])>::type;
+    Span<ValueType> Span(Cont);
----------------
jlebar wrote:
> Do we actually care about operator[] as opposed to the `data` function?  That's surprising if so...  Perhaps we should explicitly write down our "concepts" somewhere.
> 
> At least let's add a comment to the header indicating what "concept" Container needs to fulfill.
I changed this to checking `data` instead, and I added a comment to try to describe what is expected of `Container`.

It turns out I was using this for arrays as well as other containers. That doesn't work with `data` member function, so I added an overload for arrays as well.

================
Comment at: acxxel/acxxel.h:643
+                             Span<void *> Arguments, Span<size_t> ArgumentSizes,
+                             size_t SharedMemoryBytes = 0) {
+    return rawEnqueueKernelLaunch(TheStream.TheHandle.get(),
----------------
jlebar wrote:
> I am unclear how this kernel-launching stuff relates to the comment at the top of file saying we don't do this...  Is this for opencl only?
I can't seem to find the comment at the top of the file that you mention here. Can you be a bit more specific about which comment you mean?

================
Comment at: acxxel/acxxel.h:713
+  virtual Expected<void *> rawMallocH(ptrdiff_t ByteCount,
+                                      AsyncHostMemoryConfig Config) = 0;
+  virtual HandleDestructor getFreeHostMemoryHandleDestructor() = 0;
----------------
jlebar wrote:
> Not sure this is a good name, because it makes it sound like it's just calling malloc, when in fact it's also registering the memory.
Changed the name to `rawMallocRegisteredH`.

================
Comment at: acxxel/acxxel.h:757
+
+template <typename DeviceSrcTy, typename DeviceDstTy>
+Status Stream::asyncCopyDToD(DeviceSrcTy &&DeviceSrc, DeviceDstTy &&DeviceDst) {
----------------
jlebar wrote:
> Nit, it seems kind of inconsistent that we declare these out-of-line but the sync versions inline.
I moved the sync copies out of line as well.

================
Comment at: acxxel/acxxel.h:963
+template <typename T>
+DeviceMemory<T> &DeviceMemory<T>::operator=(DeviceMemory &&) noexcept = default;
+
----------------
jlebar wrote:
> Remind me why we can't do these inside the class definition?
clang-tidy doesn't like it. clang-tidy wants them to be noexcept, but it doesn't accept the noexcept annotation if they are declared inline. I think you had some ideas about how this might make sense because default inline means that it might not be created, but default out of line was more strict.

================
Comment at: acxxel/acxxel.h:1011
+      auto Destructor = ThePlatform->getDeviceMemorySpanHandleDestructor();
+      Destructor(const_cast<value_type *>(TheSpanHandle));
+    }
----------------
jlebar wrote:
> Hm, this is two indirect calls, one to get the function pointer, and the other to call it.
> 
> We could make this just one virtual call, but that would be inconsistent with how you do literally everything else here.  So I guess this is fine as-is.  If it becomes a performance issue, we can address it then.
I don't think it's a big deal that it's a bit inconsistent with the other destructors because `DeviceMemorySpan` doesn't own its memory like the others. I switched it to using just a single virtual call.

================
Comment at: acxxel/acxxel.h:1029
+
+  void *baseHandle() const noexcept {
+    return static_cast<void *>(const_cast<value_type *>(TheHandle));
----------------
jlebar wrote:
> This needs noexcept?  That's...interesting.  Maybe deserves a comment.
I think that got put there by accident. It's gone now.

================
Comment at: acxxel/acxxel.h:1045
+    if (!Valid) {
+      std::terminate();
+    }
----------------
jlebar wrote:
> Interesting -- I presume you're doing this because that's what std::span does?  Does std::span not have a way to slice that doesn't do a bounds check?
> 
> Anyway I think this is very likely fine for our use-case.
Yes, I just copied the span API here. I didn't see any way for it to go without bounds checking.

================
Comment at: acxxel/acxxel.h:1077
+
+  explicit DeviceMemorySpan(Platform *ThePlatform, pointer AHandle,
+                            index_type Size, index_type Offset)
----------------
jlebar wrote:
> Do we need "explicit" here?  It makes a subtle difference, but unless you're going for it, maybe we just leave it off?
It was just a leftover from the development process. It is gone now.

================
Comment at: acxxel/acxxel.h:1160
+        static_cast<ElementType *>(ThePointer.get())[I].~ElementType();
+      }
+    }
----------------
jlebar wrote:
>      T *Memory = new (MaybeMemory.getValue()) T[ElementCount];
> 
> Oh, hm.  Are you sure this is the right way to undo your allocation above?  Like, we don't have to somehow delete the placement new'ed array itself?  (I don't see a way to do it on cppreference.)
My understanding is that the destructor of the `ThePointer` member will be called when this function ends. The `ThePointer` member is a `std::unique_ptr` with a custom deleter, and the custom deleter will call out to the platform to "free" the memory owned by `ThePointer`. So I don't think there is any memory leak here.

https://reviews.llvm.org/D25701