[Parallel_libs-commits] [PATCH] D24353: [SE] RegisteredHostMemory for async device copies

Thu Sep 8 19:20:17 PDT 2016

jlebar added a comment.

I talked to Chandler about this slicing thing...

His suggestion was to give up on deducing T in e.g. thenCopyH2D.  Instead, we'd write function signatures like the following:

  template <typename HostTy, typename DeviceTy>
  thenCopyH2D(HostTy &&Src, DeviceTy &&Dst);

We have constraints on HostTy and DeviceTy that we still need to enforce.  Maybe our constraint is that they both expose a typedef named ElementTy and they are convertible to `RegisteredHostMemorySlice<ElementTy>` and `GlobalMemorySlice<ElementTy>` respectively.  In this case we'd write something like

  static_assert(std::is_same<HostTy::ElementTy, DeviceTy::ElementTy>::value, "...");
  RegisteredHostMemorySlice<HostTy::ElemTy> HostSlice(Src);
  GlobalMemorySlice<DeviceTy::ElemTy> DeviceSlice(Dst);

If this compiles, we're good to go.  If it doesn't compile, that means you passed bad types to this function.

One nice thing about this is that it would let us unify the mutable / immutable slices into a single type (I think):

  RegisteredHostMemorySlice<const HostTy::ElemTy> ImmutableSlice;
  RegisteredHostMemorySlice<HostTy::ElemTy> MutableSlice;

This would solve our combinatorial explosion problem in Stream.h, and wouldn't require complex enable_if shenanigans.  (Although it involves its own shenanigans, I suppose.)

What do you think?

================
Comment at: streamexecutor/include/streamexecutor/HostMemory.h:32
@@ +31,3 @@
+
+/// An immutable slice of registered host memory.
+///
----------------
Can we define in what way the memory is "registered" here?

================
Comment at: streamexecutor/include/streamexecutor/HostMemory.h:161
@@ +160,3 @@
+  ElemT *getPointer() { return static_cast<ElemT *>(Pointer); }
+  size_t getElementCount() const { return ByteCount / sizeof(ElemT); }
+
----------------
It doesn't matter much, but this base class is kind of annoying, partially because we have the added overhead of this division (shift) every time we get the size, but more because it's another class kicking around.

It seems like the main reason we need the base class is to make the out-of-line destructor work without including device.h here.  But we can do something similar without all the fuss of a base class.  We could define an out-of-line function in a "detail" namespace, or we could have a base class with just one function, defined out-of-line, that we call from our destructor.

https://reviews.llvm.org/D24353