[Parallel_libs-commits] [PATCH] D23211: [StreamExecutor] Add DeviceMemory and kernel arg packing

Fri Aug 5 11:09:03 PDT 2016

jlebar added inline comments.

================
Comment at: streamexecutor/include/streamexecutor/DeviceMemory.h:72
@@ +71,3 @@
+
+  /// Assignment copyable like a pointer.
+  DeviceMemoryBase &operator=(const DeviceMemoryBase &) = default;
----------------
"Copy-assignable", same below.

================
Comment at: streamexecutor/include/streamexecutor/DeviceMemory.h:82
@@ +81,3 @@
+  /// device virtual address space, but is platform-dependent.
+  const void *getHandle() const { return Handle; }
+
----------------
Should we have a non-const version of this too?  (Not sure if that makes sense.)

================
Comment at: streamexecutor/include/streamexecutor/DeviceMemory.h:152
@@ +151,3 @@
+/// The treatment of shared memory in StreamExecutor matches the way it is done
+/// in OpenCL where a kernel takes any number of shared memory sizes as kernel
+/// function arguments.
----------------
OpenCL, where

================
Comment at: streamexecutor/include/streamexecutor/DeviceMemory.h:157
@@ +156,3 @@
+/// StreamExecutor handles this by allowing CUDA kernel signatures that take
+/// multiple SharedDeviceMemory arguments, and by simply adding together all the
+/// shared memory sizes to get the final shared memory size that is used to
----------------
s/and by/and/ (we already have a "by" earlier)

================
Comment at: streamexecutor/include/streamexecutor/DeviceMemory.h:161
@@ +160,3 @@
+template <typename ElemT>
+class SharedDeviceMemory : public DeviceMemory<ElemT> {
+public:
----------------
Some high-level questions about this API:

* Why does SharedDeviceMemory inherit from DeviceMemoryBase, when its handle is always null?  It seems like SharedDeviceMemory is inherently a different thing because it never has a handle.

* Do we need to allow null handles on DeviceMemoryBase (other than for shared memory)?  In general if we can reduce the number of states in our state machine, that makes life easier for everyone.

* Same for size-zero device/shared memory objects.

* It seems inconsistent that we allow untyped device-memory handles, but do not allow untyped shared-memory handles.  Why is that?

================
Comment at: streamexecutor/include/streamexecutor/PackedKernelArgumentArray.h:32
@@ +31,3 @@
+/// stored as a struct of arrays rather than an array of structs because CUDA
+/// kernel launches take an array of argument addresses. Having created the
+/// array of argument addresses here, no further work will need to be done in
----------------
> CUDA kernel launches take an array of argument addresses

The code in CGNVCUDARuntime::emitDeviceStubBody doesn't look like this, but maybe this is correct from the perspective of the CUDA driver API.  It may be worth clarifying, as presumably people will be more familiar with the runtime API.

================
Comment at: streamexecutor/include/streamexecutor/PackedKernelArgumentArray.h:47
@@ +46,3 @@
+/// * array of argument types, and
+/// * shared pointer count.
+/// This information should be enough to allow any platform to launch the kernel
----------------
Nit, maybe indent a bit?

================
Comment at: streamexecutor/lib/unittests/PackedKernelArgumentArrayTest.cpp:33
@@ +32,3 @@
+  return se::PackedKernelArgumentArray<ParameterTs...>(Arguments...);
+}
+
----------------
This seems useful and idiomatic (e.g. make_tuple) -- maybe it should be a public function?

================
Comment at: streamexecutor/lib/unittests/PackedKernelArgumentArrayTest.cpp:55
@@ +54,3 @@
+  se::DeviceMemory<int> DeviceMemory;
+  se::SharedDeviceMemory<int> SharedDeviceMemory;
+};
----------------
It's kind of confusing to me that we give these variables the same names as their types.  Maybe if we abbreviated to "DeviceMemBase" that would be easier.  Or perhaps "UntypedDeviceMem", "TypedDeviceMem", "SharedDeviceMem"?

================
Comment at: streamexecutor/lib/unittests/PackedKernelArgumentArrayTest.cpp:67
@@ +66,3 @@
+  EXPECT_EQ(ExpectedSize, Observed.getSize(Index)) << "Index: " << Index;
+  EXPECT_EQ(ExpectedType, Observed.getType(Index)) << "Index: " << Index;
+}
----------------
You could use SCOPED_TRACE, maybe it's a bit cleaner.

================
Comment at: streamexecutor/lib/unittests/PackedKernelArgumentArrayTest.cpp:169
@@ +168,3 @@
+}
+
+} // namespace
----------------
It looks like we don't exercise all of the functions in the pack, particularly the ones to get the raw arrays (maybe others too, I didn't look hard).

https://reviews.llvm.org/D23211