[Parallel_libs-commits] [PATCH] D24596: [SE] Support CUDA dynamic shared memory

Jason Henline via Parallel_libs-commits parallel_libs-commits at lists.llvm.org
Thu Sep 15 09:53:30 PDT 2016


jhen added inline comments.

================
Comment at: streamexecutor/lib/platforms/cuda/CUDAPlatformDevice.cpp:167
@@ +166,3 @@
+  auto Launch = [Function, Stream, BlockSize,
+                 GridSize](size_t SharedMemoryBytes, void **ArgumentAddresses) {
+    return CUresultToError(
----------------
jlebar wrote:
> Huh, does clang-format actually do this?  If so maybe that's worth filing a bug -- that is a strange choice.
Yes, clang-format actually does this. I'll create a simple reproducer and file a bug.
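
For reference, a standalone reproducer might look something like the following. This is only a sketch, assuming LLVM style and the default 80-column limit; the variable names are placeholders echoing the snippet above:

```
#include <cstddef>

void reproducer() {
  int Function = 0, Stream = 0, BlockSize = 0, GridSize = 0;
  // With a capture list this long, clang-format wraps mid-capture-list and
  // keeps the parameter list attached to the last capture, producing the
  // layout quoted in the diff hunk above.
  auto Launch = [Function, Stream, BlockSize,
                 GridSize](size_t SharedMemoryBytes, void **ArgumentAddresses) {
    (void)SharedMemoryBytes;
    (void)ArgumentAddresses;
    return Function + Stream + BlockSize + GridSize;
  };
  (void)Launch;
}
```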

================
Comment at: streamexecutor/lib/platforms/cuda/CUDAPlatformDevice.cpp:189
@@ +188,3 @@
+        NonSharedArgumentAddresses[NonSharedIndex++] = ArgumentAddresses[I];
+    return Launch(SharedMemoryBytes, NonSharedArgumentAddresses.data());
+  }
----------------
jlebar wrote:
> This whole dance is going to destroy the beautiful efficiency gains you were after, right?
> 
> It sort of seems like the only way to make this work would be to have a different args-packing class for each platform.  But I am not sure how to do that without introducing virtual function calls, which would *also* destroy your beautiful efficiency gains.
> 
> At least let's use llvm::SmallVector so we don't have to malloc anything.  And maybe add a comment that we may need to come back and improve this?
While optimizing this internally, we took the approach that the dynamic shared memory case is uncommon, so we would accept being inefficient there as long as we stayed efficient in the case of no dynamic shared memory.

So, the idea is that, other than the check for `ArgumentArray.getArgumentCount() == 0`, the no-dynamic-shared-memory case should take advantage of the efficiency gains in the quirky packed argument array design. Definitely let me know if I've done something that has broken that case.

For the general dynamic shared memory case, I couldn't think of a good way to make it efficient for CUDA and OpenCL at the same time. As you mentioned, it might require virtual function calls, which would hurt both cases. But again, I think we're OK to be less efficient in this case.

Actually, now that I think of it, we could be much more efficient in the most important specific case of dynamic shared memory--the case where there is only one dynamic shared memory argument and it is the first one. That would work for all CUDA cases, and OpenCL users could take advantage of it as well by choosing to write their kernels that way. For now, I've switched to using llvm::SmallVector (thanks for that idea!) and added a comment describing how we might improve this in the future.
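
For readers following along, here is a rough sketch of the filtering approach described above. It is not the exact patch; `ArgKind`, `launchFiltered`, and the parameter names are invented stand-ins for the real StreamExecutor types:

```
#include "llvm/ADT/SmallVector.h"
#include <cstddef>

// Hypothetical stand-in for the real argument-type tag in StreamExecutor.
enum class ArgKind { Value, SharedDeviceMemory };

// Build the address array for the underlying launch call, skipping shared
// memory slots and summing their sizes into the dynamic shared memory byte
// count. SmallVector keeps the common few-argument case off the heap.
template <typename LaunchFn>
auto launchFiltered(LaunchFn Launch, const ArgKind *Kinds, const size_t *Sizes,
                    void *const *Addresses, size_t Count) {
  llvm::SmallVector<void *, 16> NonSharedArgumentAddresses;
  size_t SharedMemoryBytes = 0;
  for (size_t I = 0; I < Count; ++I) {
    if (Kinds[I] == ArgKind::SharedDeviceMemory)
      SharedMemoryBytes += Sizes[I]; // each shared slot contributes its size
    else
      NonSharedArgumentAddresses.push_back(Addresses[I]);
  }
  return Launch(SharedMemoryBytes, NonSharedArgumentAddresses.data());
}
```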

================
Comment at: streamexecutor/unittests/CoreTests/CUDATest.cpp:142
@@ +141,3 @@
+}();
+} // namespace __compilergen
+
----------------
jlebar wrote:
> Doesn't match ns name above.  (It's going to be technically UB for something other than the compiler to put anything into __foo.)
Thanks for catching that. I changed the namespace name from the example code to avoid the UB, but I missed updating the closing comment.
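
For anyone skimming, the rule in play: identifiers containing a double underscore (like `__compilergen`) are reserved to the implementation, so defining entities inside such a namespace in hand-written test code is UB. A minimal sketch of the fix (the actual namespace name in the patch may differ):

```
namespace compilergen { // hypothetical non-reserved spelling
// ... test stand-ins for the compiler-generated kernel code ...
} // namespace compilergen (closing comment now matches the namespace name)
```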


https://reviews.llvm.org/D24596
