[Parallel_libs-commits] [PATCH] D24596: [SE] Support CUDA dynamic shared memory
Justin Lebar via Parallel_libs-commits
parallel_libs-commits at lists.llvm.org
Thu Sep 15 08:54:29 PDT 2016
jlebar accepted this revision.
This revision is now accepted and ready to land.
================
Comment at: streamexecutor/lib/platforms/cuda/CUDAPlatformDevice.cpp:167
@@ +166,3 @@
+ auto Launch = [Function, Stream, BlockSize,
+ GridSize](size_t SharedMemoryBytes, void **ArgumentAddresses) {
+ return CUresultToError(
----------------
Huh, does clang-format actually do this? If so maybe that's worth filing a bug -- that is a strange choice.
================
Comment at: streamexecutor/lib/platforms/cuda/CUDAPlatformDevice.cpp:189
@@ +188,3 @@
+ NonSharedArgumentAddresses[NonSharedIndex++] = ArgumentAddresses[I];
+ return Launch(SharedMemoryBytes, NonSharedArgumentAddresses.data());
+ }
----------------
This whole dance is going to destroy the beautiful efficiency gains you were after, right?
It sort of seems like the only way to make this work would be to have a different args-packing class for each platform. But I am not sure how to do that without introducing virtual function calls, which would *also* destroy your beautiful efficiency gains.
At least let's use llvm::SmallVector so we don't have to malloc anything. And maybe add a comment that we may need to come back and improve this?
================
Comment at: streamexecutor/unittests/CoreTests/CUDATest.cpp:142
@@ +141,3 @@
+}();
+} // namespace __compilergen
+
----------------
Doesn't match the namespace name above. (It's also technically UB for anything other than the compiler to put declarations into `__foo` -- identifiers containing a double underscore are reserved to the implementation.)
https://reviews.llvm.org/D24596