[Parallel_libs-commits] [PATCH] D24596: [SE] Support CUDA dynamic shared memory
Justin Lebar via Parallel_libs-commits
parallel_libs-commits at lists.llvm.org
Thu Sep 15 08:54:29 PDT 2016
jlebar accepted this revision.
This revision is now accepted and ready to land.
================
Comment at: streamexecutor/lib/platforms/cuda/CUDAPlatformDevice.cpp:167
@@ +166,3 @@
+ auto Launch = [Function, Stream, BlockSize,
+ GridSize](size_t SharedMemoryBytes, void **ArgumentAddresses) {
+ return CUresultToError(
----------------
Huh, does clang-format actually do this? If so maybe that's worth filing a bug -- that is a strange choice.
================
Comment at: streamexecutor/lib/platforms/cuda/CUDAPlatformDevice.cpp:189
@@ +188,3 @@
+ NonSharedArgumentAddresses[NonSharedIndex++] = ArgumentAddresses[I];
+ return Launch(SharedMemoryBytes, NonSharedArgumentAddresses.data());
+ }
----------------
This whole dance is going to destroy the beautiful efficiency gains you were after, right?
It sort of seems like the only way to make this work would be to have a different args-packing class for each platform. But I am not sure how to do that without introducing virtual function calls, which would *also* destroy your beautiful efficiency gains.
At least let's use llvm::SmallVector so we don't have to malloc anything. And maybe add a comment that we may need to come back and improve this?
================
Comment at: streamexecutor/unittests/CoreTests/CUDATest.cpp:142
@@ +141,3 @@
+}();
+} // namespace __compilergen
+
----------------
Doesn't match the namespace name above. (It's also technically UB for anything other than the compiler to put declarations into `__foo` -- identifiers containing a double underscore are reserved to the implementation.)
https://reviews.llvm.org/D24596