[Parallel_libs-commits] [PATCH] D25701: Initial check-in of Acxxel (StreamExecutor renamed)

Jason Henline via Parallel_libs-commits parallel_libs-commits at lists.llvm.org
Wed Oct 19 10:00:27 PDT 2016


jhen added inline comments.


================
Comment at: acxxel/cuda_acxxel.cpp:29
+/// Index of active device for this thread.
+thread_local int ActiveDeviceIndex;
+
----------------
jlebar wrote:
> Does this have to be per-thread rather than per stream?  Kind of a bummer that one thread can't operate on two streams for different devices (among other things).
> 
> At least maybe setActiveDevice should be called setActiveDeviceForThread.
TL;DR the CUDA driver library forces us to do it per thread or to suffer a performance penalty.

I chose to make it per thread for efficiency reasons. The old StreamExecutor tried to do it per stream, and that led to extra library calls for every operation just to check that the current device matched the one the Stream expected. Ideally we would store a "context" in the stream, the context would know which device it belonged to, and that context would be used to launch all work, but unfortunately that is not how the CUDA driver library works. Instead, the driver library has you set the active context for the thread, and then every subsequent call implicitly uses that active context rather than taking a context argument.

It actually doesn't prevent a thread from using streams from different devices. It just forces that thread to do the "active device" bookkeeping on its own. Want to use stream0? First set the active device to device0. Want to go back to stream1? Remember to set the active device back to device1.
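For illustration, here is a minimal sketch of that bookkeeping using the plain CUDA driver API (not Acxxel code); the function name and parameters are made up for the example, the driver calls themselves are real:

```cpp
#include <cuda.h>
#include <cstddef>

// One thread touching streams on two different devices: each switch to a
// stream from the other device requires making that device's context
// current first, because the driver API never takes a context per call.
void enqueueOnTwoDevices(CUcontext Ctx0, CUstream Stream0, CUdeviceptr Dst0,
                         CUcontext Ctx1, CUstream Stream1, CUdeviceptr Dst1,
                         const void *Src, size_t Bytes) {
  // Work destined for device 0: make its context active on this thread.
  cuCtxSetCurrent(Ctx0);
  cuMemcpyHtoDAsync(Dst0, Src, Bytes, Stream0);

  // Now work for device 1: remember to switch the active context back.
  cuCtxSetCurrent(Ctx1);
  cuMemcpyHtoDAsync(Dst1, Src, Bytes, Stream1);
}
```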

I did change the name as you suggested, though.


================
Comment at: acxxel/cuda_acxxel.cpp:193
+
+  ActiveDeviceIndex = 0;
+  return CUDAPlatform(Contexts);
----------------
jlebar wrote:
> How does this work, exactly?  We set this thread's active device index to 0, but new threads have it as uninitialized memory?  That seems not so good.
Good point. I want every thread to start with its active device set to zero, because that's what the CUDA runtime does. I moved this initialization to the declaration of ActiveDeviceIndex. Based on some experiments and my reading of the standard, that should do the right thing.
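That is, roughly (a sketch of the change, not the exact patch):

```cpp
/// Index of active device for this thread. The initializer is a constant, so
/// every thread, not just the one that runs platform setup, starts with
/// device 0 active, matching the CUDA runtime's default.
thread_local int ActiveDeviceIndex = 0;
```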


https://reviews.llvm.org/D25701

