[Parallel_libs-commits] [PATCH] D25701: Initial check-in of Acxxel (StreamExecutor renamed)

Fri Oct 21 18:15:42 PDT 2016

jlebar added a comment.

This all looks really good to me.

================
Comment at: acxxel/cuda_acxxel.cpp:29
+/// Index of active device for this thread.
+thread_local int ActiveDeviceIndex;
+
----------------
jhen wrote:
> jlebar wrote:
> > Does this have to be per-thread rather than per stream?  Kind of a bummer that one thread can't operate on two streams for different devices (among other things).
> > 
> > At least maybe setActiveDevice should be called setActiveDeviceForThread.
> TL;DR the CUDA driver library forces us to do it per thread or to suffer a performance penalty.
> 
> I chose it to be per thread for efficiency reasons. The old StreamExecutor tried to do it per stream, and this led to extra library calls for every operation to check that the current device was equal to the one expected by the Stream. Ideally we would just store a "context" in the stream, the context would know which device it belonged to, and the context would be used to launch all work, but unfortunately that is not how the CUDA driver library works. Instead, the CUDA driver library works a lot like this where it allows you to set the active context, but then you don't pass the context to each method call, instead, it just uses the active context.
> 
> It actually doesn't prevent a thread from using streams from different devices. It just forces that thread to do the "active device" bookkeeping on its own. Want to use stream0? First set the active device to device0. Want to go back to stream1? Remember to set the active device back to device1.
> 
> I did change the name as you suggested, though.
> I chose it to be per thread for efficiency reasons. 

OK, I am totally onboard with that -- zero-cost abstractions.

But maybe we can have assert()s that ensure that users are always switching to a stream's device before touching that stream?  If we have to have an `ifndef NDEBUG` variable or two inside the stream class to support this, that's fine by me.

We also need to have comments explaining this weird requirement.

================
Comment at: acxxel/examples/opencl_example.cpp:56
+  std::array<float, 3> Y = {{3.f, 4.f, 5.f}};
+  std::array<float, 3> Expected = {{3.f, 6.f, 9.f}};
+  saxpy(A, X, Y);
----------------
Nit, we can get away with single braces here, which I think is more natural.

================
Comment at: acxxel/examples/opencl_example.cpp:59
+  for (int I = 0; I < 3; ++I)
+    assert(X[I] == Expected[I]);
+}
----------------
If we're using assert(), we should #error if we're built with NDEBUG, or alternatively #undef NDEBUG as the first thing in this file (before any #includes).

================
Comment at: acxxel/opencl_acxxel.cpp:24
+namespace acxxel {
+namespace opencl {
+
----------------
I wonder if we need this extra namespace (same for cuda).  We don't actually expose anything other than the make-platform function in it.  Do we plan on exposing anything else?

================
Comment at: acxxel/opencl_acxxel.cpp:39
+struct OpenCLEvent {
+  OpenCLEvent(cl_event Event) : Event(Event) {}
+
----------------
Do you want this to be explicit?

================
Comment at: acxxel/opencl_acxxel.cpp:45
+
+thread_local int ActiveDeviceIndex = 0;
+
----------------
Same for CUDA: Using TLS like this means that it doesn't ever make sense for us to have two platform objects going.  Which is probably correct for other reasons, too.

So maybe instead of having createFoo functions, we should have getFoo functions that return a singleton (created lazily, using `static Foo* = new Foo`).

I know this is basically what the current string-to-platform function is trying to do, but I guess the suggestion is to push the singleton-ness down into the lower-level factory functions, since singletonness is necessary for correctness.

================
Comment at: acxxel/opencl_acxxel.cpp:73
+
+class OpenCLPlatform : public Platform {
+public:
----------------
This could live inside the anon ns?  That's a potential speedup, too, since it lets the compiler make assumptions about this class that it can't otherwise.

Same for CUDA, if applicable (I didn't check).

================
Comment at: acxxel/opencl_acxxel.cpp:176
+    cl_uint NumDevices;
+    cl_device_id Devices[100];
+    if (cl_int Result =
----------------
s/100/MaxNumEntries/?

================
Comment at: acxxel/opencl_acxxel.cpp:214
+  if (DeviceIndex == ActiveDeviceIndex)
+    return Status();
+  if (static_cast<size_t>(DeviceIndex) >= FullDeviceIDs.size())
----------------
It's a nit, but I'm not sure we need this first if statement?

================
Comment at: acxxel/opencl_acxxel.cpp:535
+  OpenCLEvent *CLEvent = static_cast<OpenCLEvent *>(Event);
+  std::unique_lock<std::mutex> Lock(CLEvent->Mutex);
+  cl_event OldEvent = CLEvent->Event;
----------------
Did we decide to make streams single-threaded to avoid this overhead?

If so, in addition to getting rid of the mutex, we should get rid of the comment about concurrently enqueueing an event, since it's not valid to touch the stream concurrently at all.

================
Comment at: acxxel/span.h:10
+///
+/// This file defines the span utility class.
+///
----------------
I really don't like these top-level comments that say nothing you couldn't have guessed from the filename, even though we have them all over LLVM.

Does having this really help doxygen?  If so, would it make doxygen look awful if we moved the comment currently on the span class into this block, so that it actually provided some value?

================
Comment at: acxxel/span.h:26
+/// Value used to indicate slicing to the end of the span.
+static std::ptrdiff_t dynamic_extent = -1; // NOLINT
+
----------------
Can we make this constexpr?

================
Comment at: acxxel/span.h:59
+    if (Count < 0 || (!Ptr && Count))
+      std::terminate();
+  }
----------------
I am really surprised that this is what they suggest in the proposal.  It's fine for now, if it becomes a problem we can turn it into an assert().

================
Comment at: acxxel/span.h:94
+           !std::is_base_of<SpanBase, Container>::value &&
+           std::is_convertible<decltype(&Cont[0]), pointer>::value>::type * =
+           nullptr)
----------------
Nit, could we just use `data()` in the decltype?  We're going to use it in the constructor below anyway.  Or is this as specified and the spec is just weird?

================
Comment at: acxxel/span.h:195
+
+  pointer data() const noexcept { return Data; }
+
----------------
I cannot figure out the reason that some of these functions are noexcept but not others.

================
Comment at: acxxel/status.h:22
+#ifdef __clang__
+#define ACXXEL_WARN_UNUSED_RESULT __attribute__((warn_unused_result))
+#else
----------------
gcc seems to support this, at least at head: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes

================
Comment at: acxxel/status.h:34
+public:
+  /// Creates an Status representing success.
+  Status() : HasMessage(false) {}
----------------
s/an/a/, and below.

================
Comment at: acxxel/status.h:57
+    return *this;
+  }
+
----------------
Can we not `=default` these four?

================
Comment at: acxxel/status.h:82
+  // Intentionally implicit.
+  Expected(const Status &AnError)
+      : TheState(State::FAILURE), TheError(AnError) {
----------------
Since we're going to take ownership of the status, we should take it by value, rather than const ref (and then `std::move` it into `TheError`).

This will make it work efficiently if we're passed an rvalue.

================
Comment at: acxxel/status.h:94
+  // Intentionally implicit.
+  Expected(T &&Value) : TheState(State::SUCCESS), TheValue(std::move(Value)) {}
+
----------------
We can do the same thing with Value -- just take it by value and std::move it.

================
Comment at: acxxel/status.h:154
+    return *this;
+  }
+
----------------
dblaikie taught me today that we could also take Expected by value in our operator=.  This would let us have 
just one overload instead of two, kind of nice.

================
Comment at: acxxel/tests/acxxel_test.cpp:359
+                   .takeStatus()
+                   .isError());
+
----------------
Can we use proper synchronization operations, like a condvar?  volatile doesn't necessarily do what we want on all platforms -- just sort of happens to work on x86.

================
Comment at: acxxel/tests/acxxel_test.cpp:362
+  // Event0 can only occur after GoFlag and MarkerFlag are set.
+  EXPECT_FALSE(Stream0.enqueueEvent(Event0).takeStatus().isError());
+
----------------
Wait...we have EXPECT*?  Can we use this in the tests where we're currently using `assert()`s?

================
Comment at: acxxel/tests/span_test.cpp:24
+
+template <typename T, size_t N> size_t size(T (&)[N]) { return N; }
+
----------------
Nit, maybe we should call this arraySize, since it only works on arrays.

================
Comment at: acxxel/tests/span_test.cpp:37
+TEST(Span, PtrSizeConstruction) {
+  int ZeroSize = 0;
+  acxxel::Span<int> Span0(nullptr, ZeroSize);
----------------
Nit, not sure why we have a variable for this.  If you just want to make it clear what the second arg to the Span constructor means, you could write

  Span(nullptr, /* size = */ 0);

I believe we even have a clang-tidy check to check comments in this style.

================
Comment at: acxxel/tests/span_test.cpp:79
+TEST(Span, ArrayConstruction) {
+  int Array[3] = {0, 1, 2};
+  acxxel::Span<int> Span(Array);
----------------
Nit, should be consistent wrt we have an explicit length in the array or not.  I kind of prefer not.  (This appears in other files as well.)

================
Comment at: acxxel/tests/span_test.cpp:86
+TEST(Span, StdArrayConstruction) {
+  std::array<int, 3> Array{{0, 1, 2}};
+  acxxel::Span<int> Span(Array);
----------------
Nit, prefer constructing std::arrays with `= {1, 2, 3}`, as it's more natural.  Also, double-braces have this truly awful footgun, which has permanently scarred those of us who have rediscovered it ourselves: https://youtu.be/rNNnPrMHsAA?t=11m15s

https://reviews.llvm.org/D25701