[libc-commits] [libc] [libc][Docs] Update the GPU RPC documentation (PR #79069)
Nick Desaulniers via libc-commits
libc-commits at lists.llvm.org
Tue Jan 23 10:30:56 PST 2024
================
@@ -11,10 +11,298 @@ Remote Procedure Calls
Remote Procedure Call Implementation
====================================
-Certain features from the standard C library, such as allocation or printing,
-require support from the operating system. We instead implement a remote
-procedure call (RPC) interface to allow submitting work from the GPU to a host
-server that forwards it to the host system.
+Traditionally, the C library abstracts over several functions that interface
+with the platform's operating system through system calls. The GPU, however,
+does not provide an operating system that can handle target-dependent
+operations. Instead, we implemented remote procedure calls to interface with
+the host's operating system while executing on a GPU.
+
+We implemented remote procedure calls using unified virtual memory to create a
+shared communication channel between the two processes. This memory is often
+pinned memory that can be accessed asynchronously and atomically by multiple
+processes simultaneously. This support means that we can simply provide mutual
+exclusion on a shared buffer to swap work back and forth between the host
+system and the GPU. We can then use this to create a simple client-server
+protocol using this shared memory.
+
+This work treats the GPU as a client and the host as a server. The client
+initiates communications while the server listens for them. In order to
+communicate between the host and the device, we simply maintain a buffer of
+memory and two mailboxes. One mailbox is write-only while the other is
+read-only. This exposes three primitive operations: using the buffer, giving
+away ownership, and waiting for ownership. This is implemented as a half-duplex
+transmission channel between the two sides. We decided to assign ownership of
+the buffer to the client when the inbox and outbox bits are equal and to the
+server when they are not.
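+
+The ownership rule can be stated directly in terms of the two mailbox bits.
+The helpers below are purely illustrative and not part of the actual
+interface:
+
+.. code-block:: c++
+
+  #include <cstdint>
+
+  // The client owns the buffer when its inbox and outbox bits agree, the
+  // server owns it when they differ.
+  bool client_owns_buffer(uint32_t in, uint32_t out) { return in == out; }
+  bool server_owns_buffer(uint32_t in, uint32_t out) { return in != out; }
+
+  // Toggling the outbox after using the buffer flips the relation and hands
+  // ownership to the other side; waiting until the inbox matches the outbox
+  // again waits for it to be handed back.
+  uint32_t give_away_ownership(uint32_t out) { return out ^ 1u; }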
+
+In order to make this transmission channel thread-safe, we abstract ownership of
+the given mailbox pair and buffer around a port, effectively acting as a lock
+and an index into the allocated buffer slice. The server and device have
+independent locks around the given port. In this scheme, the buffer can be used
+to communicate intent and data generically with the server. We then simply
+provide multiple copies of this protocol and expose them as multiple ports.
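+
+Conceptually, a port is little more than a claimed lock plus an index that
+selects one mailbox pair and buffer slice. A hypothetical sketch of such a
+handle, using illustrative names only:
+
+.. code-block:: c++
+
+  #include <atomic>
+  #include <cstdint>
+
+  constexpr uint32_t NUM_PORTS = 64; // assumed number of available ports
+
+  // A port handle: which mailbox pair / buffer slice is in use and which
+  // threads are participating in the communication.
+  struct Port {
+    uint32_t index;
+    uint64_t lane_mask;
+  };
+
+  // Each side keeps its own lock per port so the client and the server can
+  // claim the same index independently.
+  std::atomic<uint32_t> client_locks[NUM_PORTS] = {};
+  std::atomic<uint32_t> server_locks[NUM_PORTS] = {};
+
+  // A port is claimed by atomically setting the lock bit for its index.
+  bool try_claim(std::atomic<uint32_t> &lock) {
+    uint32_t expected = 0;
+    return lock.compare_exchange_strong(expected, 1);
+  }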
+
+If this were simply a standard CPU system, this would be sufficient. However,
+GPUs have many unique architectural challenges. First, GPU threads execute in
+lock-step with each other in groups typically called warps or wavefronts. We
+need to target the smallest unit of independent parallelism, so the RPC
+interface needs to handle an entire group of threads at once. This is done by
+increasing the size of the buffer and adding a thread mask argument so the
+server knows which threads are active when it handles the communication. Second,
+GPUs generally have no forward progress guarantees. In order to guarantee we do
+not encounter deadlocks while executing, it is required that the number of ports
+matches the maximum amount of hardware parallelism on the device. It is also
+very important that the thread mask remains consistent while interfacing with
+the port.
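+
+As a rough illustration, the packet can then be pictured as a header carrying
+the opcode and the mask of active lanes, followed by one fixed-size slot per
+lane. The layout and sizes below are assumptions, not the actual definitions:
+
+.. code-block:: c++
+
+  #include <cstdint>
+
+  constexpr uint64_t LANE_SIZE = 32; // assumed warp / wavefront width
+
+  // Bit i of the mask is set if lane i participated in the call; the opcode
+  // tells the server which operation to perform.
+  struct Header {
+    uint64_t mask;
+    uint32_t opcode;
+  };
+
+  // One slot per lane so every active thread in the group has somewhere to
+  // place its data when the whole warp or wavefront uses the port at once.
+  struct Packet {
+    Header header;
+    uint64_t payload[LANE_SIZE][8];
+  };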
+
+.. image:: ./rpc-diagram.svg
+ :width: 75%
+ :align: center
+
+The above diagram outlines the architecture of the RPC interface. For clarity
+the following list will explain the operations done by the client and server
+respectively when initiating a communication.
+
+First, a communication from the perspective of the client, with a rough sketch
+in code following the list:
+
+* The client searches for an available port and claims the lock.
+* The client checks that the port is still available to the current device and
+ continues if so.
+* The client writes its data to the fixed-size packet and toggles its outbox.
+* The client waits until its inbox matches its outbox.
+* The client reads the data from the fixed-size packet.
+* The client closes the port and continues executing.
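+
+A single-threaded sketch of the client flow above is shown below. The
+``Shared`` layout, the names, and the omission of the device availability
+check are simplifying assumptions rather than the real implementation:
+
+.. code-block:: c++
+
+  #include <atomic>
+  #include <cstdint>
+  #include <cstring>
+
+  namespace sketch {
+
+  constexpr uint32_t NUM_PORTS = 64;
+
+  struct Shared {
+    std::atomic<uint32_t> inbox[NUM_PORTS];  // written by the server
+    std::atomic<uint32_t> outbox[NUM_PORTS]; // written by the client
+    std::atomic<uint32_t> lock[NUM_PORTS];   // client-side lock per port
+    uint64_t buffer[NUM_PORTS][8];           // fixed-size packet per port
+  };
+
+  void client_call(Shared &s, const uint64_t (&request)[8],
+                   uint64_t (&reply)[8]) {
+    // Search for an available port and claim its lock.
+    uint32_t port = 0;
+    for (;;) {
+      uint32_t expected = 0;
+      if (s.lock[port].compare_exchange_strong(expected, 1))
+        break;
+      port = (port + 1) % NUM_PORTS;
+    }
+
+    // Wait until the client owns the buffer (inbox == outbox).
+    uint32_t out = s.outbox[port].load();
+    while (s.inbox[port].load() != out)
+      ;
+
+    // Write the data to the fixed-size packet and toggle the outbox.
+    std::memcpy(s.buffer[port], request, sizeof(request));
+    out ^= 1u;
+    s.outbox[port].store(out);
+
+    // Wait until the inbox matches the outbox, i.e. the server has replied.
+    while (s.inbox[port].load() != out)
+      ;
+
+    // Read the data from the fixed-size packet and close the port.
+    std::memcpy(reply, s.buffer[port], sizeof(reply));
+    s.lock[port].store(0);
+  }
+
+  } // namespace sketch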
+
+Now, the same communication from the perspective of the server, again with a
+sketch in code following the list:
+
+* The server searches for an available port with pending work and claims the
+ lock.
+* The server checks that the port is still available to the current device.
+* The server reads the opcode to perform the expected operation, in this
+ case a receive and then send.
+* The server reads the data from the fixed-size packet.
+* The server writes its data to the fixed-size packet and toggles its outbox.
+* The server closes the port and continues searching for ports that need to be
+ serviced
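+
+A matching sketch of one server pass over the ports, again with an
+illustrative layout and a single hypothetical opcode standing in for the real
+operations:
+
+.. code-block:: c++
+
+  #include <atomic>
+  #include <cstdint>
+
+  namespace sketch_server {
+
+  constexpr uint32_t NUM_PORTS = 64;
+  constexpr uint64_t OPCODE_ECHO = 1; // hypothetical "receive then send" op
+
+  struct Shared {
+    std::atomic<uint32_t> inbox[NUM_PORTS];  // written by the server
+    std::atomic<uint32_t> outbox[NUM_PORTS]; // written by the client
+    uint64_t buffer[NUM_PORTS][8];           // fixed-size packet per port
+  };
+
+  std::atomic<uint32_t> server_lock[NUM_PORTS] = {}; // server-side locks
+
+  // One pass over the ports; returns true if some port was serviced.
+  bool handle_one(Shared &s) {
+    for (uint32_t port = 0; port < NUM_PORTS; ++port) {
+      // Claim the server-side lock before inspecting the port.
+      uint32_t expected = 0;
+      if (!server_lock[port].compare_exchange_strong(expected, 1))
+        continue;
+
+      // The server owns the buffer when the mailboxes differ, meaning the
+      // client toggled its outbox and is waiting for service.
+      uint32_t in = s.outbox[port].load(); // client's outbox is our inbox
+      uint32_t out = s.inbox[port].load(); // our outbox is the client's inbox
+      if (in == out) {
+        server_lock[port].store(0); // nothing pending, keep searching
+        continue;
+      }
+
+      // Read the opcode and data from the fixed-size packet, perform the
+      // operation, and write any reply back into the same packet.
+      if (s.buffer[port][0] == OPCODE_ECHO) {
+        // A receive followed by a send; the echoed reply is already in place.
+      }
+
+      // Toggle our outbox to hand ownership back, then close the port.
+      s.inbox[port].store(out ^ 1u);
+      server_lock[port].store(0);
+      return true;
+    }
+    return false;
+  }
+
+  } // namespace sketch_server
+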
----------------
nickdesaulniers wrote:
add punctuation
https://github.com/llvm/llvm-project/pull/79069
More information about the libc-commits
mailing list