[libc-commits] [libc] [libc][Docs] Update the GPU RPC documentation (PR #79069)

Tue Jan 23 10:30:56 PST 2024

================
@@ -11,10 +11,298 @@ Remote Procedure Calls
 Remote Procedure Call Implementation
 ====================================
 
-Certain features from the standard C library, such as allocation or printing,
-require support from the operating system. We instead implement a remote
-procedure call (RPC) interface to allow submitting work from the GPU to a host
-server that forwards it to the host system.
+Traditionally, the C library abstracts over several functions that interface 
+with the platform's operating system through system calls. The GPU, however,
+does not provide an operating system that can handle target-dependent operations.
+Instead, we implemented remote procedure calls to interface with the host's 
+operating system while executing on a GPU.
+
+We implemented remote procedure calls using unified virtual memory to create a
+shared communication channel between the two processes. This memory is often
+pinned memory that can be accessed asynchronously and atomically by multiple
+processes simultaneously. This means that we can simply provide mutual
+exclusion on a shared buffer to swap work back and forth between the host system
+and the GPU. We can then build a simple client-server protocol on top of this
+shared memory.
+
+This work treats the GPU as a client and the host as a server. The client
+initiates communications while the server listens for them. In order to
+communicate between the host and the device, we simply maintain a buffer of 
+memory and two mailboxes. One mailbox is write-only while the other is 
+read-only. This exposes three primitive operations: using the buffer, giving 
+away ownership, and waiting for ownership. This is implemented as a half-duplex 
+transmission channel between the two sides. We decided to assign ownership of 
+the buffer to the client when the inbox and outbox bits are equal and to the 
+server when they are not.
+
+In order to make this transmission channel thread-safe, we abstract ownership of 
+the given mailbox pair and buffer around a port, effectively acting as a lock 
+and an index into the allocated buffer slice. The server and device have 
+independent locks around the given port. In this scheme, the buffer can be used 
+to communicate intent and data generically with the server. We then simply
+provide multiple copies of this protocol and expose them as multiple ports.
+
+If this were simply a standard CPU system, this would be sufficient. However, 
+GPUs have many unique architectural challenges. First, GPU threads execute in
+lock-step with each other in groups typically called warps or wavefronts. We 
+need to target the smallest unit of independent parallelism, so the RPC 
+interface needs to handle an entire group of threads at once. This is done by 
+increasing the size of the buffer and adding a thread mask argument so the 
+server knows which threads are active when it handles the communication. Second, 
+GPUs generally have no forward progress guarantees. To guarantee that we do
+not encounter deadlocks while executing, the number of ports must match the
+maximum amount of hardware parallelism on the device. It is also
+very important that the thread mask remains consistent while interfacing with 
+the port.
+
+.. image:: ./rpc-diagram.svg
+   :width: 75%
+   :align: center
+
+The above diagram outlines the architecture of the RPC interface. For clarity,
+the following lists explain the operations done by the client and the server,
+respectively, when initiating a communication.
+
+First, a communication from the perspective of the client:
+
+* The client searches for an available port and claims the lock.
+* The client checks that the port is still available to the current device and 
+  continues if so.
+* The client writes its data to the fixed-size packet and toggles its outbox.
+* The client waits until its inbox matches its outbox.
+* The client reads the data from the fixed-size packet.
+* The client closes the port and continues executing.
+
+Now, the same communication from the perspective of the server:
+
+* The server searches for an available port with pending work and claims the 
+  lock.
+* The server checks that the port is still available to the current device.
+* The server reads the opcode to perform the expected operation, in this 
+  case a receive and then send.
+* The server reads the data from the fixed-size packet.
+* The server writes its data to the fixed-size packet and toggles its outbox.
+* The server closes the port and continues searching for ports that need to be 
+  serviced
----------------
nickdesaulniers wrote:

add punctuation

https://github.com/llvm/llvm-project/pull/79069
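
For readers following the quoted hunk, the mailbox ownership rule and the client/server steps it describes can be sketched roughly as below. This is a minimal single-threaded model, not the libc implementation: `Port`, `client_send`, `server_service`, `client_recv`, and the "add one" operation are hypothetical names invented for illustration. The port lock, the opcode, and the thread mask are deliberately left out.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical model of a single port: one packet plus two one-bit mailboxes.
// Each side writes only its own outbox and reads the other side's outbox as
// its inbox.
struct Port {
  std::atomic<uint32_t> client_outbox{0}; // doubles as the server's inbox
  std::atomic<uint32_t> server_outbox{0}; // doubles as the client's inbox
  uint64_t packet = 0;                    // stand-in for the fixed-size packet
};

// The client owns the buffer when the two mailbox bits are equal.
bool client_owns(const Port &p) {
  return p.client_outbox.load(std::memory_order_acquire) ==
         p.server_outbox.load(std::memory_order_acquire);
}

// The server owns the buffer when the bits differ.
bool server_owns(const Port &p) { return !client_owns(p); }

// Client: write the request, then toggle the outbox to give away ownership.
bool client_send(Port &p, uint64_t request) {
  if (!client_owns(p))
    return false;
  p.packet = request;
  p.client_outbox.fetch_xor(1u, std::memory_order_release);
  return true;
}

// Server: if work is pending, read the packet, write a reply, and toggle the
// outbox. The "add one" operation is purely illustrative.
bool server_service(Port &p) {
  if (!server_owns(p))
    return false;
  p.packet += 1;
  p.server_outbox.fetch_xor(1u, std::memory_order_release);
  return true;
}

// Client: the reply is readable once the inbox matches the outbox again.
bool client_recv(const Port &p, uint64_t &reply) {
  if (!client_owns(p))
    return false;
  reply = p.packet;
  return true;
}
```

In this model a freshly constructed `Port` starts out owned by the client, `client_send` hands it to the server, and `server_service` hands it back, which is the half-duplex exchange the text describes. A real client would spin on the inbox rather than return `false`.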