[libc-commits] [PATCH] D145913: [libc] Add initial support for an RPC mechanism for the GPU

Joseph Huber via Phabricator via libc-commits libc-commits at lists.llvm.org
Sat Apr 1 05:25:56 PDT 2023


jhuber6 added inline comments.


================
Comment at: libc/utils/gpu/loader/amdgpu/Loader.cpp:314
+  void *buffer;
+  if (hsa_status_t err = hsa_amd_memory_pool_allocate(
+          finegrained_pool, sizeof(__llvm_libc::cpp::Atomic<int>),
----------------
gchatelet wrote:
> jhuber6 wrote:
> > gchatelet wrote:
> > > Are there any guarantees that this piece of memory has the same alignment as `__llvm_libc::cpp::Atomic<int>`?
> > > The same applies to the CUDA loader.
> > Fine-grained memory in this context is always page-aligned as far as I know, so it will usually land on a 4096-byte boundary.
> Does that mean that you're reserving one page for just a few bytes? Maybe it would make more sense to reserve a larger chunk of memory and place the objects ourselves (not sure what this implies for the GPU).
> 
> Regarding the rpc mechanism, should the two atomics share the same cache line or would it make sense to have them in separate cache lines to prevent false sharing? It's probably OK to have them on the same cache line if there is only one client and one server.
> 
> Performance-wise, it seems important that the atomics don't cross cache-line boundaries, though. And since placement is important (`alignof(cpp::Atomic<T>)` should be honored for proper codegen), maybe the server should just allocate a sufficiently large chunk of memory and let a common function do the actual placement. This would prevent duplicate logic in each server (loader).
> Does that mean that you're reserving one page for just a few bytes? Maybe it would make more sense to reserve a larger chunk of memory and place the objects ourselves (not sure what this implies for the GPU).
>

Yes, this is really wasteful, but I'm ignoring it for now since the resource usage of these tests is quite low. This will change when we need to support multiple client calls from GPU threads; to meet hardware requirements, the buffer will probably end up around 2 MB.
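
For illustration, here is a rough sketch of what that larger allocation could look like on the AMDGPU side. The `Packet` layout and slot count are placeholders, and `finegrained_pool` / `handle_error` are assumed to be the ones already in `Loader.cpp`:

```cpp
#include "src/__support/CPP/atomic.h"

#include <cstdint>
#include <hsa/hsa_ext_amd.h>

void handle_error(hsa_status_t err); // as in Loader.cpp

// Placeholder layout for one RPC slot; the real struct is still TBD.
struct alignas(64) Packet {
  __llvm_libc::cpp::Atomic<uint32_t> status;
  uint8_t payload[60];
};

// One fine-grained allocation sized for many slots instead of a whole
// page per atomic.
void *allocate_rpc_buffer(hsa_amd_memory_pool_t finegrained_pool) {
  constexpr uint64_t NUM_SLOTS = 512;
  void *buffer;
  if (hsa_status_t err = hsa_amd_memory_pool_allocate(
          finegrained_pool, NUM_SLOTS * sizeof(Packet), /*flags=*/0, &buffer))
    handle_error(err);
  return buffer;
}
```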

> Regarding the rpc mechanism, should the two atomics share the same cache line or would it make sense to have them in separate cache lines to prevent false sharing? It's probably OK to have them on the same cache line if there is only one client and one server.
> 

I'm probably going to merge these into a single atomic that we use as a bit-field, mostly because the underlying RPC state machine will need to support more than the current four states (client send, server reply, client use, server clean) in the future. I'm actually not entirely sure what false sharing would mean in this context: this memory is basically a whole page that's shared with the GPU via the PCIe bus, and I'm not privy enough to the internals to know whether that can cause issues.
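
Something along these lines is what I have in mind. The bit encoding below is hypothetical, and `std::atomic` just stands in for `cpp::Atomic`:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical encoding: the whole state machine lives in one 32-bit word,
// so adding states later doesn't add more atomics.
enum : uint32_t {
  CLIENT_SEND = 1u << 0,
  SERVER_REPLY = 1u << 1,
  CLIENT_USE = 1u << 2,
  SERVER_CLEAN = 1u << 3,
};

std::atomic<uint32_t> state{0};

// Each transition is a single read-modify-write on the shared word; the
// caller spins until the peer has published the expected state.
void advance(uint32_t from, uint32_t to) {
  uint32_t expected = from;
  while (!state.compare_exchange_weak(expected, to,
                                      std::memory_order_acq_rel,
                                      std::memory_order_acquire))
    expected = from;
}
```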

> Performance-wise, it seems important that the atomics don't cross cache-line boundaries, though. And since placement is important (alignof(cpp::Atomic<T>) should be honored for proper codegen), maybe the server should just allocate a sufficiently large chunk of memory and let a common function do the actual placement. This would prevent duplicate logic in each server (loader).

We should definitely put `alignas` on the struct. I was already thinking about doing what you describe: having a single buffer also makes it much easier to write to the GPU, and I'm planning on making this an array of some struct type, along the lines of the sketch below.
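
Concretely, something like this shared helper, where each loader just hands over one sufficiently large, page-aligned chunk (all names here are placeholders):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Common placement logic shared by the AMDGPU and CUDA loaders: given one
// big allocation, construct the packet array in place so the alignment
// handling lives in a single spot.
template <typename Packet>
Packet *place_packets(void *memory, size_t count) {
  // A page-aligned allocation satisfies any reasonable alignas(Packet).
  assert(reinterpret_cast<uintptr_t>(memory) % alignof(Packet) == 0);
  Packet *packets = static_cast<Packet *>(memory);
  for (size_t i = 0; i < count; ++i)
    new (&packets[i]) Packet{};
  return packets;
}
```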



Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D145913/new/

https://reviews.llvm.org/D145913


