[libc-commits] [PATCH] D145913: [libc] Add initial support for an RPC mechanism for the GPU

Joseph Huber via Phabricator via libc-commits libc-commits at lists.llvm.org
Fri Mar 31 07:20:42 PDT 2023


jhuber6 added a comment.

In D145913#4236641 <https://reviews.llvm.org/D145913#4236641>, @gchatelet wrote:

> I stumbled upon this code while looking at `write_to_stderr` and I think that the `PRINT_TO_STDERR` opcode is buggy (see comments). Also since we're exchanging messages of 64B, why waste 8B on the Opcode? Do we need such a huge opcode space ?
> I'm working on a patch to try to fix these issues, I'll ping back when it's ready for discussion.

This code is heavily WIP. Almost all of this is going to change at some point in the future. This exists mainly to let us stand up unit tests. The reason for the opcode's size right now was mainly just alignment. We want the actual arguments to at least have 8B alignment. I'm going to change this in the future. Something like this, but I'm still thinking about it.

  struct Port {
    cpp::Atomic<uint32_t> flags;
    uint32_t opcode;
    uint64_t activemask;
  }



================
Comment at: libc/src/__support/OSUtil/gpu/io.cpp:24
+          inline_memcpy(reinterpret_cast<char *>(&buffer->data[1]), &msg[i],
+                        (length > buffer_len ? buffer_len : length));
+        },
----------------
gchatelet wrote:
> The number of bytes copied at each round are either `buffer_len` (i.e. 56B) or the **total** length of the string, it should be the remainder.
Yeah, don't know why this one works. Probably more accidental behavior given that the memory allocation is a lot larger than the size of the buffer I'm using.


================
Comment at: libc/utils/gpu/loader/amdgpu/Loader.cpp:50
+        case __llvm_libc::rpc::Opcode::PRINT_TO_STDERR: {
+          fputs(reinterpret_cast<const char *>(&buffer->data[1]), stderr);
+          break;
----------------
gchatelet wrote:
> If the message's length is greater than 56B, `&buffer->data[1]` contains only a fragment of the message and is not null-terminated, leading `fputs` to read past it's allocated buffer.
You're right. This worked in practice so I never noticed. Most likely because allocating fine-grained memory most likely gives you a full page at a minimum. I never write to the rest of that data and it's implicitly initialized to zero.


================
Comment at: libc/utils/gpu/loader/amdgpu/Loader.cpp:314
+  void *buffer;
+  if (hsa_status_t err = hsa_amd_memory_pool_allocate(
+          finegrained_pool, sizeof(__llvm_libc::cpp::Atomic<int>),
----------------
gchatelet wrote:
> Are there any guarantees that this piece of memory has the same alignment that `__llvm_libc::cpp::Atomic<int>`.
> Same for the Cuda loader.
Fine-grained memory in this context is always going to be aligned to a page as far as I know. So that'll usually be aligned on a 4096 byte boundary.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D145913/new/

https://reviews.llvm.org/D145913



More information about the libc-commits mailing list