[libc-commits] [PATCH] D139839: [libc] Add a loader utility for AMDHSA architectures for testing

Mon Dec 12 06:52:04 PST 2022

jhuber6 added a comment.

In D139839#3988481 <https://reviews.llvm.org/D139839#3988481>, @JonChesterfield wrote:

> The file called start is traditionally called crt for c runtime.  Making to main across linked elf files probably works but it's a bit sketchy, would be inclined to compile it to IR instead.

Wasn't sure if I should exactly copy the `crt` nomenclature, but it might warrant it since it's the same concept. And isn't this linking the same as the handling for the C runtime? Only the "start" code is compiled freestanding so we should be treating the rest of the compilation like a standard build. Also for AMD it is IR.

> The HSA boilerplate looks ok to me. A lot is common to the openmp plugin, if there's somewhere we can put a HSA.hpp that is reachable from libc and openmp I think that would be a win. Doesn't need associated object code, just the template wrappers.

It's possible we could create some "GPUUtils" or something in LLVM, but otherwise I'm not sure if there would be a good place for this to live.

================
Comment at: libc/utils/DeviceLoader/Main.cpp:26
+  // TODO: We should perform some validation on the file.
+  FILE *file = fopen(argv[1], "r");
+
----------------
JonChesterfield wrote:
> If mmap is available it's probably a clearer than fopen here, but using only C is also appealing.
Alternatively we could link against the LLVM libraries and use their methods which get reduced to `mmap`, I could probably make the error handling cleaner with access to LLVM as well.

================
Comment at: libc/utils/DeviceLoader/amdgpu/CMakeLists.txt:37
+  -nostdinc
+  -mcpu=${AMDGPU_ARCH}
+)
----------------
JonChesterfield wrote:
> I think this needs to be emit-llvm for amdgpu
`-c` emits bitcode for `amdgcn-amd-amdhsa` since we consider that the relocatable object type.

================
Comment at: libc/utils/DeviceLoader/amdgpu/Loader.cpp:58
+
+/// Generic interface for iterating using the HSA callbacks.
+template <typename elem_ty, typename func_ty, typename func_arg_ty,
----------------
JonChesterfield wrote:
> This is all pretty familiar. Do we have anywhere reasonable to drop a HSA.hpp header that puts these interfaces over the C?
Making a common header between projects is a little tough, since it would need to live in `LLVM` and this probably doesn't quality for mainline LLVM. Maybe we could encourage the HSA people to provide some C++ bindings.

================
Comment at: libc/utils/DeviceLoader/amdgpu/Loader.cpp:239
+  for (auto &[info, value] : symbol_infos)
+    if (hsa_status_t err = hsa_executable_symbol_get_info(symbol, info, value))
+      handle_error(err);
----------------
JonChesterfield wrote:
> This doesn't work for some of the HSA calls. Group segment size or kernarg segment size maybe? One of them unconditionally returns zero (the openmp plugin reads it via msgpack because of this). Or at least, it used to return zero, I haven't checked recently.
I remember there was some weird behaviour with `KERNARG_SEGMENT_SIZE` not including the implicit arguments. But when I checked the ELF notes on all the binaries created the size was correct so I'm assuming it will be fine since we're always compiling with upstream clang here. I didn't see any other problems with the group size, but I could double check.

================
Comment at: libc/utils/DeviceLoader/amdgpu/Loader.cpp:291
+
+  // Allocate memory for the pointer array on the device and copy it from the
+  // host to the device.
----------------
JonChesterfield wrote:
> Could probably leave argv in fine grain memory instead of setting it up on the GPU, would make this somewhat simpler.
Can the GPU access fine-grained memory directly? I thought this always required some DMA transfer from the pinned buffer. But setting up the `argv` data is definitely the most annoying part.

================
Comment at: libc/utils/DeviceLoader/amdgpu/Start.cpp:13
+_start(int argc, char **argv, int *ret) {
+  __atomic_fetch_or(ret, main(argc, argv), __ATOMIC_RELAXED);
+}
----------------
JonChesterfield wrote:
> Oh, ret has to be on the GPU if you want to fetch_or into it
It should be, in the `Loader.cpp` I allocate some coarse-grained memory for it, set it to zero, and then copy it back.

Also I don't bother freeing any of the memory, not sure how good of a practice that is but I'm mostly just testing stuff right now.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D139839/new/

https://reviews.llvm.org/D139839