[clang] [libc] [llvm] [AMDGPU] Implement variadic functions by IR lowering (PR #93362)

Mon Jun 3 18:36:11 PDT 2024

JonChesterfield wrote:

I wish to ship this as-is. This patch is carefully constructed to be zero risk and landing it significantly improves the chances of zero-overhead varargs shipping. This revision passes the amdgpu libc tests. The tests are fragile to seemingly trivial changes to the IR generated which is obstructive to emitting nicer IR and some changes to the target abstraction I wish to make.

I have approximately isolated the libc / amdgpu memory error. I believe the amdgpu backend miscompiles constructs involving addrspacecast and passing structs to functions in a fashion which interacts badly with the lowering this pass chooses to do. That would explain why the transform has been easy and robust when implemented on x86, x64, nvptx, aarch64 and webassembly - and very prone to confusing memory errors and strange action at a distance on amdgpu. I note that this pass was originally made target-independent because I failed to get it running robustly on amdgpu alone.

Various previous revisions of this pass have passed the amdgpu libc tests and other similar revisions failed, I'd guess it's about 1/3 are "good" and 2/3 are "bad". I want to land this now, on a "good" revision, so we have something in tree that I can iterate on and feed into the ROCm dev branches, where I may have other colleagues or a working debugger available.

The pass core is straightforward and likely to be functionally correct. It isn't in the default pass pipeline so the only non-test time it'll run to begin with is for amdgpu. The lowering is exactly what webassembly does already and very close to nvptx. I'll implement x64 and aarch64 in parallel with chasing amdgpu oddities, both are mechanical exercises in case analysis and test coverage. I believe wasm needs no further changes.

Various things are suboptimal on amdgpu. Passing some things byref is better than the current everything inline layout. The extra parameter should be in (5), not (0). Codegen is noisier than it should be. Most permutations to the current scheme win "memory error" from HSA. I think these should be part of ongoing work after this lands and not a prereq.

https://github.com/llvm/llvm-project/pull/93362