[libc-commits] [libc] [libc] Update the AMDGPU implementation to use code object 5 (PR #72580)

Thu Nov 16 17:13:08 PST 2023

================
@@ -222,17 +250,15 @@ hsa_status_t launch_kernel(hsa_agent_t dev_agent, hsa_executable_t executable,
   // Set up the packet for execution on the device. We currently only launch
   // with one thread on the device, forcing the rest of the wavefront to be
   // masked off.
-  uint16_t setup = (1 + (params.num_blocks_y * params.num_threads_y != 1) +
-                    (params.num_blocks_z * params.num_threads_z != 1))
-                   << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
+  uint16_t setup = (dims) << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
   packet->workgroup_size_x = params.num_threads_x;
   packet->workgroup_size_y = params.num_threads_y;
   packet->workgroup_size_z = params.num_threads_z;
   packet->reserved0 = 0;
   packet->grid_size_x = params.num_blocks_x * params.num_threads_x;
   packet->grid_size_y = params.num_blocks_y * params.num_threads_y;
   packet->grid_size_z = params.num_blocks_z * params.num_threads_z;
-  packet->private_segment_size = private_size;
+  packet->private_segment_size = !dynamic_stack ? private_size : 16834;
----------------
arsenm wrote:

This is backwards: the "assume dynamic access is 16k" thing was a backend hack that doesn't require runtime replication. It's also unnecessary in v5 anyway, so just use the raw size?

Also, we really need some unit tests that stress dynamic alloca somewhere.
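
A minimal sketch of what "just use the raw size" could look like, kept free of the HSA headers so it compiles on its own. The `LaunchParams` struct and both helper names are hypothetical stand-ins, not the loader's actual code; `grid_dims` simply reproduces the expression removed in the hunk above, since the patch's own `dims` computation is not shown here.

```cpp
#include <cstdint>

// Hypothetical stand-in for the loader's launch parameters; only the field
// names are taken from the quoted diff.
struct LaunchParams {
  uint32_t num_threads_x, num_threads_y, num_threads_z;
  uint32_t num_blocks_x, num_blocks_y, num_blocks_z;
};

// Reproduces the expression removed in the diff: start at one dimension and
// add one for each of y and z whose grid extent is not 1.
constexpr uint16_t grid_dims(const LaunchParams &p) {
  return static_cast<uint16_t>(
      1 + (p.num_blocks_y * p.num_threads_y != 1) +
      (p.num_blocks_z * p.num_threads_z != 1));
}

// Sketch of the suggestion: forward the size reported for the kernel as-is,
// with no fixed 16k substitution when a dynamic stack is reported.
constexpr uint32_t private_segment_size(uint32_t private_size,
                                        bool /*dynamic_stack*/) {
  return private_size;
}
```

With helpers like these, the packet setup would assign `private_segment_size(private_size, dynamic_stack)` directly, so the value no longer depends on the dynamic-stack flag.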

https://github.com/llvm/llvm-project/pull/72580