[clang] [OpenMP][Clang] Force use of `num_teams` and `thread_limit` for bare kernel (PR #68373)

Fri Oct 6 12:07:33 PDT 2023

shiltian wrote:

> I think the follow up, to force the user bound for bare kernels, make sense. I am not sold on this patch though. Why would we disallow users to do the same looping we do in the deviceRTL while hoping the offload runtime will pick a good grid size?

Because we don't have a loop trip count in this case, the runtime just picks the numbers itself: 3200 thread blocks and 128 threads per thread block, IIRC. I'm not sure that can be called a "good" grid size, and we don't have any heuristic w/o a loop trip count anyway.

Typically, when writing a CUDA/HIP kernel, users calculate the grid/block size manually and launch the kernel using those sizes. That is the main reason for this patch. It also makes the runtime's decision much easier: if we can't meet the user's requirement, we crash.
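To illustrate the manual grid/block calculation mentioned above, here is a minimal sketch in plain C++. The names `grid_size_for`, `LaunchConfig`, and `make_launch_config` are hypothetical and not part of the patch; the sketch only shows the usual ceiling division that CUDA/HIP users do by hand before a launch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the patch): number of thread blocks needed
// so that grid * block_size threads cover n elements, i.e. ceil(n / block_size).
inline std::uint32_t grid_size_for(std::uint64_t n, std::uint32_t block_size) {
  assert(block_size > 0 && "block size must be positive");
  return static_cast<std::uint32_t>((n + block_size - 1) / block_size);
}

// Hypothetical pair of values a user would compute before launching a kernel.
struct LaunchConfig {
  std::uint32_t num_teams;    // grid size (number of thread blocks)
  std::uint32_t thread_limit; // block size (threads per block)
};

inline LaunchConfig make_launch_config(std::uint64_t n, std::uint32_t block_size) {
  return LaunchConfig{grid_size_for(n, block_size), block_size};
}
```

With a bare kernel, the two computed values would presumably be passed as the `num_teams` and `thread_limit` arguments on the target construct, rather than leaving the choice to the runtime.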

https://github.com/llvm/llvm-project/pull/68373