[PATCH] D86376: [HIP] Improve kernel launching latency

Artem Belevich via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Mon Aug 24 11:08:43 PDT 2020


tra added a comment.

How much does this inlining buy you in practice? I.e. what's a typical launch latency before/after the patch? For CUDA, config push/pop is negligible compared to the cost of actually launching the kernel on the GPU. It is measurable if the launch is asynchronous, but queueing kernels fast, does not help all that much in the long run -- you eventually have to run those kernels on the GPU, so in most cases you're just spend a bit more time idling while waiting for the queued kernels to finish. To be beneficial, you'll need a finely balanced CPU/GPU workload and that's rather hard to achieve. Not to the point where the minor savings here would be meaningful. I would assume the situation on AMD GPUs is not that different.

One side effect of this patch is that there will be no convenient way to set host-side breakpoint on kernel launch.
Another will be that examining call stack will become somewhat confusing as the arguments passed to the kernel as written in the source code will not match those observed in the stack trace. I guess preserving the appearance of normal function calls was the reason for the split  config setup/kernel launch in CUDA.  I'd say it's still useful to have as CUDA-specific debugger is not always available and one must use regular gdb on CUDA apps now and then.

If the patch does give measurable performance improvement, can we implement launch config push/pop in a way that compiler can eliminate by itself when it's possible and keep the stub as the host-side kernel entry point? I would prefer to avoid sacrificing debugging usability for performance optimizations that may not matter.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D86376/new/

https://reviews.llvm.org/D86376



More information about the cfe-commits mailing list