[PATCH] D86376: [HIP] Improve kernel launching latency

Artem Belevich via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Mon Aug 24 15:49:19 PDT 2020


tra added a comment.

In D86376#2234719 <https://reviews.llvm.org/D86376#2234719>, @yaxunl wrote:

>> This patch appears to be somewhere in the gray area to me. My prior experience with CUDA suggests that it will make little to no difference. On the other hand, AMD GPUs may be different enough to prove me wrong. Without specific evidence, I still can't tell what's the case here.
>
> Sorry, the overhead due to `__hipPushConfigure/__hipPopConfigure` is about 60 us. The typical kernel launching latency is about 500us, therefore the improvement is around 10%.

60 *microseconds* to store/load something from memory? That does not sound right. 0.5 ms per kernel launch is also suspiciously high.
For CUDA it's ~5us (https://www.hpcs.cs.tsukuba.ac.jp/icpp2019/data/posters/Poster17-abst.pdf). If it does indeed take 60 microseconds to push/pop an O(cacheline) worth of launch config data, the implementation may be doing something wrong: 60 microseconds is enough time for O(100) syscalls, which is way too much work for something that simple. What do those calls do?

Can you confirm that the units are indeed microseconds and not nanoseconds?
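A quick host-side micro-benchmark would help narrow down where the time goes. Here's a rough sketch (not part of the patch; the kernel, iteration count, and launch dimensions are arbitrary): it compares per-launch latency of the `<<<>>>` path, which goes through the config push/pop and the host-side stub, against a direct `hipLaunchKernel()` call on the same kernel, so the difference between the two numbers approximates the overhead in question.

```cpp
#include <hip/hip_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void empty_kernel() {}

// Time `iters` complete launches (enqueue + wait) and return the average in us.
static double time_us(void (*launch)(), int iters) {
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) {
    launch();
    hipDeviceSynchronize();  // wait for completion so we time the full launch
  }
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
}

int main() {
  constexpr int kIters = 1000;
  hipFree(nullptr);  // initialize the runtime outside the timed regions

  // Path 1: triple-chevron launch, i.e. config push/pop + host-side stub.
  double chevron = time_us([] { empty_kernel<<<dim3(1), dim3(1)>>>(); }, kIters);

  // Path 2: direct hipLaunchKernel() on the same kernel, bypassing the stub.
  double direct = time_us([] {
    void** args = nullptr;  // empty_kernel takes no parameters
    hipLaunchKernel(reinterpret_cast<const void*>(empty_kernel), dim3(1),
                    dim3(1), args, 0, /*stream=*/nullptr);
  }, kIters);

  printf("<<<>>> launch:   %.2f us\n", chevron);
  printf("hipLaunchKernel: %.2f us\n", direct);
  printf("difference:      %.2f us\n", chevron - direct);
  return 0;
}
```

If the two numbers differ by anywhere near 60us, that would indeed point at the push/pop + stub path; if they are close, the 60us is coming from somewhere else.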

> To run HIP applications, users need to install ROCm, which includes rocgdb.

I would disagree with that assertion. I very much want to be able to build a TensorFlow-based app, run it in a container with nothing but the app in it, and use existing infrastructure to capture relevant info if the app crashes. Such a capture will not be using any HIP-specific tools.
Or I could give it to a user who absolutely does not care what's inside the executable, but who may want to run it under gdb if something goes wrong.

> A debugger without device code debugging capability has little use with HIP applications, therefore I would expect users to always use rocgdb to debug HIP programs.

I agree that that's indeed the case if someone wants or needs to debug GPU code; however, in many cases it's sufficient to debug host-side things only. Even then it is useful to see the point where we launch kernels and to be able to tell which kernel it was.

> Also, since clang already removed all debug information for the kernel stub, gdb cannot break on the kernel stub anyway.

gdb is aware of the ELF symbols, and those are often exposed in shared libraries. While you will not have type info, etc., you can still set a breakpoint and get a sensible stack trace in many cases. We usually build with some amount of debug info, and it has proved rather helpful for pinpointing GPU failures via the host-side stack trace, because that trace includes the symbol name of the host-side stub, which identifies the device-side kernel. If all we see in the stack trace is `hipLaunchKernel`, it would be considerably less helpful, especially when there's no detailed debug info that would let us dig the kernel name out of its arguments. All we'd know is that we've launched *some* kernel.
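To make that concrete, consider a toy example (the kernels and names below are made up, and the exact stub symbol naming is a compiler detail):

```cpp
#include <hip/hip_runtime.h>

__global__ void scale(float* data, float factor, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

__global__ void offset(float* data, float delta, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += delta;
}

void run_step(float* d, int n) {
  // With host-side stubs, each of these launches goes through a distinct host
  // symbol (something like __device_stub__scale / __device_stub__offset; the
  // exact names depend on the compiler), so `bt` in a host-only gdb at the
  // point of failure already tells you which kernel was being launched.
  scale<<<dim3((n + 255) / 256), dim3(256)>>>(d, 2.0f, n);
  offset<<<dim3((n + 255) / 256), dim3(256)>>>(d, 1.0f, n);

  // If both launches instead funnel straight into hipLaunchKernel(), the host
  // backtrace for either one ends at hipLaunchKernel, and recovering the
  // kernel identity requires enough debug info to inspect its arguments.
  hipDeviceSynchronize();
}

int main() {
  const int n = 1 << 20;
  float* d = nullptr;
  hipMalloc(reinterpret_cast<void**>(&d), n * sizeof(float));
  run_step(d, n);
  hipFree(d);
  return 0;
}
```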

>> Is eliminating the host-side stub the goal, or just a coincidental side-effect? I.e. if it's something you *need* to do, then the discussion about minor performance gain becomes rather irrelevant and we should weigh 'improvements in HIP debugging' vs 'regression in host-only debugging' instead.
>
> I would like to say the motivation of this change is twofold: 1. improve latency; 2. interoperability with C++ programs.

Could you elaborate on the "interoperability with C++ programs"? I don't think I see how this patch helps with that. Or what exactly is the issue with C++ interoperability we have now?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D86376/new/

https://reviews.llvm.org/D86376


