[PATCH] D86376: [HIP] Improve kernel launching latency

Tue Aug 25 09:16:24 PDT 2020

yaxunl added a comment.

In D86376#2234824 <https://reviews.llvm.org/D86376#2234824>, @tra wrote:

> In D86376#2234719 <https://reviews.llvm.org/D86376#2234719>, @yaxunl wrote:
>
>>> This patch appears to be somewhere in the gray area to me. My prior experience with CUDA suggests that it will make little to no difference. On the other hand, AMD GPUs may be different enough to prove me wrong. Without specific evidence, I still can't tell what's the case here.
>>
>> Sorry, the overhead due to `__hipPushConfigure/__hipPopConfigure` is about 60 us. The typical kernel launching latency is about 500us, therefore the improvement is around 10%.
>
> 60 *micro seconds* to store/load something from memory? It does not sound right. 0.5 millisecond per kernel launch is also suspiciously high. 
> For CUDA it's ~5us (https://www.hpcs.cs.tsukuba.ac.jp/icpp2019/data/posters/Poster17-abst.pdf). If it does indeed take 60 microseconds to push/pop a O(cacheline) worth of launch config data, the implementation may be doing something wrong. We're talking about O(100) syscalls and that's way too much work for something that simple. What do those calls do?
>
> Can you confirm that the units are indeed microseconds and not nanoseconds?

My previous measurements did not warming up, which caused some one time overhead due to device initialization and loading of device binary. With warm up, the call of `__hipPushCallConfigure/__hipPopCallConfigure` takes about 19 us. Based on the trace from rocprofile, the time spent inside these functions can be ignored. Most of the time is spent making the calls. These functions stay in a shared library, which may be the reason why they take such long time. Making them always_inline may get rid of the overhead, however, that would require exposing internal data structures.

The kernel launching latency are measured by a simple loop in which a simple kernel is launched then hipStreamSynchronize is called. trace is collected by rocprofiler and the latency is measured from the end of hipStreamSynchronize to the real start of kernel execution. Without this patch, the latency is about 77 us. With this patch, the latency is about 46 us. The improvement is about 40%. The decrement of 31 us is more than 19 us since it also eliminates the overhead of kernel stub.

>> I would like to say the motivation of this change is two folds: 1. improve latency 2. interoperability with C++ programs.
>
> Could you elaborate on the "interoperability with C++ programs"? I don't think I see how this patch helps with that. Or what exactly is the issue with C++ interoperability we have now?

In HIP program, a global symbol is generated in host binary to identify each kernel. This symbol is associated with the device kernel by a call of hipRegisterFunction in init functions. Each time the kernel needs to be called, the associated symbol is passed to hipLaunchKernel. In host code, this symbol represents the kernel. Let's call it the kernel symbol. Currently it is the kernel stub function, however, it could be any global symbol, as long as it is registered with hipRegisterFunction, then hipLaunchKernel can use it to find the right kernel and launch it.

In a C/C++ program, a kernel is launched by call of hipLaunchKernel with the kernel symbol. Since the kernel symbol is defined in object files generated from HIP. For C/C++ program, as long as it declares the kernel symbol as an external function or variable which matches the name of the original symbol, the linker will resolve to the correct kernel symbol, then the correct kernel can be launched.

Here comes the nuance with kernel stub function as the kernel symbol. If you still remember, there was a previous patch for HIP to change the kernel stub name. rocgdb requires the device stub to have a different name than the real kernel, since otherwise it will not be able to break on the real kernel only. As a result, the kernel stub now has a prefix `__device_stub_` before mangling.

For example, a kernel `foo` will have a kernel stub with name `__device_stub_foo`.

For a C/C++ program to call kernel `foo`, it needs to declare an external symbol `__device_stub_foo` then launch it. Of course this is an annoyance for C/C++ users, especially this involves mangled names.

However, we cannot change the name of the kernel stub to be the same as the kernel, since that will break rocgdb.

Now the solution is to get rid of the kernel stub function. Instead of use kernel stub function as kernel symbol, we will emit a global variable as kernel symbol. This global variable can have the same name as the kernel, since rocgdb will not break on it.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D86376/new/

https://reviews.llvm.org/D86376