[PATCH] D86376: [HIP] Improve kernel launching latency

Tue Aug 25 10:35:05 PDT 2020

tra added a comment.

In D86376#2236501 <https://reviews.llvm.org/D86376#2236501>, @yaxunl wrote:

> My previous measurements did not warming up, which caused some one time overhead due to device initialization and loading of device binary. With warm up, the call of `__hipPushCallConfigure/__hipPopCallConfigure` takes about 19 us. Based on the trace from rocprofile, the time spent inside these functions can be ignored. Most of the time is spent making the calls. These functions stay in a shared library, which may be the reason why they take such long time. Making them always_inline may get rid of the overhead, however, that would require exposing internal data structures.

It's still suspiciously high. AFAICT, config/push/pull is just an std::vector push/pop. It should not take *that* long.  Few function calls should not lead to microseconds of overhead, once linker has resolved the symbol, if they come from a shared library.
https://github.com/ROCm-Developer-Tools/HIP/blob/master/vdi/hip_platform.cpp#L590

I wonder if it's the logging facilities that add all this overhead.

> The kernel launching latency are measured by a simple loop in which a simple kernel is launched then hipStreamSynchronize is called. trace is collected by rocprofiler and the latency is measured from the end of hipStreamSynchronize to the real start of kernel execution. Without this patch, the latency is about 77 us. With this patch, the latency is about 46 us. The improvement is about 40%. The decrement of 31 us is more than 19 us since it also eliminates the overhead of kernel stub.

This is rather surprising. A function call by itself does *not* have such high overhead. There must be something else. I strongly suspect logging. If you remove logging statements from push/pop without changing anything else, how does that affect performance?

>>> I would like to say the motivation of this change is two folds: 1. improve latency 2. interoperability with C++ programs.
>>
>> Could you elaborate on the "interoperability with C++ programs"? I don't think I see how this patch helps with that. Or what exactly is the issue with C++ interoperability we have now?
>
> In HIP program, a global symbol is generated in host binary to identify each kernel. This symbol is associated with the device kernel by a call of hipRegisterFunction in init functions. Each time the kernel needs to be called, the associated symbol is passed to hipLaunchKernel. In host code, this symbol represents the kernel. Let's call it the kernel symbol. Currently it is the kernel stub function, however, it could be any global symbol, as long as it is registered with hipRegisterFunction, then hipLaunchKernel can use it to find the right kernel and launch it.

So far so good, it matches the way CUDA does that.

> In a C/C++ program, a kernel is launched by call of hipLaunchKernel with the kernel symbol.

Do you mean the host-side symbol, registered with the runtime that you've described above? Or do you mean that the device-side symbol is somehow visible from the host side. I think that's where HIP is different from CUDA.

> Since the kernel symbol is defined in object files generated from HIP. 
> For C/C++ program, as long as it declares the kernel symbol as an external function or variable which matches the name of the original symbol, the linker will resolve to the correct kernel symbol, then the correct kernel can be launched.

The first sentence looks incomplete. It seems to imply that hipLaunchKernel uses the device-side kernel symbol and it's the linker which ties host-side reference with device-side symbol. If that's the case, then I don't understand what purpose is served by hipRegisterFunction. AFAICT, it's not used in this scenario at all.

My mental model of kernel launch mechanics looks like this:

- For a kernel foo, there is a host-side symbol (it's the stub for CUDA) with the name 'foo' and device-side real kernel 'foo'.
- host side linker has no access to device-side symbols, but we do need to associate host and device side 'foo' instances.
- address of host-side foo is registered with runtime to map it to device symbol with the name 'foo'
- when a kernel is launched, call site sets up launch config and calls the stub, passing it the kernel arguments.
- the stub calls the kernel launch function, and passes host-side foo address to the kernel launch function
- launch function finds device-side symbol name via the registration info and does device-side address lookup to obtain it's device address
- run device-side function.

In this scenario, the host-side stub for foo is a regular function, which gdb can stop on and examine kernel arguments.

How is the process different for HIP? I know that we've changed the stub name to avoid debugger confusion about which if the entities corresponds to 'foo'.

> Here comes the nuance with kernel stub function as the kernel symbol. If you still remember, there was a previous patch for HIP to change the kernel stub name. rocgdb requires the device stub to have a different name than the real kernel, since otherwise it will not be able to break on the real kernel only. As a result, the kernel stub now has a prefix `__device_stub_` before mangling.
>
> For example, a kernel `foo` will have a kernel stub with name `__device_stub_foo`.
>
> For a C/C++ program to call kernel `foo`, it needs to declare an external symbol `__device_stub_foo` then launch it. Of course this is an annoyance for C/C++ users, especially this involves mangled names.

It's all done by compiler under the hood. I'm not sure how the stub name affects C/C++ users.

> However, we cannot change the name of the kernel stub to be the same as the kernel, since that will break rocgdb.
> Now the solution is to get rid of the kernel stub function. Instead of use kernel stub function as kernel symbol, we will emit a global variable as kernel symbol. This global variable can have the same name as the kernel, since rocgdb will not break on it.

I do not follow your reasoning why the stub name is a problem. It's awkward, yes, but losing the stub as a specific kernel entry point seems to be a real loss in debugability, which is worse, IMO.
Could you give me an example where the stub name causes problems?

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D86376/new/

https://reviews.llvm.org/D86376