[PATCH] D86376: [HIP] Improve kernel launching latency

Wed Aug 26 09:29:01 PDT 2020

yaxunl added a comment.

In D86376#2236704 <https://reviews.llvm.org/D86376#2236704>, @tra wrote:

> 

> It's still suspiciously high. AFAICT, config/push/pull is just an std::vector push/pop. It should not take *that* long.  Few function calls should not lead to microseconds of overhead, once linker has resolved the symbol, if they come from a shared library.
> https://github.com/ROCm-Developer-Tools/HIP/blob/master/vdi/hip_platform.cpp#L590
>
> I wonder if it's the logging facilities that add all this overhead.

You are right. The 19 us are mostly due to overhead from rocprofiler. If I do not use rocprofiler and use a simple loop to measure execution time of `__hipPushCallConfigure/__hipPopCallConfigure`, I got 180 ns.

>> The kernel launching latency are measured by a simple loop in which a simple kernel is launched then hipStreamSynchronize is called. trace is collected by rocprofiler and the latency is measured from the end of hipStreamSynchronize to the real start of kernel execution. Without this patch, the latency is about 77 us. With this patch, the latency is about 46 us. The improvement is about 40%. The decrement of 31 us is more than 19 us since it also eliminates the overhead of kernel stub.
>
> This is rather surprising. A function call by itself does *not* have such high overhead. There must be something else. I strongly suspect logging. If you remove logging statements from push/pop without changing anything else, how does that affect performance?

The 19 us overhead was due to rocprofiler. Without rocprofiler, I can only measure the average duration of a kernel launching together with hipStreamSynchronize. When the kernel is empty, it serves as an estimation of kernel launching latency. With such measurement, the latency is about 14.0 us. The improvement due to this patch is not significant.

>> In a C/C++ program, a kernel is launched by call of hipLaunchKernel with the kernel symbol.
>
> Do you mean the host-side symbol, registered with the runtime that you've described above? Or do you mean that the device-side symbol is somehow visible from the host side. I think that's where HIP is different from CUDA.

I mean the host-side symbol. A host program can only use host-side symbol to launch a kernel.

> I do not follow your reasoning why the stub name is a problem. It's awkward, yes, but losing the stub as a specific kernel entry point seems to be a real loss in debugability, which is worse, IMO.
> Could you give me an example where the stub name causes problems?

For example, in HIP program, there is a kernel `void foo(int*)`. If a C++ program wants to launch it, the desirable way is

  void foo(int*);
  hipLaunchKernel(foo, grids, blocks, args, shmem, stream);

Due to the prefixed kernel stub name, currently the users have to use

  void __device_stub_foo(int*);
  hipLaunchKernel(__device_stub_foo, grids, blocks, args, shmem, stream);

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D86376/new/

https://reviews.llvm.org/D86376