Sorry for the late reply; I have been on vacation for a while.

One method that I haven't seen mentioned is to separate the kernel semantics from the function definition.

All the kernel attribute does is specify that the function is an entry point to the device from the host. So why not create a separate entry point that is callable only by the host, and have every call from the device go to the original function?

For example, you have two functions and one calls the other:

kernel foo() {
kernel bar() {

If you separate the kernel entry point from the function body, then handling this becomes easy.

You end up with four functions:

kernel foo_kernel() {

foo() {

kernel bar_kernel() {

bar() {

Then the issue is no longer a compilation problem, just a runtime entry-point issue. Instead of calling foo(), the runtime calls foo_kernel(), which handles all of the kernel setup and then calls the function body itself.

This removes the need for any metadata nodes in the IR and lets the kernel entry function handle device-specific setup, such as __local variables, id/group calculations, memory offsets, etc., without hurting the performance of one kernel calling another.
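To make the split concrete, here is a rough sketch in plain C rather than real OpenCL. The work_ctx struct, make_ctx(), and the parameter lists are all made up for illustration; an actual device would do its own setup (including any __local allocation) in the wrapper. The point is only that the *_kernel wrappers do the setup once, and bar() reaches foo() through an ordinary call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-work-item context that an entry wrapper sets up. */
typedef struct {
    size_t global_id;  /* id of this work-item */
    size_t group_id;   /* id of the work-group it belongs to */
} work_ctx;

/* Function bodies: ordinary functions, callable from other device code. */
static void foo(const work_ctx *ctx, float *out) {
    out[ctx->global_id] = (float)ctx->group_id;
}

static void bar(const work_ctx *ctx, float *out) {
    foo(ctx, out);                /* "kernel calls kernel" is a plain call */
    out[ctx->global_id] += 1.0f;
}

/* Setup shared by the entry points (assumed: ids derived from a flat index). */
static work_ctx make_ctx(size_t global_id, size_t group_size) {
    assert(group_size > 0);
    work_ctx ctx = { global_id, global_id / group_size };
    return ctx;
}

/* Entry points: only the host runtime calls these. */
void foo_kernel(size_t global_id, size_t group_size, float *out) {
    work_ctx ctx = make_ctx(global_id, group_size);
    foo(&ctx, out);
}

void bar_kernel(size_t global_id, size_t group_size, float *out) {
    work_ctx ctx = make_ctx(global_id, group_size);
    bar(&ctx, out);
}
```

A host-side launch would call bar_kernel(i, group_size, out) once per work-item i. The nested call from bar() to foo() reuses the context that was already built and never repeats the setup.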


