[PATCH] D24715: [OpenCL] Block captured variables in dynamic parallelism - OpenCL 2.0

Tue Oct 4 13:58:52 PDT 2016

Anastasia added a comment.

> Regarding the improvement proposed by us which "flatten" captured variables into invoke_function argument list and block_literal pointer wouldn't be passed as first argument(to invoke_function) anymore. The reason why it doesn't require global memory management is that we can retrieve captured variables with cap_num field and cap_copy_helper routine INSIDE __enqueue_kernel_XXX and passed those captures as arguments to child kernel, rather than saving block_literal variable globally and postpone the retrieving actions until invoke_function, the child kernel body.

Just to be clear, we are now comparing the following two approaches:

(1)

  __enqueue_kernel_XXX ( ... block_literal, size ... ) { // size seems to be missing in the current implementation (?)
    ...
    // copy the block literal into memory accessible to the enqueued kernel
    memcpy(accessible_mem, block_literal, size); // for efficiency, (block_literal + static_header_size) can be used instead
    // activate block_literal->invoke as a kernel
    ...
  }

  void invoke_function(block_descriptor* d){
    use(d->capA, d->capB);
  }

(2)

  __enqueue_kernel_XXX ( ... block_literal ... ) {
    ...
    // copy the block literal captures into memory accessible to the enqueued kernel
    memcpy(accessible_mem, &(block_literal->capA), sizeof(block_literal->capA)); // which can be done using cap_copy_helper instead
    memcpy(accessible_mem + sizeof(block_literal->capA), &(block_literal->capB), sizeof(block_literal->capB)); // which can be done using cap_copy_helper instead
    // activate block_literal->invoke as a kernel
    ...
  }

  void invoke_function(capA_t capA, capB_t capB){
    use(capA, capB);
  }

From this picture I don't see how the flattening itself can help us avoid using global memory. Surely in both cases the contents of the captures will have to be copied into memory accessible to the enqueued kernel (which is global memory in the general case, but I am guessing it doesn't have to be in some cases). Perhaps I am missing some extra step in the approach you are proposing. If you rely on parameter passing via a normal function call into block_invoke, then in both cases we can skip the memcpy of the captures altogether. Otherwise both approaches will need to make a copy of the captures.
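To make the comparison concrete, here is a minimal host-side C sketch of the two approaches. It is only an illustration under assumptions: captures_t, use(), the enqueue_* helpers and the two int captures are hypothetical names, and a plain function call stands in for the actual kernel enqueue. In both variants the captures are first staged into memory the "child" can read.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical capture set: two int captures. */
typedef struct { int capA; int capB; } captures_t;

static int use(int a, int b) { return a + b; }

/* Approach (1): the invoke function reads captures through a pointer. */
static int invoke_by_pointer(const captures_t *d) {
    return use(d->capA, d->capB);
}

/* Approach (2): the invoke function takes flattened captures. */
static int invoke_flattened(int capA, int capB) {
    return use(capA, capB);
}

/* (1) Stage the whole capture struct, then pass a pointer to it. */
static int enqueue_by_pointer(const captures_t *src, captures_t *accessible_mem) {
    assert(src != NULL && accessible_mem != NULL);
    memcpy(accessible_mem, src, sizeof *src);
    return invoke_by_pointer(accessible_mem);
}

/* (2) Stage each capture separately, then load them as arguments. */
static int enqueue_flattened(const captures_t *src, unsigned char *accessible_mem) {
    int a, b;
    assert(src != NULL && accessible_mem != NULL);
    memcpy(accessible_mem, &src->capA, sizeof src->capA);
    memcpy(accessible_mem + sizeof src->capA, &src->capB, sizeof src->capB);
    /* The enqueueing side still has to read the staged copies back. */
    memcpy(&a, accessible_mem, sizeof a);
    memcpy(&b, accessible_mem + sizeof a, sizeof b);
    return invoke_flattened(a, b);
}
```

Either way a copy into the staging buffer happens; the variants differ only in how the invoke function receives the data.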

What we can improve, though, is avoiding the extra data copy by using the copy helpers you are proposing (this can also be achieved by calling memcpy with a pointer to the capture offset within block_literal and the size of the captures, instead of the whole block_literal, as highlighted above). We can also potentially avoid reloading the captures in the enqueued kernel through the capture flattening, but this depends on the calling convention (or rather the enqueueing convention, I would say).
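As a rough sketch of the "capture offset plus captures size" idea, assuming a hypothetical block literal layout with a fixed header followed by the captures (the field names and the copy_captures helper are made up for illustration and are not the actual Clang layout):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: static header first, captures after it. */
struct block_literal {
    void *isa;
    int flags;
    int reserved;
    void (*invoke)(void *);
    void *descriptor;
    int capA;   /* first capture */
    float capB; /* second capture */
};

/* Copy only the capture region (skipping the static header) into dst.
   Returns the number of bytes copied, which may include tail padding. */
static size_t copy_captures(void *dst, const struct block_literal *bl) {
    size_t offset = offsetof(struct block_literal, capA);
    size_t size = sizeof(struct block_literal) - offset;
    assert(dst != NULL && bl != NULL);
    memcpy(dst, (const char *)bl + offset, size);
    return size;
}
```

This needs only a single memcpy and no per-capture helper, at the cost of relying on the captures being laid out contiguously after the header.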

> About the second implementation category, which implement builtins directly in compiler, we haven't spend time thinking about its detail approach. Maybe we can discuss about this.

To be honest, I haven't looked at it either, as I also find it quite difficult from the point of view of maintenance as well as future modifications.

https://reviews.llvm.org/D24715