[PATCH] D79972: [OpenMP5.0] map item can be non-contiguous for target update

Tue Jun 2 16:29:06 PDT 2020

cchen added a comment.

In D79972#2069435 <https://reviews.llvm.org/D79972#2069435>, @ABataev wrote:

> In D79972#2069366 <https://reviews.llvm.org/D79972#2069366>, @cchen wrote:
>
> > In D79972#2069358 <https://reviews.llvm.org/D79972#2069358>, @ABataev wrote:
> >
> > > In D79972#2069322 <https://reviews.llvm.org/D79972#2069322>, @cchen wrote:
> > >
> > > > In D79972#2068976 <https://reviews.llvm.org/D79972#2068976>, @ABataev wrote:
> > > >
> > > > > Still: Did you think about implementing it in the compiler instead of the runtime?
> > > >
> > > >
> > > > I'm not sure I understand your question, which part of code are you asking?
> > > >  The main work compiler needs to do is to send the {offset, count, stride} struct to runtime.
> > >
> > >
> > > I mean did you think about calling `__tgt_target_data_update` function in a loop in the compiler-generated code instead of putting it into the runtime?
> >
> >
> > Oh, I would prefer to call `__tgt_target_data_update` once in the compiler, and that is what I'm doing now.
>
>
> I was not quite correct. What I mean, is to generate the array with the array section as VLA in the compiler, and fill it in the loop generated by the compiler for non-contiguous sections but not in the runtime?
>  Say, we have the code:
>
>   int arr[3][3];
>   ...
>   #pragma omp target update to(arr[1:2][1:2])
>  
>
>
> In this case, we're going to transfer the next elements:
>
>   000
>   0xx
>   0xx
>
>
> In the compiler-generated code we emit something like this:
>
>   void *bptr[<n>];
>   void *ptr[<n>];
>   int64 sizes[<n>];
>   int64 maptypes[<n>];
>   for (int i = 0; i < <n>; ++i) {
>     bptr[i] = &arr[1+i][1];
>     ptr[i] = &arr[1+i][1];
>     sizes[i] = ...;
>     maptypes[i] = ...;
>   }
>   call void @__tgt_target_data_update(i64 -1, i32 <n>, bptr, ptr, sizes, maptypes);
>
>
> With this solution, you won't need to modify the runtime and add a new mapping flag.

We discussed my current implementation in the bi-weekly meeting several weeks back, and there was general consensus that it was an acceptable approach.

The major advantage of sending a descriptor to the runtime is illustrated by the following example:

  #define N 10000
  int a[N][2];
  …
  #pragma omp target update to (a[0:N][0:1])

Splitting this section into contiguous chunks would require passing O(N) entries to the `__tgt_target_data_update` call, 10000 in this case. The current implementation only requires a descriptor with 2 entries. I think this could be a real concern: splitting out the transfers in compiler-generated code results in a list containing one entry per non-contiguous chunk (easily hitting scaling issues), while the descriptor approach is bounded by the number of dimensions.
That seems like a pretty compelling reason to use the descriptor; it’s much more space efficient.
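
To make the scaling argument concrete, here is a minimal C sketch that counts how many map entries each approach would need. The `dim_desc` struct and the `chunk_entries` helper are hypothetical names used only for illustration; they follow the {offset, count, stride} triple mentioned above and are not the actual layout used by the patch or by libomptarget.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-dimension descriptor: one {offset, count, stride}
   triple per array dimension (an assumption, not the patch's layout). */
typedef struct {
  uint64_t offset; /* first selected index in this dimension */
  uint64_t count;  /* number of selected indices */
  uint64_t stride; /* step between selected indices */
} dim_desc;

/* Number of entries needed when every contiguous chunk of the section
   becomes its own map entry. extent[k] is the full size of dimension k. */
uint64_t chunk_entries(const dim_desc *d, const uint64_t *extent,
                       size_t ndims) {
  size_t i = ndims;
  /* Trailing dimensions that are fully covered merge into the chunk
     of the enclosing dimension. */
  while (i > 0 && d[i - 1].stride == 1 && d[i - 1].offset == 0 &&
         d[i - 1].count == extent[i - 1])
    --i;
  /* The innermost partially covered dimension is still a single
     contiguous run when its stride is 1. */
  if (i > 0 && d[i - 1].stride == 1)
    --i;
  /* Every remaining outer index combination starts a new chunk. */
  uint64_t n = 1;
  for (size_t k = 0; k < i; ++k)
    n *= d[k].count;
  return n;
}
```

For `a[0:N][0:1]` with N = 10000, `chunk_entries` returns 10000, while the descriptor needs one `dim_desc` per dimension, i.e. 2. For the `arr[1:2][1:2]` example quoted above it returns 2, one chunk per selected row.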

Also, the descriptor idea is very similar to how Cray supported Fortran dope vectors for years (we send in a pointer to a dope vector rather than a pointer to the data, and a flag to indicate it’s a dope vector, and the runtime library handles it as a dope vector).
I think the runtime library changes will not be very extensive or difficult at all, and we’re very willing to implement the runtime support for non-contiguous sections.
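
As a rough idea of what the runtime side could look like, the following C sketch walks a descriptor and copies the selected elements. It is only a sketch under stated assumptions: `dim_desc` and `copy_section` are hypothetical names, `memcpy` stands in for the actual host-to-device transfer, and the flag that tells the runtime it received a descriptor is not modeled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-dimension descriptor (not the real runtime layout). */
typedef struct {
  uint64_t offset; /* first selected index in this dimension */
  uint64_t count;  /* number of selected indices */
  uint64_t stride; /* step between selected indices */
} dim_desc;

/* Copies the selected elements from src to the same positions in dst.
   byte_stride[k] is the distance in bytes between consecutive indices
   of dimension k. memcpy stands in for a device transfer here.
   Returns the number of element copies that were issued. */
size_t copy_section(char *dst, const char *src, const dim_desc *d,
                    const uint64_t *byte_stride, size_t ndims,
                    size_t elem_size) {
  if (ndims == 0) {
    memcpy(dst, src, elem_size);
    return 1;
  }
  size_t transfers = 0;
  for (uint64_t i = 0; i < d[0].count; ++i) {
    /* Byte offset of the i-th selected index in the current dimension. */
    uint64_t off = (d[0].offset + i * d[0].stride) * byte_stride[0];
    transfers += copy_section(dst + off, src + off, d + 1, byte_stride + 1,
                              ndims - 1, elem_size);
  }
  return transfers;
}
```

This version copies element by element to keep the recursion obvious; a real implementation would presumably merge contiguous runs in the innermost dimension into a single transfer. The point is that the loop lives in the runtime and is driven by a descriptor whose size depends only on the number of dimensions.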

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D79972