[cfe-dev] Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

Lingda Li via cfe-dev cfe-dev at lists.llvm.org
Thu Jun 27 10:54:10 PDT 2019


Hi,

Alexey Bataev and I (Lingda Li) would like to draw your attention to an
ongoing discussion of two schemes for implementing declare mapper in OpenMP
5.0. The detailed discussion can be found at https://reviews.llvm.org/D59474

Scheme 1 (the one I have implemented in
https://reviews.llvm.org/D59474):
The detailed design can be found at
https://github.com/lingda-li/public-sharing/blob/master/mapper_runtime_design.pptx
For each mapper function, the compiler generates a function like this:

```
void <type>.mapper(void *base, void *begin, size_t size, int64_t type) {
  // `size` is the number of elements of type Ty being mapped.
  // Allocate space for an array section first.
  if (size > 1 && !isDelete(type))
    <push>(base, begin, size * sizeof(Ty), clearToFrom(type));

  // Map members.
  for (unsigned i = 0; i < size; i++) {
    // For each component specified by this mapper:
    for (auto c : components) {
      ...; // code to generate c.arg_base, c.arg_begin, c.arg_size, c.arg_type
      if (c.hasMapper())
        (*c.Mapper())(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
      else
        <push>(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
    }
  }
  // Delete the array section.
  if (size > 1 && isDelete(type))
    <push>(base, begin, size * sizeof(Ty), clearToFrom(type));
}
```
This function is passed to the OpenMP runtime, and the runtime will call
this function to finish the data mapping.
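
To make the shape of Scheme 1 concrete, here is a minimal sketch of what such a generated function might look like for a hypothetical `struct Vec` with a mapper equivalent to `map(v.len, v.data[0:v.len])`. Everything here is an assumption for illustration: `MapEntry`, `push`, `g_entries`, `Vec_mapper` and the flag constants are stand-ins, not the libomptarget interface, and the base/begin choice for the pointer member is simplified.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flag values and recording "runtime"; not the real API.
constexpr int64_t kTo = 0x1, kFrom = 0x2, kDelete = 0x8;

struct MapEntry { void *base; void *begin; size_t size; int64_t type; };
static std::vector<MapEntry> g_entries;  // entries the mock runtime received

// Stand-in for <push>: just records the entry.
static void push(void *base, void *begin, size_t size, int64_t type) {
  g_entries.push_back({base, begin, size, type});
}

struct Vec { int len; double *data; };

// Sketch of a Scheme 1 mapper for Vec. `size` counts Vec elements.
void Vec_mapper(void *base, void *begin, size_t size, int64_t type) {
  assert(begin != nullptr);
  Vec *elems = static_cast<Vec *>(begin);
  const bool is_delete = (type & kDelete) != 0;
  const int64_t alloc_type = type & ~(kTo | kFrom);  // clearToFrom(type)

  // Allocate space for the whole array section first.
  if (size > 1 && !is_delete)
    push(base, begin, size * sizeof(Vec), alloc_type);

  // Map the members of every element; addresses are computed per element.
  for (size_t i = 0; i < size; ++i) {
    Vec &v = elems[i];
    push(&v, &v.len, sizeof(v.len), type);
    push(&v.data, v.data, static_cast<size_t>(v.len) * sizeof(double), type);
  }

  // Delete the array section last.
  if (size > 1 && is_delete)
    push(base, begin, size * sizeof(Vec), alloc_type);
}
```

Calling `Vec_mapper(vs, vs, 2, kTo)` on a two-element array records one entry for the array section followed by two entries per element, all computed inside the generated function.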


Scheme 2 (which Alexey proposes):
Alexey proposes moving parts of the mapper function above into the OpenMP
runtime, so that the compiler generates the code below:
```
void <type>.mapper(void *base, void *begin, size_t size, int64_t type) {
  ...; // code to generate arg_base, arg_begin, arg_size, arg_type, arg_mapper.
  auto sub_components[] = {...}; // fill in generated begin, base, ...
  __tgt_mapper(base, begin, size, type, sub_components);
}
```

`__tgt_mapper` is a runtime function like the following:
```
void __tgt_mapper(void *base, void *begin, size_t size, int64_t type,
                  auto components[]) {
  // Allocate space for an array section first.
  if (size > 1 && !isDelete(type))
    <push>(base, begin, size * sizeof(Ty), clearToFrom(type));

  // Map members.
  for (unsigned i = 0; i < size; i++) {
    // For each component specified by this mapper:
    for (auto c : components) {
      if (c.hasMapper())
        (*c.Mapper())(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
      else
        <push>(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
    }
  }
  // Delete the array section.
  if (size > 1 && isDelete(type))
    <push>(base, begin, size * sizeof(Ty), clearToFrom(type));
}
```
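
For comparison, here is a sketch of one possible reading of Scheme 2, in which the generated function flattens the components of every array element into a table and hands it to a runtime helper. This is only an illustration under assumptions: `Component`, `tgt_mapper_sketch`, its extra `elem_size` and `num_components` parameters, `push`, and the counters are hypothetical, and the actual proposal may lay out the table differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flag values; not the real API.
constexpr int64_t kTo = 0x1, kFrom = 0x2, kDelete = 0x8;

using MapperFn = void (*)(void *, void *, size_t, int64_t);
struct Component {
  void *base; void *begin; size_t size; int64_t type;
  MapperFn mapper;  // nullptr when the component has no nested mapper
};

static size_t g_pushed = 0;     // entries the mock runtime has recorded
static size_t g_table_len = 0;  // length of the last component table built
static void push(void *, void *, size_t, int64_t) { ++g_pushed; }

// Runtime side: generic loop over a table prepared by the compiler.
void tgt_mapper_sketch(void *base, void *begin, size_t count,
                       size_t elem_size, int64_t type,
                       const Component *components, size_t num_components) {
  const bool is_delete = (type & kDelete) != 0;
  const int64_t alloc_type = type & ~(kTo | kFrom);  // clearToFrom(type)
  if (count > 1 && !is_delete)
    push(base, begin, count * elem_size, alloc_type);
  for (size_t i = 0; i < num_components; ++i) {
    const Component &c = components[i];
    if (c.mapper)
      c.mapper(c.base, c.begin, c.size, c.type);
    else
      push(c.base, c.begin, c.size, c.type);
  }
  if (count > 1 && is_delete)
    push(base, begin, count * elem_size, alloc_type);
}

struct Vec { int len; double *data; };

// Compiler side: only fills in the table, then defers to the runtime.
void Vec_mapper(void *base, void *begin, size_t size, int64_t type) {
  Vec *elems = static_cast<Vec *>(begin);
  std::vector<Component> table;  // heap storage that grows with `size`
  table.reserve(2 * size);
  for (size_t i = 0; i < size; ++i) {
    Vec &v = elems[i];
    table.push_back({&v, &v.len, sizeof(v.len), type, nullptr});
    table.push_back({&v.data, v.data,
                     static_cast<size_t>(v.len) * sizeof(double), type,
                     nullptr});
  }
  g_table_len = table.size();
  tgt_mapper_sketch(base, begin, size, sizeof(Vec), type, table.data(),
                    table.size());
}
```

Under this reading the generated function is shorter, but the table holds two entries per array element, so its size is only known at run time.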

Comparison:
Why choose 1 over 2:
1. In scheme 2, the compiler needs to generate all map types and pass them
to __tgt_mapper through sub_components. But in this case the compiler
cannot generate the correct MEMBER_OF field in the map type, so the
runtime has to fix it up using the mechanism we already have here:
__tgt_mapper_num_components. This not only increases complexity, it also
means the runtime has to manipulate the map type further, which hurts
data locality. In the current scheme (scheme 1), the map type is generated
once by the compiler, so data locality is very good.
2. In scheme 2, sub_components includes all components that should be
mapped. If we are mapping an array, this means many components, and the
memory for sub_components has to be allocated on the heap. This adds a
memory management burden and is not an efficient use of memory.
3. In scheme 1, we are able to inline nested mapper functions. The compiler
can then optimize the mapper function further, e.g., by eliminating
redundant computation and unrolling loops, and thus potentially achieve
better performance. These optimizations are not possible in scheme 2.
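
The MEMBER_OF issue in point 1 can be sketched as follows. The sketch assumes that MEMBER_OF lives in the top 16 bits of the 64-bit map type and stores the parent's entry index plus one (zero meaning "no parent"); that layout, and the helper names `make_member_of` and `rebase_member_of`, are assumptions for illustration only. An unsigned type is used here to keep the shifts well defined, whereas the pseudocode above passes the map type as `int64_t`.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: top 16 bits hold (parent entry index + 1), 0 = no parent.
constexpr int kMemberOfShift = 48;
constexpr uint64_t kMemberOfMask = 0xffffULL << kMemberOfShift;

// Encode "member of the entry at parent_index" into a MEMBER_OF field.
constexpr uint64_t make_member_of(uint64_t parent_index) {
  return (parent_index + 1) << kMemberOfShift;
}

// A mapper numbers its own entries starting from 0, but by the time it runs
// some entries may already have been pushed. Rebasing adds that count to the
// MEMBER_OF field and leaves the remaining map type bits untouched.
constexpr uint64_t rebase_member_of(uint64_t type, uint64_t entries_before) {
  const uint64_t member = (type & kMemberOfMask) >> kMemberOfShift;
  if (member == 0)
    return type;  // not a member of anything; nothing to fix up
  return (type & ~kMemberOfMask) |
         ((member + entries_before) << kMemberOfShift);
}
```

For example, a component that is a member of the mapper's local entry 0 must point at entry 5 if five entries were pushed before the mapper ran. The number of preceding entries is only known at run time, which is why some fix-up step is needed wherever the final map type is produced.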

Why choose 2 over 1:
1. Less code in the mapper function codegen (I doubt this, because the
codegen function for scheme 1 is fewer than 200 lines of code)


We would appreciate it if you could share your opinions.

Thanks,
Lingda Li