[Openmp-commits] [PATCH] D80649: [OpenMP] Improve D2D memcpy to use more efficient driver API

Shilei Tian via Phabricator via Openmp-commits openmp-commits at lists.llvm.org
Wed May 27 11:23:16 PDT 2020


tianshilei1992 created this revision.
Herald added subscribers: openmp-commits, sstefan1, guansong, yaxunl.
Herald added a reviewer: jdoerfert.
Herald added a project: OpenMP.
tianshilei1992 added a comment.

Just copying the execution results from Summit:

  ==22767== NVPROF is profiling process 22767, command: ./d2d_memcpy
  ==22767== Profiling application: ./d2d_memcpy
  PASS
  ==22767== Profiling result:
     Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream          Src Dev   Src Ctx          Dst Dev   Dst Ctx  Name
  949.72ms  1.7920us                    -               -         -         -         -        1B  544.96KB/s      Device    Pageable  Tesla V100-SXM2         1         7                -         -                -         -  [CUDA memcpy DtoH]
  949.77ms  1.7920us                    -               -         -         -         -        1B  544.96KB/s      Device    Pageable  Tesla V100-SXM2         1         7                -         -                -         -  [CUDA memcpy DtoH]
  949.80ms  1.5360us                    -               -         -         -         -        4B  2.4835MB/s    Pageable      Device  Tesla V100-SXM2         1         7                -         -                -         -  [CUDA memcpy HtoD]
  949.87ms  457.87ms        (2097152 1 1)       (128 1 1)        44      946B        0B         -           -           -           -  Tesla V100-SXM2         1        19                -         -                -         -  __omp_offloading_32_a7b5d52_main_l34 [128]
  1.40840s  22.820ms                    -               -         -         -         -  1.0000GB  43.822GB/s      Device      Device  Tesla V100-SXM2         1        19  Tesla V100-SXM2         1  Tesla V100-SXM2         2  [CUDA memcpy PtoP]
  1.46565s  1.7920us                    -               -         -         -         -        1B  544.96KB/s      Device    Pageable  Tesla V100-SXM2         2        52                -         -                -         -  [CUDA memcpy DtoH]
  1.46568s  1.7920us                    -               -         -         -         -        1B  544.96KB/s      Device    Pageable  Tesla V100-SXM2         2        52                -         -                -         -  [CUDA memcpy DtoH]
  1.46572s  1.5360us                    -               -         -         -         -        4B  2.4835MB/s    Pageable      Device  Tesla V100-SXM2         2        52                -         -                -         -  [CUDA memcpy HtoD]
  1.48614s  492.70ms        (2097152 1 1)       (128 1 1)        46      946B        0B         -           -           -           -  Tesla V100-SXM2         2        64                -         -                -         -  __omp_offloading_32_a7b5d52_main_l49 [149]
  1.97885s  159.89ms                    -               -         -         -         -  1.0000GB  6.2542GB/s      Device    Pageable  Tesla V100-SXM2         2        64                -         -                -         -  [CUDA memcpy DtoH]
  
  Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
  SSMem: Static shared memory allocated per CUDA block.
  DSMem: Dynamic shared memory allocated per CUDA block.
  SrcMemType: The type of source memory accessed by memory operation/copy
  DstMemType: The type of destination memory accessed by memory operation/copy

With the peer-to-peer copy, the throughput can reach more than 43 GB/s.


In the current implementation, a D2D memcpy first copies the data back to the host and
then copies it from the host to the destination device. This is very inefficient when the
devices support direct D2D memcpy, as CUDA does.
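
Roughly, that fallback path looks like the sketch below. This is not the actual
libomptarget code; names and error handling are illustrative, and switching the
current context between the two devices is omitted for brevity.

  // Host-staged device-to-device copy: stage the data through a temporary
  // host buffer, so the transfer crosses the interconnect twice.
  #include <cuda.h>
  #include <stdlib.h>

  static int staged_d2d_copy(CUdeviceptr SrcPtr, CUdeviceptr DstPtr,
                             size_t Size) {
    void *Buffer = malloc(Size);
    if (!Buffer)
      return 1;
    // Device -> host, then host -> device: two transfers instead of one
    // direct copy between the devices.
    if (cuMemcpyDtoH(Buffer, SrcPtr, Size) != CUDA_SUCCESS ||
        cuMemcpyHtoD(DstPtr, Buffer, Size) != CUDA_SUCCESS) {
      free(Buffer);
      return 1;
    }
    free(Buffer);
    return 0;
  }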

In this patch, D2D memcpy first tries the natively supported driver API. If that
fails, it falls back to the original host-staged way. It is worth noting that D2D
memcpy in this scenario covers two cases:

- Same device: this is the D2D memcpy within one CUDA context.
- Different devices: this is the peer-to-peer memcpy between two CUDA contexts.

My implementation merges these two cases: it chooses the best API according to
the source and destination devices, as in the sketch below.
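
A minimal sketch of the idea, assuming a CUDA driver API plugin. Names such as
data_exchange, SrcCtx, and DstCtx are illustrative, not the identifiers used in
the patch; returning non-zero lets libomptarget fall back to the host-staged
path shown above.

  #include <cuda.h>

  static int data_exchange(int SrcDevId, CUcontext SrcCtx, CUdeviceptr SrcPtr,
                           int DstDevId, CUcontext DstCtx, CUdeviceptr DstPtr,
                           size_t Size) {
    CUresult Err;
    if (SrcDevId == DstDevId) {
      // Same device: a plain device-to-device copy within one context.
      cuCtxSetCurrent(SrcCtx);
      Err = cuMemcpyDtoD(DstPtr, SrcPtr, Size);
    } else {
      // Different devices: a peer-to-peer copy between the two contexts
      // (shows up as [CUDA memcpy PtoP] in the nvprof trace above).
      Err = cuMemcpyPeer(DstPtr, DstCtx, SrcPtr, SrcCtx, Size);
    }
    // Non-zero signals failure so the caller can use the fallback path.
    return Err == CUDA_SUCCESS ? 0 : 1;
  }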


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D80649

Files:
  openmp/libomptarget/include/omptargetplugin.h
  openmp/libomptarget/plugins/cuda/src/rtl.cpp
  openmp/libomptarget/plugins/exports
  openmp/libomptarget/src/api.cpp
  openmp/libomptarget/src/device.cpp
  openmp/libomptarget/src/device.h
  openmp/libomptarget/src/rtl.cpp
  openmp/libomptarget/src/rtl.h
  openmp/libomptarget/test/offloading/d2d_memcpy.c

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D80649.266606.patch
Type: text/x-patch
Size: 14718 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/openmp-commits/attachments/20200527/7a5f704d/attachment-0001.bin>

