[Openmp-commits] [PATCH] D80649: [OpenMP] Improve D2D memcpy to use more efficient driver API

Wed May 27 11:23:17 PDT 2020

tianshilei1992 added a comment.

Here are the execution results, copied from a run on Summit.

  ==22767== NVPROF is profiling process 22767, command: ./d2d_memcpy
  ==22767== Profiling application: ./d2d_memcpy
  PASS
  ==22767== Profiling result:
     Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream          Src Dev   Src Ctx          Dst Dev   Dst Ctx  Name
  949.72ms  1.7920us                    -               -         -         -         -        1B  544.96KB/s      Device    Pageable  Tesla V100-SXM2         1         7                -         -                -         -  [CUDA memcpy DtoH]
  949.77ms  1.7920us                    -               -         -         -         -        1B  544.96KB/s      Device    Pageable  Tesla V100-SXM2         1         7                -         -                -         -  [CUDA memcpy DtoH]
  949.80ms  1.5360us                    -               -         -         -         -        4B  2.4835MB/s    Pageable      Device  Tesla V100-SXM2         1         7                -         -                -         -  [CUDA memcpy HtoD]
  949.87ms  457.87ms        (2097152 1 1)       (128 1 1)        44      946B        0B         -           -           -           -  Tesla V100-SXM2         1        19                -         -                -         -  __omp_offloading_32_a7b5d52_main_l34 [128]
  1.40840s  22.820ms                    -               -         -         -         -  1.0000GB  43.822GB/s      Device      Device  Tesla V100-SXM2         1        19  Tesla V100-SXM2         1  Tesla V100-SXM2         2  [CUDA memcpy PtoP]
  1.46565s  1.7920us                    -               -         -         -         -        1B  544.96KB/s      Device    Pageable  Tesla V100-SXM2         2        52                -         -                -         -  [CUDA memcpy DtoH]
  1.46568s  1.7920us                    -               -         -         -         -        1B  544.96KB/s      Device    Pageable  Tesla V100-SXM2         2        52                -         -                -         -  [CUDA memcpy DtoH]
  1.46572s  1.5360us                    -               -         -         -         -        4B  2.4835MB/s    Pageable      Device  Tesla V100-SXM2         2        52                -         -                -         -  [CUDA memcpy HtoD]
  1.48614s  492.70ms        (2097152 1 1)       (128 1 1)        46      946B        0B         -           -           -           -  Tesla V100-SXM2         2        64                -         -                -         -  __omp_offloading_32_a7b5d52_main_l49 [149]
  1.97885s  159.89ms                    -               -         -         -         -  1.0000GB  6.2542GB/s      Device    Pageable  Tesla V100-SXM2         2        64                -         -                -         -  [CUDA memcpy DtoH]

  Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
  SSMem: Static shared memory allocated per CUDA block.
  DSMem: Dynamic shared memory allocated per CUDA block.
  SrcMemType: The type of source memory accessed by memory operation/copy
  DstMemType: The type of destination memory accessed by memory operation/copy

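With the peer-to-peer copy, the 1 GB device-to-device transfer takes 22.820 ms, a throughput of about 43.8 GB/s. For comparison, the 1 GB DtoH copy to pageable host memory in the same run takes 159.89 ms, about 6.25 GB/s.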
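
As a sanity check, the throughput column can be recomputed from the Size and Duration columns. The sketch below is only illustrative: the helper name throughputGiBPerSec is hypothetical, and it assumes nvprof's "GB" means 2^30 bytes, which is consistent with the 4B / 1.5360us row reporting 2.4835MB/s.

```cpp
#include <cassert>

// Assumption: nvprof reports binary units, so 1 "GB" is 2^30 bytes.
constexpr double kBytesPerGiB = 1024.0 * 1024.0 * 1024.0;

// Hypothetical helper: throughput in GiB/s for a transfer of `bytes`
// that took `milliseconds`, as listed in the Duration column.
double throughputGiBPerSec(double bytes, double milliseconds) {
  assert(milliseconds > 0.0 && "duration must be positive");
  const double seconds = milliseconds / 1000.0;
  return (bytes / kBytesPerGiB) / seconds;
}
```

Calling throughputGiBPerSec(kBytesPerGiB, 22.820) gives roughly 43.8 for the PtoP row, and throughputGiBPerSec(kBytesPerGiB, 159.89) gives roughly 6.25 for the final DtoH row. In this run, the peer-to-peer path is therefore about 7x faster than the pageable DtoH copy of the same size.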

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D80649/new/

https://reviews.llvm.org/D80649