[Openmp-commits] [PATCH] D80649: [OpenMP] Improve D2D memcpy to use more efficient driver API
Shilei Tian via Phabricator via Openmp-commits
openmp-commits at lists.llvm.org
Wed May 27 11:23:17 PDT 2020
tianshilei1992 added a comment.
These are the execution results, copied from a run on Summit.
==22767== NVPROF is profiling process 22767, command: ./d2d_memcpy
==22767== Profiling application: ./d2d_memcpy
PASS
==22767== Profiling result:
Start     Duration  Grid Size      Block Size  Regs*  SSMem*  DSMem*  Size      Throughput  SrcMemType  DstMemType  Device           Context  Stream  Src Dev          Src Ctx  Dst Dev          Dst Ctx  Name
949.72ms  1.7920us  -              -           -      -       -       1B        544.96KB/s  Device      Pageable    Tesla V100-SXM2  1        7       -                -        -                -        [CUDA memcpy DtoH]
949.77ms  1.7920us  -              -           -      -       -       1B        544.96KB/s  Device      Pageable    Tesla V100-SXM2  1        7       -                -        -                -        [CUDA memcpy DtoH]
949.80ms  1.5360us  -              -           -      -       -       4B        2.4835MB/s  Pageable    Device      Tesla V100-SXM2  1        7       -                -        -                -        [CUDA memcpy HtoD]
949.87ms  457.87ms  (2097152 1 1)  (128 1 1)   44     946B    0B      -         -           -           -           Tesla V100-SXM2  1        19      -                -        -                -        __omp_offloading_32_a7b5d52_main_l34 [128]
1.40840s  22.820ms  -              -           -      -       -       1.0000GB  43.822GB/s  Device      Device      Tesla V100-SXM2  1        19      Tesla V100-SXM2  1        Tesla V100-SXM2  2        [CUDA memcpy PtoP]
1.46565s  1.7920us  -              -           -      -       -       1B        544.96KB/s  Device      Pageable    Tesla V100-SXM2  2        52      -                -        -                -        [CUDA memcpy DtoH]
1.46568s  1.7920us  -              -           -      -       -       1B        544.96KB/s  Device      Pageable    Tesla V100-SXM2  2        52      -                -        -                -        [CUDA memcpy DtoH]
1.46572s  1.5360us  -              -           -      -       -       4B        2.4835MB/s  Pageable    Device      Tesla V100-SXM2  2        52      -                -        -                -        [CUDA memcpy HtoD]
1.48614s  492.70ms  (2097152 1 1)  (128 1 1)   46     946B    0B      -         -           -           -           Tesla V100-SXM2  2        64      -                -        -                -        __omp_offloading_32_a7b5d52_main_l49 [149]
1.97885s  159.89ms  -              -           -      -       -       1.0000GB  6.2542GB/s  Device      Pageable    Tesla V100-SXM2  2        64      -                -        -                -        [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
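With the peer-to-peer copy, the 1 GB device-to-device transfer reaches 43.8 GB/s. For comparison, the 1 GB device-to-host copy into pageable memory in the same run reaches about 6.25 GB/s.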
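The throughput figures in the trace can be cross-checked from the Size and Duration columns. The snippet below is only a sanity check, not part of the patch. The helper name `throughput_gb_per_s` is made up for illustration, and it assumes the reported throughput is simply size divided by duration in nvprof's own GB unit. The two copies it compares are of different kinds (PtoP versus DtoH to pageable host memory), so the ratio is indicative only.

```python
def throughput_gb_per_s(size_gb: float, duration_ms: float) -> float:
    """Throughput in GB/s for a transfer of size_gb that took duration_ms.

    Hypothetical helper: assumes throughput = size / duration,
    with size in the same GB unit that nvprof prints.
    """
    return size_gb / (duration_ms / 1000.0)


# Values taken from the 1.0000GB rows of the trace above.
p2p = throughput_gb_per_s(1.0, 22.820)    # [CUDA memcpy PtoP]
dtoh = throughput_gb_per_s(1.0, 159.89)   # [CUDA memcpy DtoH], pageable host memory

print(f"PtoP: {p2p:.2f} GB/s")
print(f"DtoH: {dtoh:.2f} GB/s")
print(f"ratio: {p2p / dtoh:.1f}x")
```

The results agree with the Throughput column to within the rounding of the printed durations, and the peer-to-peer copy comes out roughly seven times faster than the pageable device-to-host copy of the same size.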
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D80649/new/
https://reviews.llvm.org/D80649