[Openmp-commits] [PATCH] D80649: [OpenMP] Improve D2D memcpy to use more efficient driver API
Shilei Tian via Phabricator via Openmp-commits
openmp-commits at lists.llvm.org
Wed May 27 11:23:16 PDT 2020
tianshilei1992 created this revision.
Herald added subscribers: openmp-commits, sstefan1, guansong, yaxunl.
Herald added a reviewer: jdoerfert.
Herald added a project: OpenMP.
tianshilei1992 added a comment.
Just copying the execution results from Summit:
==22767== NVPROF is profiling process 22767, command: ./d2d_memcpy
==22767== Profiling application: ./d2d_memcpy
PASS
==22767== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Src Dev Src Ctx Dst Dev Dst Ctx Name
949.72ms 1.7920us - - - - - 1B 544.96KB/s Device Pageable Tesla V100-SXM2 1 7 - - - - [CUDA memcpy DtoH]
949.77ms 1.7920us - - - - - 1B 544.96KB/s Device Pageable Tesla V100-SXM2 1 7 - - - - [CUDA memcpy DtoH]
949.80ms 1.5360us - - - - - 4B 2.4835MB/s Pageable Device Tesla V100-SXM2 1 7 - - - - [CUDA memcpy HtoD]
949.87ms 457.87ms (2097152 1 1) (128 1 1) 44 946B 0B - - - - Tesla V100-SXM2 1 19 - - - - __omp_offloading_32_a7b5d52_main_l34 [128]
1.40840s 22.820ms - - - - - 1.0000GB 43.822GB/s Device Device Tesla V100-SXM2 1 19 Tesla V100-SXM2 1 Tesla V100-SXM2 2 [CUDA memcpy PtoP]
1.46565s 1.7920us - - - - - 1B 544.96KB/s Device Pageable Tesla V100-SXM2 2 52 - - - - [CUDA memcpy DtoH]
1.46568s 1.7920us - - - - - 1B 544.96KB/s Device Pageable Tesla V100-SXM2 2 52 - - - - [CUDA memcpy DtoH]
1.46572s 1.5360us - - - - - 4B 2.4835MB/s Pageable Device Tesla V100-SXM2 2 52 - - - - [CUDA memcpy HtoD]
1.48614s 492.70ms (2097152 1 1) (128 1 1) 46 946B 0B - - - - Tesla V100-SXM2 2 64 - - - - __omp_offloading_32_a7b5d52_main_l49 [149]
1.97885s 159.89ms - - - - - 1.0000GB 6.2542GB/s Device Pageable Tesla V100-SXM2 2 64 - - - - [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
With the peer-to-peer copy, the throughput reaches 43+ GB/s (1.0000 GB in 22.820 ms in the
[CUDA memcpy PtoP] row), compared with about 6.25 GB/s for the 1 GB device-to-host copy
later in the trace (1.0000 GB in 159.89 ms).
In the current implementation, a D2D memcpy first copies the data back to the host and
then copies it from the host to the destination device. This is very inefficient when the
device natively supports D2D memcpy, as CUDA does. The host-staged path is sketched below.
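For reference, a minimal sketch of that host-staged fallback, assuming hypothetical
retrieveData/submitData helpers that stand in for the plugin's data_retrieve/data_submit
entry points (the actual libomptarget code is not reproduced here):

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical stand-ins for the plugin's data_retrieve (device-to-host) and
// data_submit (host-to-device) entry points.
int32_t retrieveData(int32_t DeviceId, void *HstPtr, void *TgtPtr, int64_t Size);
int32_t submitData(int32_t DeviceId, void *TgtPtr, void *HstPtr, int64_t Size);

// Host-staged device-to-device copy: D2H into a temporary buffer, then H2D.
int32_t copyThroughHost(int32_t SrcId, void *SrcPtr, int32_t DstId,
                        void *DstPtr, int64_t Size) {
  void *Buffer = std::malloc(Size);
  if (!Buffer)
    return 1;
  int32_t Rc = retrieveData(SrcId, Buffer, SrcPtr, Size);
  if (Rc == 0)
    Rc = submitData(DstId, DstPtr, Buffer, Size);
  std::free(Buffer);
  return Rc;
}
```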
In this patch, the D2D memcpy first tries the natively supported driver API; if that
fails, it falls back to the original host-staged way. It is worth noting that D2D memcpy
in this scenario covers two cases:
- Same device: this is a plain D2D memcpy in the CUDA context.
- Different devices: this is a peer-to-peer memcpy in the CUDA context.
My implementation merges these two cases and chooses the best API according to the
source device and destination device, as sketched below.
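A minimal sketch of that selection logic, assuming a hypothetical getDeviceContext()
helper for the per-device CUcontext the plugin keeps; for brevity the plugin device ids
are used directly as CUdevice handles, whereas the real plugin resolves them via
cuDeviceGet. This is an illustration, not the actual rtl.cpp change:

```cpp
#include <cuda.h>
#include <cstdint>

// Hypothetical helper: returns the CUcontext the plugin created for a device.
CUcontext getDeviceContext(int32_t DeviceId);

// Try a native device-to-device copy; return 0 on success and non-zero on
// failure so libomptarget can fall back to the host-staged path above.
int32_t dataExchange(int32_t SrcId, void *SrcPtr, int32_t DstId, void *DstPtr,
                     int64_t Size) {
  if (cuCtxSetCurrent(getDeviceContext(SrcId)) != CUDA_SUCCESS)
    return 1;

  if (SrcId == DstId) {
    // Same device: an ordinary D2D copy within one CUDA context.
    return cuMemcpyDtoD((CUdeviceptr)DstPtr, (CUdeviceptr)SrcPtr, Size) ==
                   CUDA_SUCCESS
               ? 0
               : 1;
  }

  // Different devices: only use the peer-to-peer copy if the driver reports
  // the two devices can access each other; otherwise let the caller fall back.
  int CanAccessPeer = 0;
  if (cuDeviceCanAccessPeer(&CanAccessPeer, SrcId, DstId) != CUDA_SUCCESS ||
      !CanAccessPeer)
    return 1;

  return cuMemcpyPeer((CUdeviceptr)DstPtr, getDeviceContext(DstId),
                      (CUdeviceptr)SrcPtr, getDeviceContext(SrcId), Size) ==
                 CUDA_SUCCESS
             ? 0
             : 1;
}
```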
Repository:
rG LLVM Github Monorepo
https://reviews.llvm.org/D80649
Files:
openmp/libomptarget/include/omptargetplugin.h
openmp/libomptarget/plugins/cuda/src/rtl.cpp
openmp/libomptarget/plugins/exports
openmp/libomptarget/src/api.cpp
openmp/libomptarget/src/device.cpp
openmp/libomptarget/src/device.h
openmp/libomptarget/src/rtl.cpp
openmp/libomptarget/src/rtl.h
openmp/libomptarget/test/offloading/d2d_memcpy.c
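The added test is not reproduced here; the snippet below is only a minimal illustration
of the user-level API that exercises this path, namely omp_target_memcpy between two
non-host devices:

```cpp
#include <omp.h>
#include <cstdio>
#include <cstdlib>

int main() {
  const size_t N = 1024;
  if (omp_get_num_devices() < 2) {
    std::printf("requires at least two devices\n");
    return 0;
  }

  // One buffer on each of the first two devices.
  int *Src = (int *)omp_target_alloc(N * sizeof(int), 0);
  int *Dst = (int *)omp_target_alloc(N * sizeof(int), 1);
  if (!Src || !Dst)
    return 1;

  // Fill the source buffer from the host.
  int *Host = (int *)std::malloc(N * sizeof(int));
  for (size_t I = 0; I < N; ++I)
    Host[I] = (int)I;
  omp_target_memcpy(Src, Host, N * sizeof(int), /*dst_offset=*/0,
                    /*src_offset=*/0, /*dst_device=*/0,
                    /*src_device=*/omp_get_initial_device());

  // Device-to-device copy: with this patch the CUDA plugin can service this
  // call natively instead of staging the data through the host.
  omp_target_memcpy(Dst, Src, N * sizeof(int), 0, 0, /*dst_device=*/1,
                    /*src_device=*/0);

  omp_target_free(Src, 0);
  omp_target_free(Dst, 1);
  std::free(Host);
  return 0;
}
```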
-------------- next part --------------
A non-text attachment was scrubbed...
Name: D80649.266606.patch
Type: text/x-patch
Size: 14718 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/openmp-commits/attachments/20200527/7a5f704d/attachment-0001.bin>