[PATCH] D52434: [OpenMP] Make default distribute schedule for NVPTX target regions in SPMD mode achieve coalescing
Jonas Hahnfeld via Phabricator via cfe-commits
cfe-commits at lists.llvm.org
Sat Sep 29 04:44:30 PDT 2018
Hahnfeld added a comment.
In https://reviews.llvm.org/D52434#1249399, @gtbercea wrote:
> In https://reviews.llvm.org/D52434#1249186, @Hahnfeld wrote:
>
> > In https://reviews.llvm.org/D52434#1249102, @gtbercea wrote:
> >
> > > You report a slowdown which I am actually not able to reproduce. Are you using any additional clauses not present in your previous post?
> >
> >
> > No, only `dist_schedule(static)`, which is faster. Tested on a `Tesla P100` with today's trunk version:
> >
> > | `#pragma omp target teams distribute parallel for` (new defaults)         | 190 - 250 GB/s |
> > | adding clauses for old defaults: `schedule(static) dist_schedule(static)` | 30 - 50 GB/s   |
> > | same directive with only `dist_schedule(static)` added (fewer registers)  | 320 - 400 GB/s |
>
>
> Which loop size are you using? What runtime does nvprof report for these kernels?
Sorry, forgot to mention: I'm using the original STREAM code with 80,000,000 `double` elements in each vector.
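For reference, the Triad kernel in this setup is essentially the following (a minimal sketch, not the original STREAM source; initialization, timing, and map clauses are simplified, and the schedule variants from the table above are noted in the comment):

  #include <omp.h>
  #include <stdio.h>

  #define N 80000000              /* 80,000,000 double elements per vector */
  static double a[N], b[N], c[N];

  int main(void) {
    const double scalar = 3.0;
    for (long i = 0; i < N; i++) { b[i] = 2.0; c[i] = 0.5; }

    double t = omp_get_wtime();
    /* Measured variants:
       - new defaults: no schedule clauses on the directive
       - old defaults: schedule(static) dist_schedule(static)
       - fastest here: only dist_schedule(static), as written below */
  #pragma omp target teams distribute parallel for \
      map(to: b, c) map(from: a) dist_schedule(static)
    for (long i = 0; i < N; i++)
      a[i] = b[i] + scalar * c[i];
    t = omp_get_wtime() - t;

    /* 3 vectors of N doubles are moved per iteration */
    printf("Triad: %.3f ms, %.1f MB/s\n",
           1e3 * t, 3.0 * sizeof(double) * N / (1e6 * t));
    return 0;
  }

Built with the usual offloading flags, something like `-fopenmp -fopenmp-targets=nvptx64-nvidia-cuda`.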
Output from `nvprof`:
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   70.05%  676.71ms      9  75.191ms  1.3760us  248.09ms  [CUDA memcpy DtoH]
                    7.67%  74.102ms     10  7.4102ms  7.3948ms  7.4220ms  __omp_offloading_34_b871a7d5_main_l307
                    7.63%  73.679ms     10  7.3679ms  7.3457ms  7.3811ms  __omp_offloading_34_b871a7d5_main_l301
                    6.78%  65.516ms     10  6.5516ms  6.5382ms  6.5763ms  __omp_offloading_34_b871a7d5_main_l295
                    6.77%  65.399ms     10  6.5399ms  6.5319ms  6.5495ms  __omp_offloading_34_b871a7d5_main_l289
                    0.68%  6.6106ms      1  6.6106ms  6.6106ms  6.6106ms  __omp_offloading_34_b871a7d5_main_l264
                    0.41%  3.9659ms      1  3.9659ms  3.9659ms  3.9659ms  __omp_offloading_34_b871a7d5_main_l245
                    0.00%  1.1200us      1  1.1200us  1.1200us  1.1200us  [CUDA memcpy HtoD]
      API calls:   51.12%  678.90ms      9  75.434ms  24.859us  248.70ms  cuMemcpyDtoH
                   22.40%  297.51ms     42  7.0835ms  4.0042ms  7.6802ms  cuCtxSynchronize
                   20.31%  269.72ms      1  269.72ms  269.72ms  269.72ms  cuCtxCreate
                    5.32%  70.631ms      1  70.631ms  70.631ms  70.631ms  cuCtxDestroy
                    0.46%  6.1607ms      1  6.1607ms  6.1607ms  6.1607ms  cuModuleLoadDataEx
                    0.28%  3.7628ms      1  3.7628ms  3.7628ms  3.7628ms  cuModuleUnload
                    0.10%  1.2977ms     42  30.898us  13.930us  60.092us  cuLaunchKernel
                    0.00%  56.142us     42  1.3360us     677ns  2.0930us  cuFuncGetAttribute
                    0.00%  43.957us     46     955ns     454ns  1.7670us  cuCtxSetCurrent
                    0.00%  15.179us      1  15.179us  15.179us  15.179us  cuMemcpyHtoD
                    0.00%  7.2780us     10     727ns     358ns  1.4760us  cuModuleGetGlobal
                    0.00%  6.9910us      2  3.4950us  2.2660us  4.7250us  cuDeviceGetPCIBusId
                    0.00%  5.7500us      6     958ns     333ns  3.5270us  cuModuleGetFunction
                    0.00%  3.7530us      9     417ns     184ns  1.0850us  cuDeviceGetAttribute
                    0.00%  2.6790us      3     893ns     370ns  1.9300us  cuDeviceGetCount
                    0.00%  2.0090us      3     669ns     484ns     767ns  cuDeviceGet
The memcpy comes from a `target update` to verify the results on the host. It's not included in the measurement itself, so STREAM only evaluates the kernel execution time:
Function    Best Rate MB/s  Avg time  Min time  Max time
Copy:             190819.6  0.006781  0.006708  0.006841
Scale:            189065.7  0.006800  0.006770  0.006831
Add:              253831.7  0.007616  0.007564  0.007646
Triad:            253432.3  0.007668  0.007576  0.007737
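(STREAM reports times in seconds.) The rates are consistent with the profile: Copy moves 2 x 640 MB = 1.28 GB per iteration, so 190,820 MB/s corresponds to the 6.7 ms minimum time, in line with the kernel durations nvprof shows. The verification transfer that dominates the profile is roughly of this shape, continuing the sketch above (`check_results` is a hypothetical stand-in for STREAM's error check):

    /* Outside the timed region: copy the device results back for checking.
       Three 640 MB vectors account for the large [CUDA memcpy DtoH]
       entries in the profile. */
    #pragma omp target update from(a, b, c)
    check_results(a, b, c);   /* hypothetical verification routine */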
Repository:
rC Clang
https://reviews.llvm.org/D52434