[PATCH] D52434: [OpenMP] Make default distribute schedule for NVPTX target regions in SPMD mode achieve coalescing
Jonas Hahnfeld via Phabricator via cfe-commits
cfe-commits at lists.llvm.org
Sat Sep 29 04:44:30 PDT 2018
Hahnfeld added a comment.
In https://reviews.llvm.org/D52434#1249399, @gtbercea wrote:
> In https://reviews.llvm.org/D52434#1249186, @Hahnfeld wrote:
>
> > In https://reviews.llvm.org/D52434#1249102, @gtbercea wrote:
> >
> > > You report a slowdown which I am actually not able to reproduce. Are you using any additional clauses not present in your previous post?
> >
> >
> > No, only `dist_schedule(static)`, which is faster. Tested on a `Tesla P100` with today's trunk version:
> >
> > | `#pragma omp target teams distribute parallel for` (new defaults)         | 190 - 250 GB/s |
> > | adding clauses for old defaults: `schedule(static) dist_schedule(static)` | 30 - 50 GB/s   |
> > | same directive with only `dist_schedule(static)` added (fewer registers)  | 320 - 400 GB/s |
>
>
> Which loop size are you using? What runtime does nvprof report for these kernels?
Sorry, forgot to mention: I'm using the original STREAM code with 80,000,000 `double` elements in each vector.
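For reference, the Triad kernel in this setup is essentially the following (a minimal sketch, not the original STREAM source; initialization, timing, and map clauses are simplified, and the schedule variants from the table above are noted in the comment):

  #include <omp.h>
  #include <stdio.h>

  #define N 80000000              /* 80,000,000 double elements per vector */
  static double a[N], b[N], c[N];

  int main(void) {
    const double scalar = 3.0;
    for (long i = 0; i < N; i++) { b[i] = 2.0; c[i] = 0.5; }

    double t = omp_get_wtime();
    /* Measured variants:
       - new defaults: no schedule clauses on the directive
       - old defaults: schedule(static) dist_schedule(static)
       - fastest here: only dist_schedule(static), as written below */
  #pragma omp target teams distribute parallel for \
      map(to: b, c) map(from: a) dist_schedule(static)
    for (long i = 0; i < N; i++)
      a[i] = b[i] + scalar * c[i];
    t = omp_get_wtime() - t;

    /* 3 vectors of N doubles are moved per iteration */
    printf("Triad: %.3f ms, %.1f MB/s\n",
           1e3 * t, 3.0 * sizeof(double) * N / (1e6 * t));
    return 0;
  }

Built with the usual offloading flags, something like `-fopenmp -fopenmp-targets=nvptx64-nvidia-cuda`.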
Output from `nvprof`:
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   70.05%  676.71ms      9  75.191ms  1.3760us  248.09ms  [CUDA memcpy DtoH]
                    7.67%  74.102ms     10  7.4102ms  7.3948ms  7.4220ms  __omp_offloading_34_b871a7d5_main_l307
                    7.63%  73.679ms     10  7.3679ms  7.3457ms  7.3811ms  __omp_offloading_34_b871a7d5_main_l301
                    6.78%  65.516ms     10  6.5516ms  6.5382ms  6.5763ms  __omp_offloading_34_b871a7d5_main_l295
                    6.77%  65.399ms     10  6.5399ms  6.5319ms  6.5495ms  __omp_offloading_34_b871a7d5_main_l289
                    0.68%  6.6106ms      1  6.6106ms  6.6106ms  6.6106ms  __omp_offloading_34_b871a7d5_main_l264
                    0.41%  3.9659ms      1  3.9659ms  3.9659ms  3.9659ms  __omp_offloading_34_b871a7d5_main_l245
                    0.00%  1.1200us      1  1.1200us  1.1200us  1.1200us  [CUDA memcpy HtoD]
      API calls:   51.12%  678.90ms      9  75.434ms  24.859us  248.70ms  cuMemcpyDtoH
                   22.40%  297.51ms     42  7.0835ms  4.0042ms  7.6802ms  cuCtxSynchronize
                   20.31%  269.72ms      1  269.72ms  269.72ms  269.72ms  cuCtxCreate
                    5.32%  70.631ms      1  70.631ms  70.631ms  70.631ms  cuCtxDestroy
                    0.46%  6.1607ms      1  6.1607ms  6.1607ms  6.1607ms  cuModuleLoadDataEx
                    0.28%  3.7628ms      1  3.7628ms  3.7628ms  3.7628ms  cuModuleUnload
                    0.10%  1.2977ms     42  30.898us  13.930us  60.092us  cuLaunchKernel
                    0.00%  56.142us     42  1.3360us     677ns  2.0930us  cuFuncGetAttribute
                    0.00%  43.957us     46     955ns     454ns  1.7670us  cuCtxSetCurrent
                    0.00%  15.179us      1  15.179us  15.179us  15.179us  cuMemcpyHtoD
                    0.00%  7.2780us     10     727ns     358ns  1.4760us  cuModuleGetGlobal
                    0.00%  6.9910us      2  3.4950us  2.2660us  4.7250us  cuDeviceGetPCIBusId
                    0.00%  5.7500us      6     958ns     333ns  3.5270us  cuModuleGetFunction
                    0.00%  3.7530us      9     417ns     184ns  1.0850us  cuDeviceGetAttribute
                    0.00%  2.6790us      3     893ns     370ns  1.9300us  cuDeviceGetCount
                    0.00%  2.0090us      3     669ns     484ns     767ns  cuDeviceGet
The memcpy comes from a `target update` to verify the results on the host. It's not included in the measurement itself, so STREAM only evaluates the kernel execution time:
Function    Best Rate MB/s  Avg time  Min time  Max time
Copy:             190819.6  0.006781  0.006708  0.006841
Scale:            189065.7  0.006800  0.006770  0.006831
Add:              253831.7  0.007616  0.007564  0.007646
Triad:            253432.3  0.007668  0.007576  0.007737
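(STREAM reports times in seconds.) The rates are consistent with the profile: Copy moves 2 x 640 MB = 1.28 GB per iteration, so 190,820 MB/s corresponds to the 6.7 ms minimum time, in line with the kernel durations nvprof shows. The verification transfer that dominates the profile is roughly of this shape, continuing the sketch above (`check_results` is a hypothetical stand-in for STREAM's error check):

    /* Outside the timed region: copy the device results back for checking.
       Three 640 MB vectors account for the large [CUDA memcpy DtoH]
       entries in the profile. */
    #pragma omp target update from(a, b, c)
    check_results(a, b, c);   /* hypothetical verification routine */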
Repository:
rC Clang
https://reviews.llvm.org/D52434