[Openmp-commits] [PATCH] D74145: [OpenMP][Offloading] Added support for multiple streams so that multiple kernels can be executed concurrently

Sat Feb 8 13:45:14 PST 2020

tianshilei1992 added a comment.

I ran a small experiment to measure the performance improvement. Here is the microbenchmark:

  #include <math.h>
  #include <stddef.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

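  void kernel() {
    // Number of iterations of the host-side parallel loop below.
    const int num_threads = 64;

    // Each iteration offloads one target region, so several host threads
    // can launch kernels at the same time.
  #pragma omp parallel for
    for (int i = 0; i < num_threads; ++i) {
      const size_t N = 1UL << 10;

      // The inner index i shadows the host loop index.
  #pragma omp target teams distribute parallel for
      for (size_t i = 0; i < N; ++i) {
        for (size_t j = 0; j < N / 2; ++j) {
          // Busy work for the device; the result is not used.
          float x = sqrt(pow(3.14159, j));
        }
      }
    }
  }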

  int main(int argc, char *argv[]) {
    const int N = 1000;

    const clock_t start = clock();

    for (int i = 0; i < N; ++i) {
      kernel();
    }

    const clock_t duration = (clock() - start) * 1000 / CLOCKS_PER_SEC / N;

    printf("Avg time: %ld ms\n", duration);

    return 0;
  }
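
One caveat about the timing: clock() reports processor time used by the program, and on common implementations that is summed over all host threads, so with a host-side parallel loop the printed figure may differ from elapsed wall-clock time. A minimal sketch of a wall-clock variant, assuming a POSIX system where clock_gettime and CLOCK_MONOTONIC are available; elapsed_ms and avg_wall_ms are hypothetical helper names, not part of the benchmark above:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* expose clock_gettime under strict -std=c99/c11 */
#endif
#include <time.h>

/* Hypothetical helper: milliseconds between two CLOCK_MONOTONIC samples. */
static double elapsed_ms(struct timespec start, struct timespec end) {
  return (double)(end.tv_sec - start.tv_sec) * 1e3 +
         (double)(end.tv_nsec - start.tv_nsec) / 1e6;
}

/* Hypothetical helper: average wall-clock time per call of fn, in ms. */
static double avg_wall_ms(void (*fn)(void), int runs) {
  struct timespec start, end;

  clock_gettime(CLOCK_MONOTONIC, &start);
  for (int i = 0; i < runs; ++i) {
    fn();
  }
  clock_gettime(CLOCK_MONOTONIC, &end);

  return elapsed_ms(start, end) / runs;
}
```

Under these assumptions, calling avg_wall_ms(kernel, 1000) in main in place of the clock() arithmetic would print the average elapsed time per call rather than the average processor time.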

The execution result with multiple streams is:

  $ /usr/local/cuda/bin/nvprof --output-profile parallel_offloading_ms.prof -f ./parallel_offloading
  ==32397== NVPROF is profiling process 32397, command: ./parallel_offloading
  Avg time: 1081 ms
  ==32397== Generated result file: /home/shiltian/Documents/project/multiple_streams/tests/multistreams/parallel_offloading_ms.prof

And the result without multiple streams is:

  $ /usr/local/cuda/bin/nvprof --output-profile parallel_offloading.prof -f ./parallel_offloading
  ==35547== NVPROF is profiling process 35547, command: ./parallel_offloading
  Avg time: 5825 ms
  ==35547== Generated result file: /home/shiltian/Documents/project/multiple_streams/tests/multistreams/parallel_offloading.prof

The average time per call drops from 5825 ms to 1081 ms, a speedup of approximately 5.4x.
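
As a quick check of the arithmetic, 5825 / 1081 is about 5.39. A minimal sketch, with speedup as a hypothetical helper name:

```c
/* Hypothetical helper: ratio of the baseline time to the improved time. */
static double speedup(double baseline_ms, double improved_ms) {
  return baseline_ms / improved_ms;
}
```

Here the baseline is the single-stream run (5825 ms) and the improved time is the multiple-stream run (1081 ms).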

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D74145/new/

https://reviews.llvm.org/D74145