[Openmp-commits] [PATCH] D74145: [OpenMP][Offloading] Added support for multiple streams so that multiple kernels can be executed concurrently

Johannes Doerfert via Phabricator via Openmp-commits openmp-commits at lists.llvm.org
Mon Feb 10 08:16:05 PST 2020

jdoerfert added inline comments.

Comment at: openmp/libomptarget/plugins/cuda/src/rtl.cpp:246
+    // By default let's create 32 streams per device
+    EnvNumStreams = 32;
+    envStr = getenv("LIBOMPTARGET_NUM_STREAMS");
ye-luo wrote:
> jdoerfert wrote:
> > ye-luo wrote:
> > > tianshilei1992 wrote:
> > > > jdoerfert wrote:
> > > > > The hardware will cap the number internally anyway so we should go higher here. Maybe 256?
> > > > Sure
> > > I don't like this choice. The hardware limit is 32 which is preferred. Users can play with environment variable if they need more.
> > > In nvprof, it is impossible to digest 256 streams from OpenMP plus the application's other streams.
> > @ye-luo Do you experience a downside to 256 streams?
> > 
> > There should not be a performance problem but it should help us to be future and backwards compatible. 
> I don't have strong evidence about performance impact. I thought more streams would cost the driver a bit more to monitor and to schedule work onto the hardware.
I would expect, or maybe hope, that the driver just does the modulo internally. There is no point in tracking more than the number of hardware streams, so why would they? To that end they can just do `hw_stream = user_stream % num_hw_streams`, which would make sense because it is portable (i.e., backwards and future compatible).
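
To make the idea concrete, here is a minimal sketch of what I have in mind. It is not the patch and not what the CUDA driver is known to do; `parseNumStreams` and `mapToHwStream` are hypothetical names, and the modulo mapping is only my guess at the driver's behavior.

```cpp
#include <cassert>
#include <climits>
#include <cstdlib>

// Hypothetical helper: turn the value of an environment variable such as
// LIBOMPTARGET_NUM_STREAMS into a stream count, falling back to Default
// when the variable is unset, malformed, or not a positive int.
int parseNumStreams(const char *EnvStr, int Default) {
  if (!EnvStr)
    return Default;
  char *End = nullptr;
  long Value = std::strtol(EnvStr, &End, 10);
  if (End == EnvStr || *End != '\0' || Value <= 0 || Value > INT_MAX)
    return Default;
  return static_cast<int>(Value);
}

// Assumed driver-side behavior (a guess, not documented): fold any number of
// user-visible streams onto the available hardware streams with a modulo.
int mapToHwStream(int UserStream, int NumHwStreams) {
  assert(UserStream >= 0 && NumHwStreams > 0);
  return UserStream % NumHwStreams;
}
```

With 32 hardware streams, `mapToHwStream(33, 32)` lands on hardware stream 1, so creating 256 user streams would cost nothing extra if the driver really behaves this way, and `parseNumStreams(std::getenv("LIBOMPTARGET_NUM_STREAMS"), 256)` would still let users pick a smaller number.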


