[Openmp-dev] questions about loop scheduling

yongkee kwon via Openmp-dev openmp-dev at lists.llvm.org
Sat Jan 12 18:50:59 PST 2019


Hi everyone,

I'd like to ask a couple of questions that came up while I was working
on performance tracing with various loop schedules.

1. For a static/chunked schedule on a decreasing loop, I've observed
something strange with kmp(c)_for_static_init(). I checked the loop's
lower and upper bounds, the chunk size, and the loop increment at the
beginning and end of kmp(c)_for_static_init by enabling debug output
(KMP_D_DEBUG), and they looked as if I had set up an increasing loop,
even though the loop itself ran exactly as I expected.

To be more specific, here is a very simple example:

omp_set_num_threads(4);
#pragma omp parallel for schedule(static, 4)
for (int i = 31; i >= 0; i--) { LOOP_BODY; }

In this case, thread0 would get the chunks of iterations 31 down to 28
and 15 down to 12, thread1 the chunks 27 down to 24 and 11 down to 8, and
so forth. However, I saw that kmp(c)_for_static_init for thread0 returned
(LB:0, UB:3, ST:16), as if it were an increasing loop starting at 0, so
either I am misreading the output, or there must be a way for the runtime
to map a chunk with LB 0 back to the loop iteration with index 31.
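
For what it's worth, my current guess (please correct me if this is
wrong) is that these values describe a normalized, zero-based logical
iteration space. Here is a minimal sketch of how thread0's (LB:0, UB:3,
ST:16) would then enumerate its two chunks; this is my own illustration,
not code taken from the runtime:

/* Sketch only: walk thread0's chunks in the normalized space and map
 * each logical iteration back to the original decreasing index. */
for (int lb = 0; lb <= 31; lb += 16) {       /* ST jumps to this thread's next chunk */
    for (int iv = lb; iv <= lb + 3; iv++) {  /* UB - LB + 1 = 4 iterations per chunk */
        int i = 31 - iv;  /* iv 0..3 -> i 31..28, iv 16..19 -> i 15..12 */
        /* LOOP_BODY with index i */
    }
}

That would reproduce exactly the chunk assignment I expected above, so
the numbers may be consistent after all, just expressed in a different
space.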

The reason I care about this is that I'd like to track each work chunk
while scheduling chunks arbitrarily, and I was considering using LB (the
lower bound) as a unique ID for each work chunk. In this case, however,
I don't know how to extract the correct bound associated with the loop
iteration numbers of a scheduled chunk. I am also looking into the Clang
codegen implementation for OpenMP to see what's going on there, but I
feel a bit lost.
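
Concretely, what I'm hoping exists is a mapping like the following (a
hypothetical helper of my own, generalizing from the example above;
'init' and 'incr' stand for the original loop's start value and
increment, and are not runtime names):

/* Hypothetical: recover the original loop index from a normalized
 * logical iteration number. For the example, init = 31 and incr = -1,
 * so a chunk with normalized LB 0 maps back to index 31. */
static inline int original_index(int init, int incr, int logical_iv) {
    return init + logical_iv * incr;
}

If this is right, the normalized LB could still serve as a stable chunk
ID, as long as I also record init and incr once per loop.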

Please help me find what I missed, and correct me if I got anything wrong.

2. To measure the overhead of dynamic/chunked scheduling, I've added
callbacks at the beginning and end of __kmp_dispatch_next_algorithm() in
kmp_dispatch.cpp and measured the elapsed time in between. I was told
dynamic scheduling could be fairly expensive, so I expected contention
around the (conceptually) centralized work queue to get worse as the
number of chunks and threads grows, as discussed in [REF1].
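
For reference, a minimal standalone cross-check of the same question
might look like the toy benchmark below (my own code, separate from the
in-runtime callbacks; it only compares end-to-end times of the two
schedules rather than isolating the dispatch path):

#include <omp.h>
#include <stdio.h>

#define N (1 << 24)

int main(void) {
    long sink = 0;
    double t0, t_static, t_dynamic;

    t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static, 64) reduction(+:sink)
    for (long i = 0; i < N; i++) sink += i;
    t_static = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    #pragma omp parallel for schedule(dynamic, 64) reduction(+:sink)
    for (long i = 0; i < N; i++) sink += i;
    t_dynamic = omp_get_wtime() - t0;

    printf("static: %g s, dynamic: %g s (sink=%ld)\n",
           t_static, t_dynamic, sink);
    return 0;
}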

However, the overhead seems quite acceptable with a moderate number of
threads (16-64) and chunks per thread (16-64), unless I try an extreme
case such as a chunk size of 1. I took a quick look at the
implementation, and this seems to make sense: what we pay for dynamic
scheduling, compared to static scheduling, is an atomic operation
(test_then_inc_acq), and no centralized queue is actually materialized.
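
In other words, my mental model of the dynamic-chunked fast path is
roughly the sketch below (conceptual only, written by me; it is not the
actual kmp_dispatch.cpp code, and the names are my own):

#include <stdatomic.h>
#include <stdbool.h>

/* Conceptual sketch: each thread claims the next chunk with a single
 * atomic fetch-and-add on a shared counter, so there is no queue
 * structure to contend on. */
static atomic_long next_iter;  /* first unclaimed logical iteration */
static long trip_count;        /* total number of iterations        */
static long chunk;             /* iterations handed out per claim   */

/* Returns true and fills [*lb, *ub] if the caller got a chunk. */
static bool dispatch_next(long *lb, long *ub) {
    long start = atomic_fetch_add(&next_iter, chunk);  /* the only synchronization */
    if (start >= trip_count)
        return false;          /* all work has been claimed */
    *lb = start;
    long end = start + chunk - 1;
    *ub = end < trip_count - 1 ? end : trip_count - 1;
    return true;
}

If that model is right, the cost per chunk is a single contended atomic
add, which would explain why the overhead only becomes visible with very
small chunks (e.g. a chunk size of 1, where the atomic runs once per
iteration).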

I'd just like to double-check that what I've observed makes sense to
those who have contributed to performance tuning of the runtime
implementation. Any comments from your experience would be greatly
appreciated!

FYI, I am working with Clang 7 and LLVM OpenMP 7.0.


[REF1] OpenMP task scheduling strategies for multicore NUMA systems