<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - [OpenMP] Teams distribute parallel for nested inside parallel for"
href="https://bugs.llvm.org/show_bug.cgi?id=48330">48330</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>[OpenMP] Teams distribute parallel for nested inside parallel for
</td>
</tr>
<tr>
<th>Product</th>
<td>OpenMP
</td>
</tr>
<tr>
<th>Version</th>
<td>unspecified
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>enhancement
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>Runtime Library
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>rofirrim@gmail.com
</td>
</tr>
<tr>
<th>CC</th>
<td>llvm-bugs@lists.llvm.org
</td>
</tr></table>
<div>
<pre>Created <a href="attachment.cgi?id=24219" name="attach_24219" title="Testcase">attachment 24219</a>: Testcase

Hi all,
the testcase below (based on offloading/parallel_offloading_map.cpp) does not
seem to distribute all the iterations of the innermost loop. I would expect
all the elements of the array `tmp` to be updated; however, we seem to
distribute only a single chunk of the iterations because the runtime believes
it has more threads than it actually has.

I can reproduce this with libomptarget.rtl.x86_64.so (running on the host).
Running the testcase with OMP_NUM_THREADS=1 works correctly.

Running with OMP_NUM_THREADS=2 I obtain this (for each iteration of the
outer loop):
[TARGET][0] || tmp[0] <- 1
[TARGET][0] || tmp[1] <- 1
[TARGET][0] || tmp[2] <- 1
[TARGET][0] || tmp[3] <- 1
Error at tmp[4]
Error at tmp[5]
Error at tmp[6]
Error at tmp[7]

Running with OMP_NUM_THREADS=4 I obtain this:
[TARGET][0] || tmp[0] <- 1
[TARGET][0] || tmp[1] <- 1
Error at tmp[2]
Error at tmp[3]
Error at tmp[4]
Error at tmp[5]
Error at tmp[6]
Error at tmp[7]

And so on. My expectation is that all M iterations of the inner loop should
be executed.

I tried to debug this a bit, and I'm not sure I understand all of it. So far
I see that __kmpc_fork_teams invokes __kmp_fork_call, which decides that
nthreads is going to be 1:
  if (parent_team->t.t_active_level >=
      master_th->th.th_current_task->td_icvs.max_active_levels) {
    nthreads = 1;
  } else {
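
This matches my understanding of the max-active-levels ICV: with its default
value, a nested parallel region is serialized. Here is a minimal host-only
sketch (mine, not part of the testcase; it assumes default ICVs) showing the
inner team collapsing to one thread:

// -- nested.cpp (illustrative sketch, not part of the testcase)
#include <cstdio>
#include <omp.h>

int main() {
#pragma omp parallel num_threads(2)
  {
    // With the default max-active-levels ICV this inner region is
    // inactive (serialized): each outer thread gets an inner team of
    // one thread, mirroring the "nthreads = 1" decision quoted above.
#pragma omp parallel num_threads(4)
    {
      printf("inner team size: %d\n", omp_get_num_threads());
    }
  }
  return 0;
}
// -- end of nested.cpp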

Then __kmp_invoke_teams_master → __kmp_teams_master → __kmp_fork_call, which
again sets nthreads to 1 (for the same reason). Now we go through the
serialized-parallel code path of __kmp_fork_call, and this time we eventually
invoke the microtask. The microtask eventually invokes __kmpc_for_static_init_4
with `*plower == 0` and `*pupper == 7`, which seems correct. However, when
computing the chunk we are confused by the fact that team->t.t_nproc is not 1.
We seem to be looking at the parent team because this is a distribute schedule:
  if (schedtype > kmp_ord_upper) {
    // we are in DISTRIBUTE construct
    schedtype += kmp_sch_static -
                 kmp_distribute_static; // AC: convert to usual schedule type
    tid = th->th.th_team->t.t_master_tid;
    team = th->th.th_team->t.t_parent; // this team was the one available
  } else {

And now we compute a smaller chunk even though, apparently, we will execute
with a single thread. I am not sure at what point we got the number of
threads wrong.
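
To see how this explains the output above, here is a tiny sketch (mine; the
helper is hypothetical and it ignores the remainder handling the real
__kmpc_for_static_init_4 performs) of the static-schedule arithmetic when the
parent team's size leaks in:

// -- chunk.cpp (illustrative sketch, not part of the testcase)
#include <cstdio>

// Split [lower, upper] evenly among nproc threads and return the
// bounds that thread `tid` would execute under a static schedule.
static void static_bounds(int lower, int upper, int nproc, int tid,
                          int *lo, int *hi) {
  int chunk = (upper - lower + 1) / nproc;
  *lo = lower + tid * chunk;
  *hi = *lo + chunk - 1;
}

int main() {
  // If t_nproc comes from the parent team (2 or 4) while only one
  // thread actually runs, thread 0 covers just its own chunk of
  // [0, 7]: nproc=2 -> [0, 3], nproc=4 -> [0, 1], exactly the
  // truncated updates observed with OMP_NUM_THREADS=2 and 4.
  const int nprocs[] = {1, 2, 4};
  for (int k = 0; k < 3; ++k) {
    int lo, hi;
    static_bounds(0, 7, nprocs[k], 0, &lo, &hi);
    printf("nproc=%d -> thread 0 executes [%d, %d]\n", nprocs[k], lo, hi);
  }
  return 0;
}
// -- end of chunk.cpp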

I'm using the following command line against a standalone build of openmp
(based on the flags of the mentioned lit test):
clang++ -O0 -g -fno-experimental-isel -fopenmp -pthread \
-I <top-llvm-srcdir>/openmp/libomptarget/test \
-I <openmp-builddir>/libomptarget/../runtime/src \
-L <openmp-builddir>/libomptarget \
-L <openmp-builddir>/libomptarget/../runtime/src \
-fopenmp-targets=x86_64-pc-linux-gnu t.cpp -o t \
-Wl,-rpath,<openmp-builddir>/libomptarget/../runtime/src
OMP_NUM_THREADS=2 ./t

Kind regards,

// -- t.cpp
#include "omp.h"
#include <cassert>
#include <cstdio>

int main(int argc, char *argv[]) {
  constexpr const int N = 4, M = 8;
  bool error = false;
#pragma omp parallel for
  for (int i = 0; i < N; ++i) { // outer loop
    int tmp[M] = {0};
    // This critical is optional; it only helps debugging and can be removed.
#pragma omp critical
    {
#pragma omp target teams distribute parallel for map(tofrom : tmp)
      for (int j = 0; j < M; ++j) { // inner loop: expect all M iterations
        printf("[TARGET][%d] || tmp[%d] <- 1\n", omp_get_thread_num(), j);
        tmp[j] += 1;
      }
      // Check that every element was updated exactly once.
      for (int j = 0; j < M; ++j) {
        if (tmp[j] != 1) {
          printf("Error at tmp[%d]\n", j);
          error = true;
        }
      }
    } // critical
  }
  printf("%s\n", error ? "ERROR" : "PASS");
  return 0;
}
// -- end of t.cpp</pre>
</div>
</body>
</html>