<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - [OpenMP] Teams distribute parallel for nested inside parallel for"
href="https://bugs.llvm.org/show_bug.cgi?id=48330">48330</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>[OpenMP] Teams distribute parallel for nested inside parallel for
</td>
</tr>
<tr>
<th>Product</th>
<td>OpenMP
</td>
</tr>
<tr>
<th>Version</th>
<td>unspecified
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>enhancement
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>Runtime Library
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>rofirrim@gmail.com
</td>
</tr>
<tr>
<th>CC</th>
<td>llvm-bugs@lists.llvm.org
</td>
</tr></table>
<div>
<pre>Created <a href="attachment.cgi?id=24219" name="attach_24219" title="Testcase">attachment 24219</a>: Testcase

Hi all,
the testcase below (based on offloading/parallel_offloading_map.cpp) does not
seem to distribute all the iterations of the innermost loop. I would expect
all the elements of the array `tmp` to be updated; however, we seem to
distribute only a single chunk of the iterations because the runtime believes
it has more threads than it actually has.

I can reproduce this with libomptarget.rtl.x86_64.so (running on the host).
Running the testcase with OMP_NUM_THREADS=1 works correctly.

Running with OMP_NUM_THREADS=2 I obtain this (for each iteration of the
outer loop):
[TARGET][0] || tmp[0] <- 1
[TARGET][0] || tmp[1] <- 1
[TARGET][0] || tmp[2] <- 1
[TARGET][0] || tmp[3] <- 1
Error at tmp[4]
Error at tmp[5]
Error at tmp[6]
Error at tmp[7]

Running with OMP_NUM_THREADS=4 I obtain this:
[TARGET][0] || tmp[0] <- 1
[TARGET][0] || tmp[1] <- 1
Error at tmp[2]
Error at tmp[3]
Error at tmp[4]
Error at tmp[5]
Error at tmp[6]
Error at tmp[7]

And so on. My expectation is that all M iterations of the inner loop should
be executed.

I tried to debug this a bit, and I'm not sure I understand all of it. So far
I see that __kmpc_fork_teams invokes __kmp_fork_call, which decides that
nthreads is going to be 1:
  if (parent_team->t.t_active_level >=
      master_th->th.th_current_task->td_icvs.max_active_levels) {
    nthreads = 1;
  } else {
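
This matches my understanding of the max-active-levels ICV: with its default
value, a nested parallel region is serialized. Here is a minimal host-only
sketch (mine, not part of the testcase; it assumes default ICVs) showing the
inner team collapsing to one thread:

// -- nested.cpp (illustrative sketch, not part of the testcase)
#include <cstdio>
#include <omp.h>

int main() {
#pragma omp parallel num_threads(2)
  {
    // With the default max-active-levels ICV this inner region is
    // inactive (serialized): each outer thread gets an inner team of
    // one thread, mirroring the "nthreads = 1" decision quoted above.
#pragma omp parallel num_threads(4)
    {
      printf("inner team size: %d\n", omp_get_num_threads());
    }
  }
  return 0;
}
// -- end of nested.cpp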

Then __kmp_invoke_teams_master → __kmp_teams_master → __kmp_fork_call, which
again sets nthreads to 1 (for the same reason). Now we go through the
serialized-parallel code path of __kmp_fork_call, and this time we eventually
invoke the microtask. The microtask eventually invokes __kmpc_for_static_init_4
with `*plower == 0` and `*pupper == 7`, which seems correct. However, when
computing the chunk we are confused by the fact that team->t.t_nproc is not 1.
We seem to be looking at the parent team because this is a distribute schedule:
  if (schedtype > kmp_ord_upper) {
    // we are in DISTRIBUTE construct
    schedtype += kmp_sch_static -
                 kmp_distribute_static; // AC: convert to usual schedule type
    tid = th->th.th_team->t.t_master_tid;
    team = th->th.th_team->t.t_parent; // this team was the one available
  } else {

And now we compute a smaller chunk even though, apparently, we will execute
with a single thread. I am not sure at what point we got the number of
threads wrong.
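
To see how this explains the output above, here is a tiny sketch (mine; the
helper is hypothetical and it ignores the remainder handling the real
__kmpc_for_static_init_4 performs) of the static-schedule arithmetic when the
parent team's size leaks in:

// -- chunk.cpp (illustrative sketch, not part of the testcase)
#include <cstdio>

// Split [lower, upper] evenly among nproc threads and return the
// bounds that thread `tid` would execute under a static schedule.
static void static_bounds(int lower, int upper, int nproc, int tid,
                          int *lo, int *hi) {
  int chunk = (upper - lower + 1) / nproc;
  *lo = lower + tid * chunk;
  *hi = *lo + chunk - 1;
}

int main() {
  // If t_nproc comes from the parent team (2 or 4) while only one
  // thread actually runs, thread 0 covers just its own chunk of
  // [0, 7]: nproc=2 -> [0, 3], nproc=4 -> [0, 1], exactly the
  // truncated updates observed with OMP_NUM_THREADS=2 and 4.
  const int nprocs[] = {1, 2, 4};
  for (int k = 0; k < 3; ++k) {
    int lo, hi;
    static_bounds(0, 7, nprocs[k], 0, &lo, &hi);
    printf("nproc=%d -> thread 0 executes [%d, %d]\n", nprocs[k], lo, hi);
  }
  return 0;
}
// -- end of chunk.cpp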

I'm using the following command line against a standalone build of openmp
(based on the flags of the mentioned lit test):
clang++ -O0 -g -fno-experimental-isel -fopenmp -pthread \
-I <top-llvm-srcdir>/openmp/libomptarget/test \
-I <openmp-builddir>/libomptarget/../runtime/src \
-L <openmp-builddir>/libomptarget \
-L <openmp-builddir>/libomptarget/../runtime/src \
-fopenmp-targets=x86_64-pc-linux-gnu t.cpp -o t \
-Wl,-rpath,<openmp-builddir>/libomptarget/../runtime/src
OMP_NUM_THREADS=2 ./t

Kind regards,

// -- t.cpp
#include "omp.h"
#include <cassert>
#include <cstdio>

int main(int argc, char *argv[]) {
  constexpr const int N = 4, M = 8;
  bool error = false;
#pragma omp parallel for
  for (int i = 0; i < N; ++i) { // outer loop
    int tmp[M] = {0};
    // This critical is optional; it only helps debugging and can be removed.
#pragma omp critical
    {
#pragma omp target teams distribute parallel for map(tofrom : tmp)
      for (int j = 0; j < M; ++j) { // inner loop: expect all M iterations
        printf("[TARGET][%d] || tmp[%d] <- 1\n", omp_get_thread_num(), j);
        tmp[j] += 1;
      }
      // Check that every element was updated exactly once.
      for (int j = 0; j < M; ++j) {
        if (tmp[j] != 1) {
          printf("Error at tmp[%d]\n", j);
          error = true;
        }
      }
    } // critical
  }
  printf("%s\n", error ? "ERROR" : "PASS");
  return 0;
}
// -- end of t.cpp</pre>
</div>
</body>
</html>