<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - Regression: OpenMP team private array accessed out of bounds"
href="https://bugs.llvm.org/show_bug.cgi?id=48794">48794</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>Regression: OpenMP team private array accessed out of bounds
</td>
</tr>
<tr>
<th>Product</th>
<td>OpenMP
</td>
</tr>
<tr>
<th>Version</th>
<td>unspecified
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>normal
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>Clang Compiler Support
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedclangbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>csdaley@lbl.gov
</td>
</tr>
<tr>
<th>CC</th>
<td>llvm-bugs@lists.llvm.org
</td>
</tr></table>
<p>
<div>
<pre>Created <span class=""><a href="attachment.cgi?id=24396" name="attach_24396" title="Source file to reproduce the bug - compile with -DTILED_COPY">attachment 24396</a> <a href="attachment.cgi?id=24396&action=edit" title="Source file to reproduce the bug - compile with -DTILED_COPY">[details]</a></span>
Source file to reproduce the bug - compile with -DTILED_COPY
There has been a regression in the master branch during December between commit
bc7a61b7036044636f9a2c91c916153532a551f8 (success) and commit
f75bf712deecc1be157f7ba6e2fd80922942546c (failure): 53 commits between these
two commits. The issue continues to be present today (tested commit
a89242d874df72cddeafbebc75ac377371e72796). The simplified reproducer program
named "copy.c" is attached. It involves copying data between two arrays using
an OpenMP team private array as an intermediary. The function causing the issue
is named "copySM" -- it is used in the executable when compiling with
-DTILED_COPY.
The error message when executing on an NVIDIA V100 is as follows:
```
+ clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wall -Wno-openmp-mapping
-Wno-unknown-cuda-version -g -DTILED_COPY -o copy copy.c
+ srun -n 1 ./copy
CUDA error: Error when synchronizing stream. stream = 0x0000000000fbcf90, async
info ptr = 0x00007fffffffb618
CUDA error: an illegal memory access was encountered
Libomptarget error: Failed to synchronize device.
Libomptarget error: Call to targetDataEnd failed, abort target.
Libomptarget error: Failed to process data after launching the kernel.
copy.c:60:1: Libomptarget fatal error 1: failure of target construct while
offloading is mandatory
```
There is more information when using cuda-memcheck. It shows "Out-of-range
Shared or Local Address". This is consistent with my expectation that OpenMP
team private arrays would be placed in GPU shared memory.
```
+ clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wall -Wno-openmp-mapping
-Wno-unknown-cuda-version -g -DTILED_COPY -o copy copy.c
+ srun -n 1 cuda-memcheck ./copy
CUDA error: Error when synchronizing stream. stream = 0x0000000002f3cf20, async
info ptr = 0x00007fffffffb5a8
CUDA error: unspecified launch failure
Libomptarget error: Failed to synchronize device.
Libomptarget error: Call to targetDataEnd failed, abort target.
Libomptarget error: Failed to process data after launching the kernel.
copy.c:60:1: Libomptarget fatal error 1: failure of target construct while
offloading is mandatory
========= CUDA-MEMCHECK
========= Out-of-range Shared or Local Address
========= at 0x000000e0 in __cuda_syscall_mc_dyn_globallock_check
========= by thread (96,0,0) in block (241,0,0)
========= Device Frame:__kmpc_data_sharing_push_stack
(__kmpc_data_sharing_push_stack : 0xe60)
========= Saved host backtrace up to driver entry point at kernel launch
time
========= Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x34e)
[0x2e46de]
========= Host
Frame:/project/projectdirs/m1759/csdaley/software/cgpu/llvm/12.0.0-git_20210116/lib/libomptarget.rtl.cuda.so
(__tgt_rtl_run_target_team_region_async + 0x2fb) [0x580b]
========= Host
Frame:/project/projectdirs/m1759/csdaley/software/cgpu/llvm/12.0.0-git_20210116/lib/libomptarget.so.12git
[0x17ca3]
========= Host
Frame:/project/projectdirs/m1759/csdaley/software/cgpu/llvm/12.0.0-git_20210116/lib/libomptarget.so.12git
(__tgt_target_teams_mapper + 0x15c) [0x11b6c]
========= Host Frame:./copy [0xb42]
========= Host Frame:./copy [0x12b0]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xea) [0x20f8a]
========= Host Frame:./copy [0x9aa]
=========
========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to "unspecified
launch failure" on CUDA API call to cuStreamSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib64/libcuda.so.1 (cuStreamSynchronize + 0x18e)
[0x2e418e]
========= Host
Frame:/project/projectdirs/m1759/csdaley/software/cgpu/llvm/12.0.0-git_20210116/lib/libomptarget.rtl.cuda.so
(__tgt_rtl_synchronize + 0x1f) [0x4e0f]
========= Host
Frame:/project/projectdirs/m1759/csdaley/software/cgpu/llvm/12.0.0-git_20210116/lib/libomptarget.so.12git
[0x93c9]
========= Host
Frame:/project/projectdirs/m1759/csdaley/software/cgpu/llvm/12.0.0-git_20210116/lib/libomptarget.so.12git
[0x15ef0]
========= Host
Frame:/project/projectdirs/m1759/csdaley/software/cgpu/llvm/12.0.0-git_20210116/lib/libomptarget.so.12git
[0x17cf2]
========= Host
Frame:/project/projectdirs/m1759/csdaley/software/cgpu/llvm/12.0.0-git_20210116/lib/libomptarget.so.12git
(__tgt_target_teams_mapper + 0x15c) [0x11b6c]
========= Host Frame:./copy [0xb42]
========= Host Frame:./copy [0x12b0]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xea) [0x20f8a]
========= Host Frame:./copy [0x9aa]
=========
========= Error: process didn't terminate successfully
========= No CUDA-MEMCHECK results found
```
The error seems to be related to the size of the data type used for the team
private array. I only get an error when using an 8 byte type (double, long).
The error goes away when I use a 4-byte type (int, float). The function looks
like the following:
```
void copySM(T* out, const T* in)
{
#pragma omp target teams distribute collapse(2)
for (int xtile = 0; xtile < NX; xtile += TILE_DIM) {
for (int ytile = 0; ytile < NY; ytile += TILE_DIM) {
T sm[TILE_DIM * TILE_DIM]; // This fails when an 8-byte type
#pragma omp parallel for collapse(2)
for (int xs = 0; xs < TILE_DIM; ++xs) {
for (int ys = 0; ys < TILE_DIM; ++ys) {
int xg = xs + xtile;
int yg = ys + ytile;
sm[SIDX(xs,ys)] = in[GIDX(xg,yg)];
}
} /* Implicit barrier here */
#pragma omp parallel for collapse(2)
for (int xs = 0; xs < TILE_DIM; ++xs) {
for (int ys = 0; ys < TILE_DIM; ++ys) {
int xg = xs + xtile;
int yg = ys + ytile;
out[GIDX(xg,yg)] = sm[SIDX(xs,ys)];
}
}
}
}
}
```</pre>
</div>
</p>
<hr>
<span>You are receiving this mail because:</span>
<ul>
<li>You are on the CC list for the bug.</li>
</ul>
</body>
</html>