<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">Hi Guilherme,<div class=""><br class=""></div><div class="">We do have some bugs on the target <font face="Courier New" class="">x86_64-pc-linux-gnu</font>. Not all of the existing test cases in <font face="Courier New" class="">libomptarget</font> pass on it (IIRC, three consistent failures and one intermittent failure). Therefore, some data races or corruption are expected on this target.</div><div class=""><br class=""><div class="">
<div dir="auto" style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><div>Regards,<br class="">Shilei<br class=""></div></div>
</div>
<div><br class=""><blockquote type="cite" class=""><div class="">On Feb 12, 2021, at 12:39 PM, Guilherme Valarini via Openmp-dev &lt;<a href="mailto:openmp-dev@lists.llvm.org" class="">openmp-dev@lists.llvm.org</a>&gt; wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class="">Hello everyone,<br class=""><br class="">I'm having some data corruption issues when using the generic-elf plugin on the program below (blocked matrix multiplication). I tested this program with three builds: the release branches "release/11.x" and "release/12.x", and the main branch. I observed the following behavior:<br class=""><br class="">- release/11.x &amp; main: the program works correctly with up to 4 OpenMP threads (OMP_NUM_THREADS=4), but with more threads the result of the operation becomes incorrect. I believe the problem may also occur with 2-4 threads, just less often (none of 500 executions showed it);<br class="">- release/12.x: the program crashes with a segfault inside the function "__kmp_push_task" of the OpenMP runtime, regardless of the number of threads.<br class=""><br class="">The program was compiled with the following command after setting the environment variables to point to the correct clang build:<br class=""><br class="">"clang++ -fopenmp -fopenmp-targets=x86_64-pc-linux-gnu BlockMatMul.cpp"<br class=""><br class="">Does anyone know if this is an already known problem (e.g. multiple parallel mappings happening at the same time)? What about the segfault in "__kmp_push_task"?<br class=""><br class="">Thanks for the help,<br class="">Guilherme Valarini<br class=""><br class="">Here is the program (sorry, I could not come up with a smaller example to post here). 
I have dumped the task graph built by OpenMP in dot/Graphviz form, and it seems correct, with the intended dependencies found in the function "BlockMatMul_TargetNowait":<br class=""><br class=""><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">#include &lt;assert.h&gt;<br class="">#include &lt;math.h&gt;<br class="">#include &lt;stdio.h&gt;<br class="">#include &lt;stdlib.h&gt;<br class="">#include &lt;vector&gt;<br class="">#include &lt;sys/time.h&gt;<br class="">#include &lt;time.h&gt;<br class="">#include &lt;unistd.h&gt;<br class="">class BlockMatrix {<br class="">private:<br class=""> const int rowsPerBlock;<br class=""> const int colsPerBlock;<br class=""> const long nRows;<br class=""> const long nCols;<br class=""> const int nBlocksPerRow;<br class=""> const int nBlocksPerCol;<br class=""> std::vector&lt;std::vector&lt;float *&gt;&gt; Blocks;<br class="">public:<br class=""> BlockMatrix(const int _rowsPerBlock, const int _colsPerBlock,<br class=""> const long _nRows, const long _nCols)<br class=""> : rowsPerBlock(_rowsPerBlock), colsPerBlock(_colsPerBlock), nRows(_nRows),<br class=""> nCols(_nCols), nBlocksPerRow(_nRows / _rowsPerBlock),<br class=""> nBlocksPerCol(_nCols / _colsPerBlock) {<br class=""> Blocks = std::vector&lt;std::vector&lt;float *&gt;&gt;(nBlocksPerCol);<br class=""> for (int i = 0; i < nBlocksPerCol; i++) {<br class=""> std::vector&lt;float *&gt; rowBlocks(nBlocksPerRow);<br class=""> for (int j = 0; j < nBlocksPerRow; j++) {<br class=""> rowBlocks[j] =<br class=""> (float *)calloc(_rowsPerBlock * _colsPerBlock, sizeof(float));<br class=""> }<br class=""> Blocks[i] = rowBlocks;<br class=""> }<br class=""> };<br class=""> ~BlockMatrix() {};<br class=""> // Initialize the BlockMatrix from 2D arrays<br class=""> void Initialize(float *matrix) {<br class=""> for (int i = 0; i < nBlocksPerCol; i++)<br class=""> for (int j = 0; j < nBlocksPerRow; j++) {<br class=""> float *CurrBlock = GetBlock(i, j);<br class=""> for (int ii = 0; ii < 
colsPerBlock; ++ii)<br class=""> for (int jj = 0; jj < rowsPerBlock; ++jj) {<br class=""> int curri = i * colsPerBlock + ii;<br class=""> int currj = j * rowsPerBlock + jj;<br class=""> CurrBlock[ii + jj * colsPerBlock] = matrix[curri + currj * nCols];<br class=""> }<br class=""> }<br class=""> }<br class=""> long Compare(float *matrix) {<br class=""> long fail=0;<br class=""> for (int i = 0; i < nBlocksPerCol; i++)<br class=""> for (int j = 0; j < nBlocksPerRow; j++) {<br class=""> float *CurrBlock = GetBlock(i, j);<br class=""> for (int ii = 0; ii < colsPerBlock; ++ii)<br class=""> for (int jj = 0; jj < rowsPerBlock; ++jj) {<br class=""> int curri = i * colsPerBlock + ii;<br class=""> int currj = j * rowsPerBlock + jj;<br class=""> float m_value = matrix[curri + currj * nCols];<br class=""> float bm_value = CurrBlock[ii + jj * colsPerBlock];<br class=""> if(bm_value != m_value){<br class=""> fprintf(stdout, "i,j = %d,%d\n", i, j);<br class=""> fprintf(stdout, "BlockMAT[%d][%d] = %f\n", ii, jj, bm_value);<br class=""> fprintf(stdout, "MAT[%d][%d] = %f\n", curri, currj, m_value);<br class=""> fail++;<br class=""> }<br class=""> }<br class=""> }<br class=""> // Print results<br class=""> printf("Non-Matching Block Outputs: %ld\n", fail);<br class=""> return fail;<br class=""> }<br class=""> float *GetBlock(int i, int j) {<br class=""> assert(i < nBlocksPerCol && j < nBlocksPerRow && "Accessing outside block");<br class=""> return Blocks[i][j];<br class=""> }<br class="">};<br class=""><br class="">#define BS 256<br class="">#define N 1024<br class=""><br class="">// Initialize matrices.<br class="">void init(float *a, float *b) {<br class=""> int i, j;<br class=""> for (i = 0; i < N; ++i) {<br class=""> for (j = 0; j < N; ++j) {<br class=""> a[i * N + j] = (float)i + j % 100;<br class=""> b[i * N + j] = (float)i + j % 100;<br class=""> }<br class=""> }<br class="">}<br class="">int BlockMatMul_TargetNowait(BlockMatrix &A, BlockMatrix &B, BlockMatrix &C) {<br 
class=""> #pragma omp parallel<br class=""> #pragma omp master<br class=""> for (int i = 0; i < N / BS; ++i)<br class=""> for (int j = 0; j < N / BS; ++j) {<br class=""> float *BlockC = C.GetBlock(i, j);<br class=""> for (int k = 0; k < N / BS; ++k) {<br class=""> float *BlockA = A.GetBlock(i, k);<br class=""> float *BlockB = B.GetBlock(k,j);<br class=""> #pragma omp target depend(in: BlockA[0], BlockB[0]) \<br class=""> depend(inout: BlockC[0]) \<br class=""> map(to: BlockA[:BS*BS], BlockB[:BS*BS]) \<br class=""> map(tofrom: BlockC[:BS*BS]) nowait<br class=""> #pragma omp parallel for<br class=""> for(int ii = 0; ii < BS; ii++)<br class=""> for(int jj = 0; jj < BS; jj++) {<br class=""> for(int kk = 0; kk < BS; ++kk)<br class=""> BlockC[ii + jj * BS] += BlockA[ii + kk * BS] * BlockB[kk + jj * BS];<br class=""> }<br class=""> }<br class=""> }<br class=""> return 0;<br class="">}<br class="">void Matmul(float *a, float *b, float *c) {<br class=""> for (int i = 0; i < N; ++i) {<br class=""> for (int j = 0; j < N; ++j) {<br class=""> float sum = 0.0;<br class=""> for (int k = 0; k < N; ++k) {<br class=""> sum = sum + a[i * N + k] * b[k * N + j];<br class=""> }<br class=""> c[i * N + j] = sum;<br class=""> }<br class=""> }<br class="">}<br class="">int main(int argc, char *argv[]) {<br class=""> double t_start, t_end;<br class=""> int ret = 0;<br class=""> float *a = (float *)malloc(sizeof(float) * N * N);<br class=""> float *b = (float *)malloc(sizeof(float) * N * N);<br class=""> float *c = (float *)calloc(sizeof(float), N * N);<br class=""> init(a, b);<br class=""> auto BlockedA = BlockMatrix(BS, BS, N, N);<br class=""> BlockedA.Initialize(a);<br class=""> BlockedA.Compare(a);<br class=""> auto BlockedB = BlockMatrix(BS, BS, N, N);<br class=""> BlockedB.Initialize(b);<br class=""> BlockedB.Compare(b);<br class=""> Matmul(a, b, c);<br class=""> auto BlockedC = BlockMatrix(BS, BS, N, N);<br class=""> BlockMatMul_TargetNowait(BlockedA, BlockedB, BlockedC);<br class=""> 
if(BlockedC.Compare(c) > 0) {<br class=""> // set the exit code to an error if there is any mismatch<br class=""> ret = 1;<br class=""> }<br class=""> free(a);<br class=""> free(b);<br class=""> free(c);<br class=""> return ret;<br class="">}</blockquote></div>
</div></blockquote></div><br class=""></div></body></html>