<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">Thanks. I opened a bug to track the issue. <a href="https://bugs.llvm.org/show_bug.cgi?id=49334" class="">https://bugs.llvm.org/show_bug.cgi?id=49334</a><div class=""><br class=""><div class="">
<div dir="auto" style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><div>Regards,<br class="">Shilei<br class=""></div></div>
</div>
<div style=""><br class=""><blockquote type="cite" class=""><div class="">On Feb 19, 2021, at 4:38 PM, Hervé Yviquel <<a href="mailto:herve@ic.unicamp.br" class="">herve@ic.unicamp.br</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div class="">Hi all,</div><br class=""><div class="">So I took a deeper look at the problems mentioned by Guilherme and here are a few observations:</div><br class=""><div class="">(1) The corruption of the BlockMatMul result happens not only with the x86_64-pc-linux-gnu target but also with nvptx64-nvidia-cuda. So it seems the problem comes from the target-agnostic part of libomptarget and not specifically from the x86 plugin. Please note that the problem does not always appear, so you might need to run the program multiple times. Reducing the number of OpenMP threads sometimes helps to reproduce the problem with the CUDA plugin.</div><br class=""><blockquote class="">export OMP_NUM_THREADS=2</blockquote><blockquote class="">clang++ -O3 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda BlockMatMul.cpp -o blockmatmul</blockquote><blockquote class="">for i in {1..100}; do ./blockmatmul || break; done</blockquote><br class=""><div class="">(2) The segfault in __kmp_push_task happens only for the x86_64-pc-linux-gnu target, but it comes from a regression in libomp that seems to have been introduced with the <a href="https://github.com/llvm/llvm-project/commit/9d64275ae08fbdeeca0ce9c2f3951a2de6f38a08#diff-8402e656316eb873d5db4dea7f697406d15ae4197dcc60d88b3d9fc252fcb69a" title="https://github.com/llvm/llvm-project/commit/9d64275ae08fbdeeca0ce9c2f3951a2de6f38a08#diff-8402e656316eb873d5db4dea7f697406d15ae4197dcc60d88b3d9fc252fcb69a" class="">support for hidden helper task in RTL</a>: it occurs because the task_team pointer <a href="https://github.com/llvm/llvm-project/blob/6584a9a4c55e10c055f9f450798b826a9624d82f/openmp/runtime/src/kmp_tasking.cpp#L334" 
title="https://github.com/llvm/llvm-project/blob/6584a9a4c55e10c055f9f450798b826a9624d82f/openmp/runtime/src/kmp_tasking.cpp#L334" class="">here</a> is NULL. Maybe you have an idea of the best way to fix it.</div><br class=""><div class="">Best regards,</div><div class="">Hervé</div><br class=""><div class="gmail_quote_attribution">On Feb 13, 2021, at 2:18 am, Shilei Tian via Openmp-dev <<a href="mailto:openmp-dev@lists.llvm.org" class="">openmp-dev@lists.llvm.org</a>> wrote:</div><blockquote class=""><div class="">Hi Guilherme,</div><div class=""><br class=""></div><div class="">We do have some bugs on the target <font style="font-family:"Courier New"" class="">x86_64-pc-linux-gnu</font>. Not all existing test cases in <font style="font-family:"Courier New"" class="">libomptarget</font> pass (IIRC, three stable failures and one random failure). Therefore, some data races or corruption are expected on that target.</div><div class=""><br class=""><div class=""><div class=""><div class=""><div class="">Regards,</div><div class="">Shilei</div></div></div></div><div class=""><br class=""><blockquote class=""><div class="">On Feb 12, 2021, at 12:39 PM, Guilherme Valarini via Openmp-dev <<a href="mailto:openmp-dev@lists.llvm.org" title="mailto:openmp-dev@lists.llvm.org" class="">openmp-dev@lists.llvm.org</a>> wrote:</div><br class=""><div class=""><div class=""><div class="">Hello everyone,</div><br class=""><div class="">I'm having some data corruption issues when using the generic-elf plugin on the program below (blocked matrix multiplication). I tested this program with three builds: the release branches "release/11.x" and "release/12.x", and the main branch as well. I observed the following behavior:</div><br class=""><div class="">- release/11.x & main: the program works correctly with up to 4 OpenMP threads (OMP_NUM_THREADS=4), but with any number higher than that the result of the operation becomes incorrect. 
I believe that the problem may also happen with 2-4 threads, but with a lower likelihood (in 500 executions, none showed the problem);</div><div class="">- release/12.x: the program crashes due to a segfault inside a function called "__kmp_push_task" from the OpenMP runtime regardless of the number of threads.</div><br class=""><div class="">The program was compiled with the following command after setting the environment variables to point to the correct clang build:</div><br class=""><div class="">"clang++ -fopenmp -fopenmp-targets=x86_64-pc-linux-gnu BlockMatMul.cpp"</div><br class=""><div class="">Does anyone know if this is a known problem (e.g. multiple parallel mappings happening at the same time)? What about the "__kmp_push_task" segfault?</div><br class=""><div class="">Thanks for the help,</div><div class="">Guilherme Valarini</div><br class=""><div class="">Here is the program (sorry, I could not come up with a smaller example to post here). I have dumped the task graph built by OpenMP in dot/graphviz form and it seems to be correct, with the intended dependencies in the function "BlockMatMul_TargetNowait":</div><br class=""><blockquote class=""><div class="">#include <assert.h></div><div class="">#include <math.h></div><div class="">#include <stdio.h></div><div class="">#include <stdlib.h></div><div class="">#include <vector></div><div class="">#include <sys/time.h></div><div class="">#include <time.h></div><div class="">#include <unistd.h></div><div class="">class BlockMatrix {</div><div class="">private:</div><div class=""> const int rowsPerBlock;</div><div class=""> const int colsPerBlock;</div><div class=""> const long nRows;</div><div class=""> const long nCols;</div><div class=""> const int nBlocksPerRow;</div><div class=""> const int nBlocksPerCol;</div><div class=""> std::vector<std::vector<float *>> Blocks;</div><div class="">public:</div><div class=""> BlockMatrix(const int _rowsPerBlock, const int 
_colsPerBlock,</div><div class=""> const long _nRows, const long _nCols)</div><div class=""> : rowsPerBlock(_rowsPerBlock), colsPerBlock(_colsPerBlock), nRows(_nRows),</div><div class=""> nCols(_nCols), nBlocksPerRow(_nRows / _rowsPerBlock),</div><div class=""> nBlocksPerCol(_nCols / _colsPerBlock) {</div><div class=""> Blocks = std::vector<std::vector<float *>>(nBlocksPerCol);</div><div class=""> for (int i = 0; i < nBlocksPerCol; i++) {</div><div class=""> std::vector<float *> rowBlocks(nBlocksPerRow);</div><div class=""> for (int j = 0; j < nBlocksPerRow; j++) {</div><div class=""> rowBlocks[j] =</div><div class=""> (float *)calloc(_rowsPerBlock * _colsPerBlock, sizeof(float));</div><div class=""> }</div><div class=""> Blocks[i] = rowBlocks;</div><div class=""> }</div><div class=""> };</div><div class=""> ~BlockMatrix() {};</div><div class=""> // Initialize the BlockMatrix from 2D arrays</div><div class=""> void Initialize(float *matrix) {</div><div class=""> for (int i = 0; i < nBlocksPerCol; i++)</div><div class=""> for (int j = 0; j < nBlocksPerRow; j++) {</div><div class=""> float *CurrBlock = GetBlock(i, j);</div><div class=""> for (int ii = 0; ii < colsPerBlock; ++ii)</div><div class=""> for (int jj = 0; jj < rowsPerBlock; ++jj) {</div><div class=""> int curri = i * colsPerBlock + ii;</div><div class=""> int currj = j * rowsPerBlock + jj;</div><div class=""> CurrBlock[ii + jj * colsPerBlock] = matrix[curri + currj * nCols];</div><div class=""> }</div><div class=""> }</div><div class=""> }</div><div class=""> long Compare(float *matrix) {</div><div class=""> long fail=0;</div><div class=""> for (int i = 0; i < nBlocksPerCol; i++)</div><div class=""> for (int j = 0; j < nBlocksPerRow; j++) {</div><div class=""> float *CurrBlock = GetBlock(i, j);</div><div class=""> for (int ii = 0; ii < colsPerBlock; ++ii)</div><div class=""> for (int jj = 0; jj < rowsPerBlock; ++jj) {</div><div class=""> int curri = i * colsPerBlock + ii;</div><div class=""> int currj = j * 
rowsPerBlock + jj;</div><div class=""> float m_value = matrix[curri + currj * nCols];</div><div class=""> float bm_value = CurrBlock[ii + jj * colsPerBlock];</div><div class=""> if(bm_value != m_value){</div><div class=""> fprintf(stdout, "i,j = %d,%d\n", i, j);</div><div class=""> fprintf(stdout, "BlockMAT[%d][%d] = %f\n", ii, jj, bm_value);</div><div class=""> fprintf(stdout, "MAT[%d][%d] = %f\n", curri, currj, m_value);</div><div class=""> fail++;</div><div class=""> }</div><div class=""> }</div><div class=""> }</div><div class=""> // Print results</div><div class=""> printf("Non-Matching Block Outputs: %ld\n", fail);</div><div class=""> return fail;</div><div class=""> }</div><div class=""> float *GetBlock(int i, int j) {</div><div class=""> assert(i < nBlocksPerCol && j < nBlocksPerRow && "Accessing outside block");</div><div class=""> return Blocks[i][j];</div><div class=""> }</div><div class="">};</div><br class=""><div class="">#define BS 256</div><div class="">#define N 1024</div><br class=""><div class="">// Initialize matrices.</div><div class="">void init(float *a, float *b) {</div><div class=""> int i, j;</div><div class=""> for (i = 0; i < N; ++i) {</div><div class=""> for (j = 0; j < N; ++j) {</div><div class=""> a[i * N + j] = (float)i + j % 100;</div><div class=""> b[i * N + j] = (float)i + j % 100;</div><div class=""> }</div><div class=""> }</div><div class="">}</div><div class="">int BlockMatMul_TargetNowait(BlockMatrix &A, BlockMatrix &B, BlockMatrix &C) {</div><div class=""> #pragma omp parallel</div><div class=""> #pragma omp master</div><div class=""> for (int i = 0; i < N / BS; ++i)</div><div class=""> for (int j = 0; j < N / BS; ++j) {</div><div class=""> float *BlockC = C.GetBlock(i, j);</div><div class=""> for (int k = 0; k < N / BS; ++k) {</div><div class=""> float *BlockA = A.GetBlock(i, k);</div><div class=""> float *BlockB = B.GetBlock(k,j);</div><div class=""> #pragma omp target depend(in: BlockA[0], BlockB[0]) \</div><div class=""> 
depend(inout: BlockC[0]) \</div><div class=""> map(to: BlockA[:BS*BS], BlockB[:BS*BS]) \</div><div class=""> map(tofrom: BlockC[:BS*BS]) nowait</div><div class=""> #pragma omp parallel for</div><div class=""> for(int ii = 0; ii < BS; ii++)</div><div class=""> for(int jj = 0; jj < BS; jj++) {</div><div class=""> for(int kk = 0; kk < BS; ++kk)</div><div class=""> BlockC[ii + jj * BS] += BlockA[ii + kk * BS] * BlockB[kk + jj * BS];</div><div class=""> }</div><div class=""> }</div><div class=""> }</div><div class=""> return 0;</div><div class="">}</div><div class="">void Matmul(float *a, float *b, float *c) {</div><div class=""> for (int i = 0; i < N; ++i) {</div><div class=""> for (int j = 0; j < N; ++j) {</div><div class=""> float sum = 0.0;</div><div class=""> for (int k = 0; k < N; ++k) {</div><div class=""> sum = sum + a[i * N + k] * b[k * N + j];</div><div class=""> }</div><div class=""> c[i * N + j] = sum;</div><div class=""> }</div><div class=""> }</div><div class="">}</div><div class="">int main(int argc, char *argv[]) {</div><div class=""> double t_start, t_end;</div><div class=""> int ret = 0;</div><div class=""> float *a = (float *)malloc(sizeof(float) * N * N);</div><div class=""> float *b = (float *)malloc(sizeof(float) * N * N);</div><div class=""> float *c = (float *)calloc(sizeof(float), N * N);</div><div class=""> init(a, b);</div><div class=""> auto BlockedA = BlockMatrix(BS, BS, N, N);</div><div class=""> BlockedA.Initialize(a);</div><div class=""> BlockedA.Compare(a);</div><div class=""> auto BlockedB = BlockMatrix(BS, BS, N, N);</div><div class=""> BlockedB.Initialize(b);</div><div class=""> BlockedB.Compare(b);</div><div class=""> Matmul(a, b, c);</div><div class=""> auto BlockedC = BlockMatrix(BS, BS, N, N);</div><div class=""> BlockMatMul_TargetNowait(BlockedA, BlockedB, BlockedC);</div><div class=""> if(BlockedC.Compare(c) > 0) {</div><div class=""> // set a non-zero exit code if there is any mismatch</div><div class=""> ret = 1;</div><div 
class=""> }</div><div class=""> free(a);</div><div class=""> free(b);</div><div class=""> free(c);</div><div class=""> return ret;</div><div class="">}</div></blockquote></div><div class="">_______________________________________________</div><div class="">Openmp-dev mailing list</div><div class=""><a href="mailto:Openmp-dev@lists.llvm.org" title="mailto:Openmp-dev@lists.llvm.org" class="">Openmp-dev@lists.llvm.org</a></div><div class=""><a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev" class="">https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev</a></div></div></blockquote></div><br class=""></div></blockquote></div></blockquote></div><br class=""></div></body></html>