[Openmp-dev] [libomptarget] Data corruption on target nowait with more than 4 threads

Hervé Yviquel via Openmp-dev openmp-dev at lists.llvm.org
Fri Feb 19 13:38:18 PST 2021


Hi all,

So I took a deeper look at the problems mentioned by Guilherme and here are a few observations:
(1) The data corruption in the BlockMatMul result happens not only with the x86_64-pc-linux-gnu target but also with nvptx64-nvidia-cuda. So the problem seems to come from the target-agnostic part of libomptarget and not specifically from the x86 plugin. Please note that the problem does not always appear, so you might need to run the program multiple times. Reducing the number of OpenMP threads sometimes helps to reproduce the problem with the CUDA plugin.
> export OMP_NUM_THREADS=2
> clang++ -O3 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda BlockMatMul.cpp -o blockmatmul
> for i in {1..100}; do ./blockmatmul || break; done
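
For anyone who wants something smaller than BlockMatMul.cpp to run in the same loop, below is a minimal sketch of the same pattern: a chain of "target nowait" tasks ordered by depend(inout) on one buffer. The function name run_chain and its parameters are made up for illustration, and this reduced case is not confirmed to trigger the corruption; it only exercises the same constructs.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of BlockMatMul.cpp): launches `steps`
// dependent target tasks that each add 1 to every element of one buffer,
// then returns how many elements differ from the expected value.
int run_chain(int steps, int n) {
  assert(steps >= 0 && n > 0);
  std::vector<float> storage(n, 0.0f);
  float *buf = storage.data();

#pragma omp parallel
#pragma omp master
  for (int s = 0; s < steps; ++s) {
    // Every task reads and writes buf, so depend(inout) must serialize them.
#pragma omp target depend(inout: buf[0]) map(tofrom: buf[:n]) nowait
    for (int i = 0; i < n; ++i)
      buf[i] += 1.0f;
  }
  // The implicit barrier at the end of the parallel region waits for all
  // deferred target tasks before the check below.

  int mismatches = 0;
  for (int i = 0; i < n; ++i)
    if (buf[i] != static_cast<float>(steps))
      ++mismatches;
  return mismatches;
}
```

Calling run_chain(16, 1024) from main and returning a nonzero exit code when it reports mismatches lets the shell loop above stop on the first failing run.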

(2) The segfault in __kmp_push_task only happens with the x86_64-pc-linux-gnu target, but it comes from a regression in libomp that seems to have been introduced with the support for hidden helper tasks in the RTL (https://github.com/llvm/llvm-project/commit/9d64275ae08fbdeeca0ce9c2f3951a2de6f38a08#diff-8402e656316eb873d5db4dea7f697406d15ae4197dcc60d88b3d9fc252fcb69a): the task_team pointer here (https://github.com/llvm/llvm-project/blob/6584a9a4c55e10c055f9f450798b826a9624d82f/openmp/runtime/src/kmp_tasking.cpp#L334) is NULL. Maybe you guys have an idea of the best way to solve it.
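
To make the failure mode concrete, here is a small sketch of a push routine that checks for a missing task team before touching its queue. Everything in it (TaskTeam, ThreadInfo, PushResult, push_task) is a hypothetical stand-in and not the libomp code or its data structures; whether a guard like this is the right fix, or whether the task team should simply have been set up earlier, is an open question.

```cpp
#include <cassert>
#include <deque>

// Hypothetical stand-ins, NOT the libomp types.
struct TaskTeam {
  std::deque<int> queue; // pending task ids
};

struct ThreadInfo {
  TaskTeam *task_team = nullptr; // may be unset, as in the crash above
};

enum class PushResult { Pushed, NotPushed };

// Queues a task on the thread's task team. When no task team is attached,
// it reports NotPushed instead of dereferencing a null pointer, so the
// caller can decide to run the task immediately.
PushResult push_task(ThreadInfo &thread, int task_id) {
  if (thread.task_team == nullptr)
    return PushResult::NotPushed;
  thread.task_team->queue.push_back(task_id);
  return PushResult::Pushed;
}
```

The design choice in this sketch is to surface the missing task team to the caller rather than hide it, which keeps the decision about immediate execution in one place.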
Best regards,
Hervé

On Feb. 13, 2021, at 2:18 am, Shilei Tian via Openmp-dev <openmp-dev at lists.llvm.org> wrote:
> Hi Guilherme,
>
> We do have some bugs on the target x86_64-pc-linux-gnu. Not all existing test cases in libomptarget pass (IIRC, three stable failures and one random failure). Therefore, it is expected to see some data races or corruption on that target.
>
> Regards,
> Shilei
>
>
>
>
> > On Feb 12, 2021, at 12:39 PM, Guilherme Valarini via Openmp-dev <openmp-dev at lists.llvm.org> wrote:
> > Hello everyone,
> > I'm having some data corruption issues when using the generic-elf plugin on the program below (blocked matrix multiplication). I tested this program with three builds: the release branches "release/11.x" and "release/12.x", and the main branch. I observed the following behavior:
> > - release/11.x & main: the program works correctly with up to 4 OpenMP threads (OMP_NUM_THREADS=4), but with any higher number the result of the operation becomes incorrect. I believe the problem may also happen with 2-4 threads, only with a lower likelihood (none of 500 executions showed it);
> > - release/12.x: the program crashes with a segfault inside a function called "__kmp_push_task" from the OpenMP runtime, regardless of the number of threads.
> >
> > The program was compiled with the following command after setting the environment variables to point to the correct clang build:
> > "clang++ -fopenmp -fopenmp-targets=x86_64-pc-linux-gnu BlockMatMul.cpp"
> > Does anyone know if this is an already known problem (e.g. multiple parallel mappings happening at the same time)? What about the "__kmp_push_task"?
> > Thanks for the help,
> > Guilherme Valarini
> >
> > Here is the program (sorry, I could not come up with a smaller example to post here). I have dumped the task graph built by OpenMP in dot/graphviz form and it seems to be correct, with the intended dependencies found in the function "BlockMatMul_TargetNowait":
> > > #include <assert.h>
> > > #include <math.h>
> > > #include <stdio.h>
> > > #include <stdlib.h>
> > > #include <vector>
> > > #include <sys/time.h>
> > > #include <time.h>
> > > #include <unistd.h>
> > > class BlockMatrix {
> > > private:
> > >   const int rowsPerBlock;
> > >   const int colsPerBlock;
> > >   const long nRows;
> > >   const long nCols;
> > >   const int nBlocksPerRow;
> > >   const int nBlocksPerCol;
> > >   std::vector<std::vector<float *>> Blocks;
> > > public:
> > >   BlockMatrix(const int _rowsPerBlock, const int _colsPerBlock,
> > >               const long _nRows, const long _nCols)
> > >       : rowsPerBlock(_rowsPerBlock), colsPerBlock(_colsPerBlock), nRows(_nRows),
> > >         nCols(_nCols), nBlocksPerRow(_nRows / _rowsPerBlock),
> > >         nBlocksPerCol(_nCols / _colsPerBlock) {
> > >     Blocks = std::vector<std::vector<float *>>(nBlocksPerCol);
> > >     for (int i = 0; i < nBlocksPerCol; i++) {
> > >       std::vector<float *> rowBlocks(nBlocksPerRow);
> > >       for (int j = 0; j < nBlocksPerRow; j++) {
> > >         rowBlocks[j] =
> > >             (float *)calloc(_rowsPerBlock * _colsPerBlock, sizeof(float));
> > >       }
> > >       Blocks[i] = rowBlocks;
> > >     }
> > >   }
> > >   ~BlockMatrix() {}
> > >   // Initialize the BlockMatrix from 2D arrays
> > >   void Initialize(float *matrix) {
> > >     for (int i = 0; i < nBlocksPerCol; i++)
> > >       for (int j = 0; j < nBlocksPerRow; j++) {
> > >         float *CurrBlock = GetBlock(i, j);
> > >         for (int ii = 0; ii < colsPerBlock; ++ii)
> > >           for (int jj = 0; jj < rowsPerBlock; ++jj) {
> > >             int curri = i * colsPerBlock + ii;
> > >             int currj = j * rowsPerBlock + jj;
> > >             CurrBlock[ii + jj * colsPerBlock] = matrix[curri + currj * nCols];
> > >           }
> > >       }
> > >   }
> > >   long Compare(float *matrix) {
> > >     long fail = 0;
> > >     for (int i = 0; i < nBlocksPerCol; i++)
> > >       for (int j = 0; j < nBlocksPerRow; j++) {
> > >         float *CurrBlock = GetBlock(i, j);
> > >         for (int ii = 0; ii < colsPerBlock; ++ii)
> > >           for (int jj = 0; jj < rowsPerBlock; ++jj) {
> > >             int curri = i * colsPerBlock + ii;
> > >             int currj = j * rowsPerBlock + jj;
> > >             float m_value = matrix[curri + currj * nCols];
> > >             float bm_value = CurrBlock[ii + jj * colsPerBlock];
> > >             if (bm_value != m_value) {
> > >               fprintf(stdout, "i,j = %d,%d\n", i, j);
> > >               fprintf(stdout, "BlockMAT[%d][%d] = %f\n", ii, jj, bm_value);
> > >               fprintf(stdout, "MAT[%d][%d] = %f\n", curri, currj, m_value);
> > >               fail++;
> > >             }
> > >           }
> > >       }
> > >     // Print results
> > >     printf("Non-Matching Block Outputs: %ld\n", fail);
> > >     return fail;
> > >   }
> > >   float *GetBlock(int i, int j) {
> > >     assert(i < nBlocksPerCol && j < nBlocksPerRow && "Accessing outside block");
> > >     return Blocks[i][j];
> > >   }
> > > };
> > >
> > > #define BS 256
> > > #define N 1024
> > >
> > > // Initialize matrices.
> > > void init(float *a, float *b) {
> > >   int i, j;
> > >   for (i = 0; i < N; ++i) {
> > >     for (j = 0; j < N; ++j) {
> > >       a[i * N + j] = (float)i + j % 100;
> > >       b[i * N + j] = (float)i + j % 100;
> > >     }
> > >   }
> > > }
> > > int BlockMatMul_TargetNowait(BlockMatrix &A, BlockMatrix &B, BlockMatrix &C) {
> > >   #pragma omp parallel
> > >   #pragma omp master
> > >   for (int i = 0; i < N / BS; ++i)
> > >     for (int j = 0; j < N / BS; ++j) {
> > >       float *BlockC = C.GetBlock(i, j);
> > >       for (int k = 0; k < N / BS; ++k) {
> > >         float *BlockA = A.GetBlock(i, k);
> > >         float *BlockB = B.GetBlock(k, j);
> > >         #pragma omp target depend(in: BlockA[0], BlockB[0]) \
> > >             depend(inout: BlockC[0]) \
> > >             map(to: BlockA[:BS*BS], BlockB[:BS*BS]) \
> > >             map(tofrom: BlockC[:BS*BS]) nowait
> > >         #pragma omp parallel for
> > >         for (int ii = 0; ii < BS; ii++)
> > >           for (int jj = 0; jj < BS; jj++) {
> > >             for (int kk = 0; kk < BS; ++kk)
> > >               BlockC[ii + jj * BS] += BlockA[ii + kk * BS] * BlockB[kk + jj * BS];
> > >           }
> > >       }
> > >     }
> > >   return 0;
> > > }
> > > void Matmul(float *a, float *b, float *c) {
> > >   for (int i = 0; i < N; ++i) {
> > >     for (int j = 0; j < N; ++j) {
> > >       float sum = 0.0;
> > >       for (int k = 0; k < N; ++k) {
> > >         sum = sum + a[i * N + k] * b[k * N + j];
> > >       }
> > >       c[i * N + j] = sum;
> > >     }
> > >   }
> > > }
> > > int main(int argc, char *argv[]) {
> > >   double t_start, t_end;
> > >   int ret = 0;
> > >   float *a = (float *)malloc(sizeof(float) * N * N);
> > >   float *b = (float *)malloc(sizeof(float) * N * N);
> > >   float *c = (float *)calloc(sizeof(float), N * N);
> > >   init(a, b);
> > >   auto BlockedA = BlockMatrix(BS, BS, N, N);
> > >   BlockedA.Initialize(a);
> > >   BlockedA.Compare(a);
> > >   auto BlockedB = BlockMatrix(BS, BS, N, N);
> > >   BlockedB.Initialize(b);
> > >   BlockedB.Compare(b);
> > >   Matmul(a, b, c);
> > >   auto BlockedC = BlockMatrix(BS, BS, N, N);
> > >   BlockMatMul_TargetNowait(BlockedA, BlockedB, BlockedC);
> > >   if (BlockedC.Compare(c) > 0) {
> > >     // set the exit code to an error if there is any mismatch
> > >     ret = 1;
> > >   }
> > >   free(a);
> > >   free(b);
> > >   free(c);
> > >   return ret;
> > > }
> >
> >
> > _______________________________________________
> > Openmp-dev mailing list
> > Openmp-dev at lists.llvm.org
> > https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev
> >
