[Mlir-commits] [mlir] [MLIR][OpenMP] Improve Generic-SPMD kernel detection (PR #137307)

Tue May 20 04:13:24 PDT 2025

skatrak wrote:

> I tried to replicate the issue in C, but that doesn't seem to work. Maybe we should compare the IR.

Thank you for checking this. It looks like that test was fixed sometime after I created this PR: I was able to reproduce the failures with clang until I updated to the latest main branch. However, this other test that @Meinersbur made shows another case where running in Generic-SPMD mode is currently required to get the expected results:

```c
#include <stdio.h>
#include <omp.h>

int main() {
  int i, j, a = 0, b = 0, c = 0, g = 21;

  #pragma omp target teams distribute thread_limit(10) private(i,j) reduction(+:a,b,c,g)
  for (i = 1; i <= 10; ++i) {
    j = i;
    if (j == 5) {
      g += 10 * omp_get_team_num() + omp_get_thread_num();
      ++c;
      j = 11;
    }
    if (j == 11) {
      #pragma omp parallel num_threads(10) reduction(+:a)
      {
        ++a;
      }
    } else {
      #pragma omp parallel num_threads(10) reduction(+:b)
      {
        ++b;
      }
    }
  }

  printf("a: %d\nb: %d\nc: %d\ng: %d", a, b, c, g);
  return 0;
}
```

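Compiling and running this test gives the following results. The output is the same for -O1, -O2 and -O3, since they all use Generic-SPMD, whereas -O0 runs in Generic mode and reports `a: 1` and `b: 9` instead of `a: 10` and `b: 90`: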

```
clang -fopenmp --offload-arch=native test.c -O0 -o generic.O0 && ./generic.O0
a: 1
b: 9
c: 1
g: 61

"PluginInterface" device 0 info: Launching kernel __omp_offloading_10307_d124924_main_l29 with [10,1,1] blocks and [10,1,1] threads in Generic mode
AMDGPU device 0 info: #Args: 5 Teams x Thrds:   10x  10 (MaxFlatWorkGroupSize: 10) LDS Usage: 2280B #SGPRs/VGPRs: 76/59 #SGPR/VGPR Spills: 20/13 Tripcount: 10

clang -fopenmp --offload-arch=native test.c -O1 -o generic.O1 && ./generic.O1
a: 10
b: 90
c: 1
g: 61

"PluginInterface" device 0 info: Launching kernel __omp_offloading_10307_d124924_main_l29 with [10,1,1] blocks and [10,1,1] threads in Generic-SPMD mode
AMDGPU device 0 info: #Args: 5 Teams x Thrds:   10x  10 (MaxFlatWorkGroupSize: 10) LDS Usage: 1768B #SGPRs/VGPRs: 37/32 #SGPR/VGPR Spills: 0/0 Tripcount: 10
```
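To make the expected values easier to check by hand, here is a minimal serial sketch of what the test computes. It is not part of the PR or of any runtime: `struct counts` and `expected_counts` are hypothetical names, and the model assumes that iteration `i` is run by team `i - 1`, that the `distribute` body is executed by thread 0 of that team (both consistent with `g: 61` = 21 + 10 * 4 + 0), and that each nested `parallel` region contributes one increment per thread it actually gets.

```c
#include <assert.h>

/* Hypothetical helper for illustration only: a serial model of the test. */
struct counts { int a, b, c, g; };

/* threads_per_region: how many threads each nested parallel region really
 * runs with (10 if num_threads(10) is honored, 1 if it is serialized). */
static struct counts expected_counts(int num_iters, int threads_per_region) {
  assert(num_iters >= 0 && threads_per_region >= 1);

  struct counts r = {0, 0, 0, 21}; /* same initial values as the test */
  for (int i = 1; i <= num_iters; ++i) {
    int team = i - 1; /* assumption: one iteration per team, in order */
    int j = i;
    if (j == 5) {
      r.g += 10 * team + 0; /* assumption: thread 0 runs the distribute body */
      ++r.c;
      j = 11;
    }
    if (j == 11)
      r.a += threads_per_region; /* each thread does ++a once */
    else
      r.b += threads_per_region; /* each thread does ++b once */
  }
  return r;
}
```

Under these assumptions, `expected_counts(10, 10)` matches the Generic-SPMD output (a: 10, b: 90, c: 1, g: 61), while `expected_counts(10, 1)` matches the -O0 Generic output (a: 1, b: 9), which is what one would see if each nested parallel region ran with a single thread.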

https://github.com/llvm/llvm-project/pull/137307