[Mlir-commits] [mlir] [OpenMP][mlir] Add Groupprivate op in omp dialect. (PR #162704)

Tue Dec 2 08:50:55 PST 2025

skc7 wrote:

> > Thanks @ergawy for finding the issue in shared memory allocation and deallocation in lowering of omp.groupprivate.
> > The current implementation directly lowers `omp.groupprivate` to `__kmpc_alloc_shared()` calls, which is incorrect per the OpenMP specification.
> > As per OpenMP spec, groupprivate variables should have **one copy per contention group (team)**, shared by all threads within that team.
> > **For GPU targets (device):**
> > ```
> > * Groupprivate is equivalent to GPU shared memory (LDS on AMD)
> > 
> > * Could be lowered to a global variable in **address space 3** (NVPTX/AMDGPU shared memory) during OpenMPToLLVMIRTranslation.
> > 
> > * This automatically provides one copy per team, shared by all threads
> > ```
> > 
> > 
> >     
> >       
> >     
> > 
> >       
> >     
> > 
> >     
> >   
> > But, I'm unaware of the case of lowering when groupprivate variable is accessed on host. Do we have any runtime call to handle this?
> > CC: @mjklemm
> 
> Looking into this a bit further, for the GPU, I think it still makes sense to allocate/deallocate `groupprivate` variables with a pair of `__kmpc_alloc_shared`/`__kmpc_free_shared` calls. `__kmpc_alloc_shared` still allocates in LDS (see [here](https://github.com/llvm/llvm-project/blob/main/openmp/device/src/State.cpp#L145), [here](https://github.com/llvm/llvm-project/blob/main/openmp/device/src/State.cpp#L93), [here](https://github.com/llvm/llvm-project/blob/main/openmp/device/include/DeviceTypes.h#L21), and [here](https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/amdgpuintrin.h#L26)). The difference is that we will have control over the lifetime of the `groupprivate` variable:
> 
> > 7.13 groupprivate Directive
> > ...
> > ... The lifetime of a groupprivate variable is limited to the lifetime of all tasks in the contention group.
> 
> I am thinking of a scenario like this:
> 
> ```fortran
> !$omp target
>   !$omp parallel
>     !$omp groupprivate(x)
>   !$omp end parallel
>   ... the groupprivate allocation should not extend beyond this point ...
> !$omp end target
> ```
> 
> Using a global variable would result in extending the lifetime of the `groupprivate` value beyond the parallel region.
> 
> Finding a proper insertion point for the `__kmpc_free_shared`, however, can be challenging. This will need a bit of more thinking (unless I am not seeing an obvious solution) to handle, for example, standalone `groupprivate` directives not nested inside any OpenMP ops.
> 
> If we go the direction of emitting a pair of `alloc/free` calls, then we have a solution for the CPU as well since we can allocate on the heap when we encounter the directive and free at the end of the proper scope just like the GPU.
> 
> Take this with a grain of salt since I am also new to the `groupprivate` concept and let me know if I missed anything.

Hi @ergawy. Thanks for suggestion. 

As per my understanding,

- `groupprivate` is declarative directive and must appear at module scope as per OpenMP spec and not within executable constructs.
- Addrspace(3) shared global variables matches the spec requirement: "The lifetime of a groupprivate variable is limited to the lifetime of all tasks in the contention group."
  - LDS variables are per-kernel and per-team/workgroup
  - Each team gets its own copy (created at team start, destroyed at team end)
  - Lifetime is automatically limited to the contention group
- On GPUs, inserting kmpc_alloc_shared/kmpc_free_shared requires a strategy like below.
  - At the prologue of the kernel, pick the thread with (0,0,0) index and only this thread does the allocation. All the other threads wait for this thread to finish (barrier insertion required).
  - Threads proceed with execution.
  - Again, at the epilogue of the kernel same thread with (0,0,0) index calls for free of previous allocation after all threads are done execution (barrier insertion again).
  - I have previously implemented this approach for asan related LDS lowering in AMDGPU backend in `amdgpu-sw-lower-lds` pass [LINK](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp)
  - This is possible at llvm IR level, Not sure how this could be achieved at mlir to llvm lowering stage.

https://github.com/llvm/llvm-project/pull/162704