[llvm-branch-commits] [flang] [mlir] [Flang][OpenMP] Add pass to replace allocas with device shared memory (PR #161863)

Thu Feb 26 06:16:36 PST 2026

skatrak wrote:

> One question though: How do we know that the semantics of a particular `llvm.alloca` is meant to be shared memory? This PR is inferring it from its uses, but this doesn't necessarily mean it is the intention. For instance, the PR has exceptions for `private_syms`, but how do we know there are no other exceptions?
> 

At this point in the PR stack I think it's not as clear as in the last one, mainly due to how the logic is arranged and how functions are named. But, in a nutshell, what decides whether a GPU allocation uses stack memory or shared memory is a combination of its context and its uses.
- Context: Only allocations that are done globally for a target region (e.g. `target private`) or per-team are eligible for shared memory by default. Per-thread allocations (i.e. located inside of `parallel`) remain as stack allocations.
- Uses: Whether an eligible allocation is "promoted" to shared memory is based on whether it's accessed or could potentially be accessed by multiple threads (i.e. in a `parallel` region).

I'm making the exception for appearances of the value in a `private` clause because that doesn't count as a read or write use. That case is basically defining another allocation. There might be other similar uses that I haven't identified as exceptions, but if we failed to identify them that only means we'd be using more shared memory than we needed to.
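
To make the two criteria and the `private` exception concrete, here is a minimal sketch of the decision as a standalone C++ model. All names in it (`AllocContext`, `UseKind`, `Use`, `Allocation`, `isEligible`, `needsSharedMemory`) are hypothetical and made up for illustration; this is not the code of the pass, only a restatement of the rule described above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model for illustration; these are not the pass's real types.
enum class AllocContext { TargetRegion, Teams, Parallel };
enum class UseKind { ReadOrWrite, PrivateClause };

struct Use {
  UseKind kind;
  bool mayRunInParallel; // the use is, or could be, reached inside `parallel`
};

struct Allocation {
  AllocContext context;
  std::vector<Use> uses;
};

// Context: per-thread allocations (inside `parallel`) stay on the stack.
bool isEligible(const Allocation &alloc) {
  return alloc.context != AllocContext::Parallel;
}

// Uses: an eligible allocation is promoted only if some read/write use may
// be executed by multiple threads.
bool needsSharedMemory(const Allocation &alloc) {
  if (!isEligible(alloc))
    return false;
  for (const Use &use : alloc.uses) {
    // Appearing in a `private` clause defines another allocation; it is not
    // counted as a read or write of this one.
    if (use.kind == UseKind::PrivateClause)
      continue;
    if (use.mayRunInParallel)
      return true;
  }
  return false;
}

// Usage example covering the cases described above.
void checkExamples() {
  Allocation teamValueUsedInParallel{AllocContext::Teams,
                                     {{UseKind::ReadOrWrite, true}}};
  Allocation teamValueUsedSerially{AllocContext::Teams,
                                   {{UseKind::ReadOrWrite, false}}};
  Allocation perThreadValue{AllocContext::Parallel,
                            {{UseKind::ReadOrWrite, true}}};
  Allocation onlyInPrivateClause{AllocContext::TargetRegion,
                                 {{UseKind::PrivateClause, true}}};

  assert(needsSharedMemory(teamValueUsedInParallel));
  assert(!needsSharedMemory(teamValueUsedSerially));
  assert(!needsSharedMemory(perThreadValue));
  assert(!needsSharedMemory(onlyInPrivateClause));
}
```

In this model, a use kind that should have been treated as an exception but wasn't would only turn a `false` into a `true`, which matches the point above: the cost of a missed exception is extra shared memory, not incorrect behavior.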

> It's been some time, but I think I meant something like this:
> 
> ```c
> void foobar(void *x) {
>   // access *x
> }
> 
> #pragma omp target private(x)
> {
>   #pragma omp teams
>   {
>     foobar(&x);
>   }
> }
> ```
> 
> `foobar` (the "indirectly reached region") could not access the shared_alloca `x` without having it passed as a pointer argument. And `foobar` does not need to know where the memory comes from. I was referring to the second paragraph of the .td doc of StackToShared.

The second paragraph of that description is talking about this other type of case:
```c++
void foobar() {
  int x = 0;
  #pragma omp parallel for reduction(+: x)
  for (int i = 0; i < 10; ++i)
    x += i;
}

#pragma omp target [teams]
{
  foobar();
}
```
There, we need to identify the allocation for `x` in `foobar` as one that actually needs shared memory to function properly. The reason is that it's reached from the main target thread or the main thread for its team, and is then expected to be accessible across threads in that team.

I'm not doing any inter-procedural analysis here to ensure this, but we do know for sure that if the function still exists, then it can only be reached from a target region because it does not contain a target region itself. The only case where I'm not sure this does the expected/right thing is nested `parallel` regions, like:
```c++
void foobar() { /*same content as above...*/ }

#pragma omp target...
{
  #pragma omp parallel
  {
    foobar();
  }
}
```
There, we'd already be in a `parallel` context when we reached the function, and we'd have all these threads request shared memory for the nested parallel loop. Clang appears to also use shared memory in that particular test, but this is probably something to investigate further as follow-up work.

https://github.com/llvm/llvm-project/pull/161863