[Openmp-commits] [openmp] [OpenMP][NFC] Remove `DelayedBinDesc` (PR #74360)

Johannes Doerfert via Openmp-commits openmp-commits at lists.llvm.org
Wed Dec 6 12:01:57 PST 2023

jdoerfert wrote:

As promised, I spend some time on this.

- The code that I delete is neither dead nor racy. It's not racy because there is only one thread involved, so that thread circles back after `PM = new...` and adds a image into the `DeleayedBinDesc` container. This is also the reason it is not strictly speaking dead either.
- That said, the code (today) simply adds the image into the `DelayedBinDesc` and then pops them out, so let's see what happens and why we don't need this anymore, and, while we are at it, also why it was needed before.



HSA is found at build time so we do not use the dlwrapper.
On the call that was said to be "the requirement" but I don't believe so.
The dlopen in the issue (#60119) is not the dlwrapper but just a dlopen from HSA.
It also seems I can reproduce this just fine in the dlwrapper and non-dlwrapper setting.


Let's start with the fix:
/d/s/d/r/s/llvm-project ❯❯❯ gcs --name-only
commit 2257e3d2e55a52d70db10e9f4ed1669ab79ace3f
Author: Jon Chesterfield <jonathanchesterfield at gmail.com>
Date:   Sat Jan 21 12:01:13 2023 +0000

    [openmp] Workaround for HSA in issue 60119

Verifying it breaks (so we go one commit back):
/d/s/d/r/s/llvm-project ❯❯❯ git reset --hard HEAD~
/d/s/d/r/s/llvm-project ❯❯❯ cat bug60119.c
int main() {}
/d/s/d/r/s/llvm-project ❯❯❯ touch empty.c
/d/s/d/r/s/llvm-project ❯❯❯ clang -fopenmp --offload-arch=gfx90a -fPIC -shared empty.c -o liba.so
/d/s/d/r/s/llvm-project ❯❯❯ clang -fopenmp --offload-arch=gfx90a -fPIC -shared empty.c -o libb.so
/d/s/d/r/s/llvm-project ❯❯❯ clang -fopenmp --offload-arch=gfx90a -L . -la -lb bug60119.c -o bug60119
/d/s/d/r/s/llvm-project ❯❯❯ LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH  ./bug60119

Yes, it hangs.

Observation 1:

It goes away and works fine with rocm 5.5.0 and later while rocm 5.4.3 and before breaks.


Setup con't:

The system has an old `libomptarget.rtl.cuda.so.15` which fails to load:
omptarget --> Successfully loaded library 'libomptarget.rtl.cuda.so'!
omptarget --> Invalid plugin as necessary interface function (init_plugin) was not found.

Observation 2:

Without this commit (on top of trunk):
/d/s/d/r/s/llvm-project ❯❯❯ gcs --name-only
  commit 1f283a60a4bb896fa2d37ce00a3018924be82b9f (HEAD -> offload_prep7, llvm/main)
  Author: Matt Arsenault <Matthew.Arsenault at amd.com>
  Date:   Fri Nov 10 10:31:40 2023 +0900

      Reapply "RegisterCoalescer: Add implicit-def of super register when coalescing SUBREG_TO_REG"

It just hangs:
#1  0x000015554c0e77a7 in __gthread_once (__func=<optimized out>, __once=0x15554c29d940 <InitializeMemorySSAWrapperPassPassFlag>)
    at .../include/g++/x86_64-redhat-linux/bits/gthr-default.h:700
#2  std::call_once<void* (&)(llvm::PassRegistry&), std::reference_wrapper<llvm::PassRegistry> > (
    __f=@0x15554c0e6280: {void *(llvm::PassRegistry &)} 0x15554c0e6280 <initializeMemorySSAWrapperPassPassOnce(llvm::PassRegistry&)>, __once=...) at .../gcc/12.2.0/snos/include/g++/mutex:859
#3  llvm::call_once<void* (&)(llvm::PassRegistry&), std::reference_wrapper<llvm::PassRegistry> > (
    F=@0x15554c0e6280: {void *(llvm::PassRegistry &)} 0x15554c0e6280 <initializeMemorySSAWrapperPassPassOnce(llvm::PassRegistry&)>, flag=...)
    at .../llvm-project/llvm/include/llvm/Support/Threading.h:89
#15 llvm::omp::target::JITEngine::JITEngine(llvm::Triple::ArchType) () from .../build/llvm/lib/libomptarget.rtl.amdgpu.so

However, this seems unrelated.

Observation 3:

Replacing `libomptarget.rtl.cuda.so.15` with an empty file `libomptarget.rtl.cuda.so`, it works fine. So,
touch libomptarget.rtl.cuda.so
works fine after:
omptarget --> Attempting to load library 'libomptarget.rtl.cuda.so'...
omptarget --> Unable to load library 'libomptarget.rtl.cuda.so': ./libomptarget.rtl.cuda.so: file too short!

The reason is that `libomptarget.rtl.cuda.so.15` will initialize a different LLVM and that doesn't work.

==> *Don't mix LLVM versions in one library (with exposed symbols).*

FWIW, with https://github.com/llvm/llvm-project/pull/74538 this will not happen anymore since we will stop picking up random plugins by default.

Observation 4:

Is the code dead?
No. It is executed by the initial thread.
The thread adds libraries to the delayed container, then registers them.
There is no second thread involved, it's just the one thread (as far as I can tell).


Observation 5:

With this commit (on top of head):
/d/s/d/r/s/llvm-project ❯❯❯ gcs --name-only
  commit 369775d1461f1f7402a0fc9a312bb3b701410dd6 (HEAD -> offload_prep7)
  Author: Johannes Doerfert <johannes at jdoerfert.de>
  Date:   Mon Dec 4 11:16:25 2023 -0800

       [OpenMP][NFC] Remove `DelayedBinDesc`

and the empty `libomptarget.rtl.cuda.so` from above it works fine.

I tested all these rocm versions:
4.2.0            4.5.2            5.1.0            5.2.1            5.3.3            5.4.2            5.5.1            5.7.1
4.3.1            5.0.0            5.1.1            5.2.3            5.4.0            5.4.3            5.6.0            5.7.0            
4.5.0            5.0.2            5.2.0            5.3.0            5.4.1            5.5.0            5.6.1



So what happened? Let's actually look at the original bug.
I will explain below what changed in rocm 5.5.0 to make this work (in basically all cases) but let's stay with rocm 5.3.0.

In the old code (before I started to rewrite libomptarget), we created the `RTLInfoTy` (what is now `PluginAdapterTy`) as part of LoadRTLs().
for (const char *Name : RTLNames) {

      RTLInfoTy &RTL = AllRTLs.back();

      const std::string BaseRTLName(Name);
      if (NextGenPlugins) {
        if (attemptLoadRTL(BaseRTLName + ".nextgen.so", RTL))
So, we added an uninitialized `RTLInfoTy ` into `AllRTLS` when we started to actually load the `RTL` (`PluginAdaptorTy`) and hence the plugin.

You can probably imagine this is going to be a problem.

What happens next is the HSA "bug" that was mentioned. 
`libomptarget.rtl.amdgpu.so` loads HSA.
HSA then calls ` rocr::os::GetLoadedLibs()` which already sounds bad (see https://github.com/RadeonOpenCompute/ROCR-Runtime/blame/adae6c61e10d371f7cbc3d0e94ae2c070cab18a4/src/core/util/lnx/os_linux.cpp#L206)
This function seems to be gone in rocm 5.5.x.
Here, HSA will load all runtimes linked to the application, including our `liba` and `libb`.
Once we get there, we call `__tgt_register_lib` which will iterate over `AllRTLs` and unconditionally call `RTLInfoTy::is_valid_binary` (which is necessary for plugins to provide).
While that should not be an issue but `RTL` above is not initialized yet (for the most part).
So we call a `nullptr` and crash.

Remember, the original issue was a hang not a crash.
The move of `loadRTLs` fixed that hang just fine.
And I really mean fixed, since the expected behavior should be that a register call is valid.
The problem after the move is that we put an uninitialized plugin (adaptor) into the list of all plugin adaptors.

So let's track back when we stopped adding half initialized plugin adaptors into the container.
First, I went back before bc4e0c048aa3cd940b0cea787014c7e8680e5040 and applied this patch.
We get *a* hang again (which rocm < 5.5.0).
This time it's in the libomptarget AMDGPU plugin itself since it was never done initializing the plugin.
The global constructor of the `Plugin` in `Plugin::get()` is running, loads HSA which caused it to call back into the plugin and there it waits for the constructor to be done -> hang.
It doesn't hang/crash like before because we setup `is_valid_binary` (and all the other runtime call pointers) before we call `RTL.init_plugin`.
Said differently, `RTL` in `AllRTLs` is not mostly uninitialized anymore, all function pointers are initialized.

So, let's go back before b80b5f180ba64c2d6d9753ce8b310f1759eb2884 and apply this patch.
That's the last commit when we only loaded `RTL.init_plugin` from the plugin shared library and call it before we setup all the other pointers, including `is_valid_binary`. 
Voila, we get the crash we'd expect as we call `is_valid_binary` which is still a nullptr.



0) The original fix solved two problems: a) The move of loadRTLS prevented the reported hang. Recursion into libomptarget did not stop at register_image anymore but would have segfaulted later without the delay buffer on the outermost level. The buffer was needed because the runtime was put into an inconsistent state and it was accessed during that time via the public interface that expects consistent state. 
1) In the call it was mentioned dlwrapper is needed, that is not the case. It was also mentioned that multiple threads need to concurrently initialize libomptarget, which is also not necessary (or possible, as far as I can tell).
2) Since rocm 5.5.0 we don't need any of this anymore. The rocm bug has been fixed, they stopped loading all runtimes unconditionally.
3) Since b80b5f180ba64c2d6d9753ce8b310f1759eb2884 we properly initialize all function pointers in the (now) PluginAdaptorTy before we start using those pointers. That prevents the crash you'd see without the delay buffer. We still will hang.
4) Since bc4e0c048aa3cd940b0cea787014c7e8680e5040 we will not add PluginAdaptors into the container of all plugin adaptors before the plugin is not ready. The error is thereby gone. When and old HSA loads other libraries they can call register_image but that will simply not register the image with the plugin we are currently initializing. That seems like reasonable behavior, thought it is good to keep in mind if we ever want a kernel library (@jhuber6 @mjklemm). We can still have a standalone kernel library though or load it late after all plugins are setup (which seems reasonable).



I don't see a reason this code cannot be deleted, regardless of the rocm version used.
The reproducer is included and it works just fine with rocm 5.3.0.


More information about the Openmp-commits mailing list