[llvm] Add Dead Block Elimination to NVVMReflect (PR #144171)

Artem Belevich via llvm-commits llvm-commits at lists.llvm.org
Tue Jun 17 14:57:47 PDT 2025


Artem-B wrote:

You're arguing about the degree of optimization, but the fact remains that you are doing an optimization.

> It's hard to say why you haven't run into a bug with this. It might be that running llc with -O0 is uncommon for NVPTX. I think we had a bug internally that was the result of this a couple years ago.

The only code NVVMReflect operates on in practice is CUDA's libdevice. Libdevice is linked into the CUDA module on an as-needed basis, so we only pull in the functions that are actually needed by the module. It's quite possible that we have simply never compiled with `-O0` a function that pulls in uncompileable IR from libdevice. That's another point supporting my assertion that the issue is a corner case, and that adding an escape-hatch knob to enable the DCE pass should be both sufficient and minimally invasive. It's a reasonable (IMO) trade-off for the user: either pristine `-O0` but an uncompileable result (I'll address this below), or working IR with DCE applied even at `-O0`.

As for the compilability of IR that relies on NVVMReflect, this is the same issue as an attempt to compile code that looks like this:
```
void foo() {
  if (__X86__)
    asm("some x86 asm");
  else
    asm("some other CPU assembly");
}
```

Should this code be expected to compile to something sensible with `-O0`? With any other optimization option? I would argue that the answer is "no", and that the same argument applies to libdevice and the use of nvvm_reflect() if one of the `if` branches contains uncompileable code. We could make the example above work with C++17 `if constexpr`, but we have no such mechanism at the IR level.

Considering that libdevice and nvvm_reflect() do exist, I'm OK with giving the user an option to make it work (with reasonable trade-offs), but I do not want to create an illusion that it is something we want to support. NVVMReflect's attempt to be an IR-level preprocessor is broken by design, IMO. It happens to work most of the time, but it implicitly relies on things IR does not guarantee, and I do not think we want to add such guarantees.

If for some reason we do want to provide a library that contains IR for different targets, it should be shipped as separate per-target IR blobs, IMO. These days Clang supports LTO and knows how to package multiple IR variants into an object file and link the right blobs together at the end, so we have a better mechanism for providing GPU-side libraries.

@jhuber6 WDYT? 


https://github.com/llvm/llvm-project/pull/144171
