[Openmp-commits] [openmp] [Libomptarget] Fix JIT on the NVPTX target by calling ptxas manually (PR #77801)

Joseph Huber via Openmp-commits openmp-commits at lists.llvm.org
Thu Jan 11 09:23:58 PST 2024


https://github.com/jhuber6 created https://github.com/llvm/llvm-project/pull/77801

Summary:
A recent patch added an assertion in the GlobalHandler to flag cases where
the image being handled is not an ELF. The assertion began to fire whenever
NVPTX JIT was used, because the JIT pass emits a PTX file rather than an ELF.
The CUModuleLoad method accepts PTX (`.s`) and compiles it to a cubin
internally, but that happens too late: we inspect the ELF directly beforehand
to check for the presence of certain symbols and to read some necessary
constants. This results in inconsistent behaviour.
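
For context, an ELF object and PTX text are easy to tell apart from the
buffer's leading magic bytes. A minimal sketch of such a check (not
necessarily the exact check the GlobalHandler performs), assuming LLVM's
`identify_magic`:

  #include "llvm/ADT/StringRef.h"
  #include "llvm/BinaryFormat/Magic.h"

  // PTX emitted by the NVPTX backend is plain assembly text, while a cubin
  // is an ELF object, so the magic bytes are enough to distinguish them.
  static bool isELF(llvm::StringRef Buffer) {
    switch (llvm::identify_magic(Buffer)) {
    case llvm::file_magic::elf:
    case llvm::file_magic::elf_relocatable:
    case llvm::file_magic::elf_executable:
    case llvm::file_magic::elf_shared_object:
    case llvm::file_magic::elf_core:
      return true;
    default:
      return false;
    }
  }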

To address this, the patch simply invokes `ptxas` directly, similar to how
`lld` is invoked for the AMDGPU JIT pass. This is inevitably slower than
handing the PTX straight to the CUDA routine, due to the overhead of file I/O
and a fork call, but it is necessary for correctness.
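
For reference, the resulting invocation looks roughly like the following
(the temporary file names and the sm_70 architecture are illustrative):

  ptxas -m64 -O2 --gpu-name sm_70 \
    --output-file /tmp/nvptx-post-link-jit-XXXXXX.cubin \
    /tmp/nvptx-pre-link-jit-XXXXXX.s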

CUDA does provide an API for compiling PTX in-process. However, it only
appeared in CUDA 11.1 and is officially shipped only as a static library. The
`libnvidia-ptxjitcompiler.so` installed next to the CUDA driver exposes the
same symbols and could likely be used as a replacement; that would be the
faster solution, but since it is undocumented it may have some issues.
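
For reference, the in-process path would look roughly like the sketch below,
written against the documented nvPTXCompiler interface from CUDA 11.1+
(error handling elided; whether `libnvidia-ptxjitcompiler.so` exports the
same entry points is an assumption, as noted above):

  #include <nvPTXCompiler.h>
  #include <string>
  #include <vector>

  // Rough sketch: compile a PTX string to a cubin image in memory.
  std::vector<char> compilePTX(const char *PTX, size_t Len, const char *Arch) {
    nvPTXCompilerHandle Compiler;
    nvPTXCompilerCreate(&Compiler, Len, PTX);

    // The option spelling mirrors ptxas, e.g. Arch == "sm_70".
    std::string GPUName = std::string("--gpu-name=") + Arch;
    const char *Options[] = {GPUName.c_str(), "-O2"};
    nvPTXCompilerCompile(Compiler, 2, Options);

    size_t Size = 0;
    nvPTXCompilerGetCompiledProgramSize(Compiler, &Size);
    std::vector<char> Cubin(Size);
    nvPTXCompilerGetCompiledProgram(Compiler, Cubin.data());
    nvPTXCompilerDestroy(&Compiler);
    return Cubin;
  }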


From 64d0aaa65840d33e4bcfa15e72b5731f48e062ac Mon Sep 17 00:00:00 2001
From: Joseph Huber <huberjn at outlook.com>
Date: Thu, 11 Jan 2024 11:19:25 -0600
Subject: [PATCH] [Libomptarget] Fix JIT on the NVPTX target by calling ptxas
 manually

Summary:
A recent patch added an assertion in the GlobalHandler to flag cases where
the image being handled is not an ELF. The assertion began to fire whenever
NVPTX JIT was used, because the JIT pass emits a PTX file rather than an ELF.
The CUModuleLoad method accepts PTX (`.s`) and compiles it to a cubin
internally, but that happens too late: we inspect the ELF directly beforehand
to check for the presence of certain symbols and to read some necessary
constants. This results in inconsistent behaviour.

To address this, the patch simply invokes `ptxas` directly, similar to how
`lld` is invoked for the AMDGPU JIT pass. This is inevitably slower than
handing the PTX straight to the CUDA routine, due to the overhead of file I/O
and a fork call, but it is necessary for correctness.

CUDA does provide an API for compiling PTX in-process. However, it only
appeared in CUDA 11.1 and is officially shipped only as a static library. The
`libnvidia-ptxjitcompiler.so` installed next to the CUDA driver exposes the
same symbols and could likely be used as a replacement; that would be the
faster solution, but since it is undocumented it may have some issues.
---
 .../plugins-nextgen/cuda/src/rtl.cpp          | 62 +++++++++++++++++++
 1 file changed, 62 insertions(+)

diff --git a/openmp/libomptarget/plugins-nextgen/cuda/src/rtl.cpp b/openmp/libomptarget/plugins-nextgen/cuda/src/rtl.cpp
index fb51f96c07b74d..7ccd0b4fef2840 100644
--- a/openmp/libomptarget/plugins-nextgen/cuda/src/rtl.cpp
+++ b/openmp/libomptarget/plugins-nextgen/cuda/src/rtl.cpp
@@ -28,6 +28,9 @@
 #include "llvm/Frontend/OpenMP/OMPConstants.h"
 #include "llvm/Frontend/OpenMP/OMPGridValues.h"
 #include "llvm/Support/Error.h"
+#include "llvm/Support/FileOutputBuffer.h"
+#include "llvm/Support/FileSystem.h"
+#include "llvm/Support/Program.h"
 
 namespace llvm {
 namespace omp {
@@ -397,6 +400,65 @@ struct CUDADeviceTy : public GenericDeviceTy {
     return callGlobalCtorDtorCommon(Plugin, Image, /*IsCtor=*/false);
   }
 
+  Expected<std::unique_ptr<MemoryBuffer>>
+  doJITPostProcessing(std::unique_ptr<MemoryBuffer> MB) const override {
+    // TODO: We should be able to use the 'nvidia-ptxjitcompiler' interface to
+    //       avoid the call to 'ptxas'.
+    SmallString<128> PTXInputFilePath;
+    std::error_code EC = sys::fs::createTemporaryFile("nvptx-pre-link-jit", "s",
+                                                      PTXInputFilePath);
+    if (EC)
+      return Plugin::error("Failed to create temporary file for ptxas");
+
+    // Write the PTX contents of the memory buffer to the temporary input file.
+    Expected<std::unique_ptr<FileOutputBuffer>> OutputOrErr =
+        FileOutputBuffer::create(PTXInputFilePath, MB->getBuffer().size());
+    if (!OutputOrErr)
+      return OutputOrErr.takeError();
+    std::unique_ptr<FileOutputBuffer> Output = std::move(*OutputOrErr);
+    llvm::copy(MB->getBuffer(), Output->getBufferStart());
+    if (Error E = Output->commit())
+      return std::move(E);
+
+    SmallString<128> PTXOutputFilePath;
+    EC = sys::fs::createTemporaryFile("nvptx-post-link-jit", "cubin",
+                                      PTXOutputFilePath);
+    if (EC)
+      return Plugin::error("Failed to create temporary file for ptxas");
+
+    // Try to find `ptxas` in the path to compile the PTX to a binary.
+    const auto ErrorOrPath = sys::findProgramByName("ptxas");
+    if (!ErrorOrPath)
+      return Plugin::error("Failed to find 'ptxas' on the PATH.");
+
+    std::string Arch = getComputeUnitKind();
+    StringRef Args[] = {*ErrorOrPath,
+                        "-m64",
+                        "-O2",
+                        "--gpu-name",
+                        Arch,
+                        "--output-file",
+                        PTXOutputFilePath,
+                        PTXInputFilePath};
+
+    std::string ErrMsg;
+    if (sys::ExecuteAndWait(*ErrorOrPath, Args, std::nullopt, {}, 0, 0,
+                            &ErrMsg))
+      return Plugin::error("Running 'ptxas' failed: %s\n", ErrMsg.c_str());
+
+    auto BufferOrErr = MemoryBuffer::getFileOrSTDIN(PTXOutputFilePath.data());
+    if (!BufferOrErr)
+      return Plugin::error("Failed to open temporary file for ptxas");
+
+    // Clean up the temporary files afterwards.
+    if (sys::fs::remove(PTXOutputFilePath))
+      return Plugin::error("Failed to remove temporary file for ptxas");
+    if (sys::fs::remove(PTXInputFilePath))
+      return Plugin::error("Failed to remove temporary file for ptxas");
+
+    return std::move(*BufferOrErr);
+  }
+
   /// Allocate and construct a CUDA kernel.
   Expected<GenericKernelTy &>
   constructKernel(const __tgt_offload_entry &KernelEntry) override {


