[Mlir-commits] [mlir] [mlir][gpu] Deprecate gpu::Serialization* passes. (PR #65857)

Tue Sep 12 18:42:29 PDT 2023

================
@@ -78,11 +78,13 @@ void mlir::sparse_tensor::buildSparseCompiler(
 
   // Finalize GPU code generation.
   if (gpuCodegen) {
-#if MLIR_GPU_TO_CUBIN_PASS_ENABLE
-    pm.addNestedPass<gpu::GPUModuleOp>(createGpuSerializeToCubinPass(
-        options.gpuTriple, options.gpuChip, options.gpuFeatures));
-#endif
+    GpuNVVMAttachTargetOptions nvvmTargetOptions;
+    nvvmTargetOptions.triple = options.gpuTriple;
+    nvvmTargetOptions.chip = options.gpuChip;
+    nvvmTargetOptions.features = options.gpuFeatures;
+    pm.addPass(createGpuNVVMAttachTarget(nvvmTargetOptions));
     pm.addPass(createGpuToLLVMConversionPass());
+    pm.addPass(createGpuModuleToBinaryPass());
----------------
fabianmcg wrote:

Oh, I see, I think I know what's happening: the tests use something like `chip=sm_80` and are running on a device with a lower compute capability (`sm_70`). 

Background: `--gpu-to-cubin` never used the requested compute capability at any point; it always compiled the code for the arch found at compile time, which is why the tests never gave issues before.

With this new method, there are 2 mechanisms:
- `format=bin` compiles only for the specified chip, and if there's an arch mismatch you get a `CUDA_ERROR_NO_BINARY_FOR_GPU` error.
- `format=fatbin` generates a binary for the specified chip and also embeds the PTX, allowing the driver to JIT the code. However, something I found recently is that the driver can only JIT to a higher CC, so for example `chip=sm_50` can run on `sm_80`, but `chip=sm_80` cannot run on `sm_50`, and in this case one also gets `CUDA_ERROR_NO_BINARY_FOR_GPU`.

NOTE: `fatbin` is the default format.
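
To make the two rules above concrete, here is a minimal sketch of them as a compile-time check. It is not MLIR or CUDA API: `Format` and `canRunOn` are hypothetical names, compute capabilities are written as plain integers (`80` for `sm_80`), and `bin` is treated as an exact-match requirement, as described above.

```cpp
// Hypothetical model of the loading rules described above; not MLIR or CUDA API.
enum class Format { Bin, Fatbin };

// chipCC:   compute capability the module was compiled for (80 means sm_80).
// deviceCC: compute capability of the GPU the test runs on.
// Returns false in the cases where one would get CUDA_ERROR_NO_BINARY_FOR_GPU.
constexpr bool canRunOn(Format format, int chipCC, int deviceCC) {
  return format == Format::Bin
             ? deviceCC == chipCC   // assumption: bin needs the exact arch
             : deviceCC >= chipCC;  // fatbin: embedded PTX can be JITed upward only
}

// chip=sm_50 as fatbin runs on an sm_80 device.
static_assert(canRunOn(Format::Fatbin, 50, 80), "PTX is JITed to the higher CC");
// chip=sm_80 as fatbin does not run on an sm_70 device (the failing-test case).
static_assert(!canRunOn(Format::Fatbin, 80, 70), "cannot JIT to a lower CC");
// chip=sm_80 as bin does not run on an sm_70 device either.
static_assert(!canRunOn(Format::Bin, 80, 70), "arch mismatch");
```

Under this model, a test built with `chip=sm_80` fails on an `sm_70` device in both formats, while `chip=sm_50` with the default `fatbin` format passes on both.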

The issue is this line [sparse-matmul-lib.mlir#L5](https://github.com/llvm/llvm-project/blob/main/mlir/test/Integration/Dialect/SparseTensor/GPU/CUDA/sparse-matmul-lib.mlir#L5).

How to solve? Use the lowest arch required for the test; that's why most integration tests in trunk use `sm_50`.

Please let me know if the above works. I think it will, because those tests passed when I ran them on an A100.

Wrt blaze, I'm working on the JIT-only path, so stay tuned.

https://github.com/llvm/llvm-project/pull/65857