[PATCH] D133956: git push Cuda Check for ignored errors after calling a CUDA kernel

Thu Sep 15 12:03:45 PDT 2022

tra added a comment.

> Our linter cannot easily check whether the error was reset

It can not in principle. Many CUDA errors are 'sticky' and can only be cleared by resetting the GPU or exiting the application and the former is virtually never used beyond toy examples (resetting a GPU would clear a lot of state, including memory allocations and restoring it is usually not feasible in practice). 
E.g:
https://stackoverflow.com/questions/43659314/how-can-i-reset-the-cuda-error-to-success-with-driver-api-after-a-trap-instructi
https://stackoverflow.com/questions/56329377/reset-cuda-context-after-exception/56330491

The checker has no way to tell whether the returned error is sticky and the stickiness can start at any CUDA runtime call, so, generally speaking, all CUDA API calls must be checked and any of them may be the one producing sticky errors due to preceding calls. At the very minimum, in addition to `<<<...>>>` kernel launches, user may also launch kernels via `cudaLaunchKernel()` and, I believe, CUDA runtime itself may launch some helper kernels under the hood, so I would not be surprised to see other sources of errors.

I think ultimately the checker should be generalized to flag all unchecked CUDA runtime calls. The problem is that that is going to be exceedingly noisy in practice as a lot of real code does not bother to check for the errors consistently. Limiting the checks to kernel launches may be a reasonable starting point as it would give us the ability to zero in on the culprit kernel by running the app with "CUDA_LAUNCH_BLOCKING".

================
Comment at: clang-tools-extra/test/clang-tidy/checkers/cuda/unsafe-kernel-call-macro-handler.cu:53
+  }
+  auto err = cudaGetLastError();
+
----------------
Just curious -- is it sufficient to just call `cudaGetLastError();` ? Or does the checker require using its result, too? I.e. in practice this particular check will not really do anything useful. The tests below look somewhat inconsistent. 

================
Comment at: clang-tools-extra/test/clang-tidy/checkers/cuda/unsafe-kernel-call-macro-handler.cu:75
+  // CHECK-MESSAGES: :[[@LINE-1]]:{{[0-9]+}}: warning: Possible unchecked error after a kernel launch.
+  // Calling an error-checking function after a kernel is not considered safe.
+  errorCheck();
----------------
WDYM by "is not considered safe" here? How is that different from calling `cudaGetLastError()` and checking its value?

================
Comment at: clang-tools-extra/test/clang-tidy/checkers/cuda/unsafe-kernel-call-macro-handler.cu:80-82
+  b<<<1, 2>>>();; /* The semicolons are here because the
+    detection of the macro is done with a lexer */ ;
+  CUDA_CHECK_KERNEL();
----------------
What would happen with a single `;` as would be seen in the normal user code?

================
Comment at: clang-tools-extra/test/clang-tidy/checkers/cuda/unsafe-kernel-call-macro-handler.cu:112
+  do {
+    auto err2 = cudaGetLastError();
+  } while(0);
----------------
Why does this case produce no warning, while a very similar case above does? In both cases result of `cudaGetLastError()` is assigned to an unused variable within the loop body.

```
  b<<<1, 2>>>();
  // CHECK-MESSAGES: :[[@LINE-1]]:{{[0-9]+}}: warning: Possible unchecked error after a kernel launch.
  for(;;)
    auto err2 = cudaGetLastError(); // Brackets omitted purposefully, since they create an additional AST node

```

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D133956/new/

https://reviews.llvm.org/D133956