[PATCH] D133956: git push Cuda Check for ignored errors after calling a CUDA kernel

Bartłomiej Cieślar via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Thu Sep 15 10:44:16 PDT 2022


barcisz created this revision.
Herald added subscribers: mattd, carlosgalvezp, yaxunl, mgorny.
Herald added a project: All.
barcisz requested review of this revision.
Herald added a project: clang-tools-extra.
Herald added a subscriber: cfe-commits.

Add cuda-unchecked-kernel-call check
====================================

Motivation
----------

Calls to CUDA kernels can yield errors after their invocation. These errors can be obtained by calling `cudaGetLastError()`, which also resets CUDA’s error state. There is a non error-resetting version of this function called `cudaPeekAtLastError()`, but the lint check does not accept this (see 
below). A limited set of errors can block a kernel from launching including driver malfunctions, **trying to allocate too much shared memory**, or using too many threads or blocks. Since those errors can cause unexpected behavior that blocks subsequent computation, they should be caught as close 
to the launch point as possible. The lint check enforces this by requiring that every kernel be immediately followed by an error check.

Behavior
--------

The **cuda-unchecked-kernel-call** checks whether there is a call to `cudaGetLastError()` directly after each kernel call. To be precise, there can be no side-effecting or branching code between the kernel call and the call to `cudaGetLastError()`, such as branching due to the `?:` operator or due 
to a call to a function. This is because a more complicated behavior is likely to be harder for humans to read and would would be significantly slower to automatically check. We want to encourage well-designed, multi-line macros that check for errors, so we explicitly allow macros whose content is 
`do { /* error check */ } while(false)`, since this is the recommended way <https://wiki.sei.cmu.edu/confluence/display/c/PRE10-C.+Wrap+multistatement+macros+in+a+do-while+loop> of making multi-line macros.
The check does also accept the handler it was provided as a valid way to handle the error, even if the handler does not comply with the rule above (or is a function which cannot be easily and quickly checked). However, it is still encouraged to call `cudaGetLastError()` early in the handler’s code 
for the code to be readable.

Automatic fixes
---------------

The lint check can be configured to automatically fix the issue by adding an error handling macro right after the kernel launch. You can specify the error handler for your project by setting the **HandlerName** option for the **cuda-unchecked-kernel-call**. Here is an example of how this fix can 
transform unhandled code from:

  void foo(bool b) {
    if (b)
      kernel<<<x, y>>>();
  }

to

  void foo(bool b) {
    if(b)
      {kernel<<<x, y>>>(); `C10_CUDA_KERNEL_LAUNCH_CHECK`();}
  }

The specific handler used for this example is taken from PyTorch and its definition can be found here <https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDAException.h>.

Known Limitations
-----------------

Using cudaPeekAtLastError()
---------------------------

`cudaPeekAtLastError()` can also be used to check for CUDA kernel launch errors. However, there are several reasons why this is not and will most likely not be considered as a valid way to check for errors after kernel invocations. This all has to do with the purpose of the function, which is to 
not reset the internal error variable:

- Subsequent kernel calls, even if they don’t produce any errors, will seem as if they produced an error due to the error not being reset. This behavior is easy to overlook and may cause he significant difficulty in debugging.
- Our linter cannot easily check whether the error was reset before subsequent kernel calls. It might even be impossible to do so due to the error leaking inter-procedurally from functions whose code we can’t access.

Checking for errors that occurred while a kernel was running
------------------------------------------------------------

Our linter does not check whether errors occurred while a kernel was running. The linter only enforces checks that a kernel launched correctly. `cudaDeviceSynchronize()` and similar API calls can be used to see that a kernel’s computation was successful, but these are blocking calls, so we are not 
able to suggest where they should go automatically.


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D133956

Files:
  clang-tools-extra/clang-tidy/cuda/CMakeLists.txt
  clang-tools-extra/clang-tidy/cuda/CudaTidyModule.cpp
  clang-tools-extra/clang-tidy/cuda/UnsafeKernelCallCheck.cpp
  clang-tools-extra/clang-tidy/cuda/UnsafeKernelCallCheck.h
  clang-tools-extra/clang-tidy/utils/FixItHintUtils.cpp
  clang-tools-extra/docs/ReleaseNotes.rst
  clang-tools-extra/docs/clang-tidy/checks/cuda/unsafe-kernel-call.rst
  clang-tools-extra/docs/clang-tidy/checks/list.rst
  clang-tools-extra/test/clang-tidy/checkers/Inputs/Headers/cuda/cuda.h
  clang-tools-extra/test/clang-tidy/checkers/Inputs/Headers/cuda/cuda_runtime.h
  clang-tools-extra/test/clang-tidy/checkers/cuda/unsafe-kernel-call-function-handler.cu
  clang-tools-extra/test/clang-tidy/checkers/cuda/unsafe-kernel-call-macro-handler.cu

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D133956.460455.patch
Type: text/x-patch
Size: 31959 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20220915/0a3072f4/attachment-0001.bin>


More information about the cfe-commits mailing list