[PATCH] D133956: git push Cuda Check for ignored errors after calling a CUDA kernel

Thu Sep 15 14:59:26 PDT 2022

barcisz added a comment.

In D133956#3793022 <https://reviews.llvm.org/D133956#3793022>, @tra wrote:

> I think ultimately the checker should be generalized to flag all unchecked CUDA runtime calls. The problem is that that is going to be exceedingly noisy in practice as a lot of real code does not bother to check for the errors consistently. Limiting the checks to kernel launches may be a reasonable starting point as it would give us the ability to zero in on the culprit kernel by running the app with "CUDA_LAUNCH_BLOCKING".

By that do you mean that the way the check is now it is acceptable or that it should be improved to handle intra-procedural analysis? The intention with this check is to work a lot in tandem with the one in D133804 <https://reviews.llvm.org/D133804>, which therefore prevents most such cases. The practice that the check checks for is also commonly used in ML frameworks which heavily rely on CUDA, so not catching such cases might still be helpful for them just for the sake of preserving code consistency and catching such errors (since if there is an error then that means that some part of the code was broken anyways). Thus, the check is optimized for lowering false positives during static checking and for a practice lowering the number of false negatives within the CUDA code.

================
Comment at: clang-tools-extra/test/clang-tidy/checkers/cuda/unsafe-kernel-call-macro-handler.cu:53
+  }
+  auto err = cudaGetLastError();
+
----------------
tra wrote:
> Just curious -- is it sufficient to just call `cudaGetLastError();` ? Or does the checker require using its result, too? I.e. in practice this particular check will not really do anything useful. The tests below look somewhat inconsistent. 
Technically it does not require the user to actually use the value of `cudaGetLastError()`, but

  # If they are calling it then they most likely did not place this call there randomly and are using it to check for the error returned by the kernel
  # the check being introduced in D133804 can be used to check if the return value has been used, so checking it here as well would have been a duplication

================
Comment at: clang-tools-extra/test/clang-tidy/checkers/cuda/unsafe-kernel-call-macro-handler.cu:75
+  // CHECK-MESSAGES: :[[@LINE-1]]:{{[0-9]+}}: warning: Possible unchecked error after a kernel launch.
+  // Calling an error-checking function after a kernel is not considered safe.
+  errorCheck();
----------------
tra wrote:
> WDYM by "is not considered safe" here? How is that different from calling `cudaGetLastError()` and checking its value?
As in the check does not do inter-procedural analysis short of adding the handler to AcceptedHandlers, so the check will flag such occurences

================
Comment at: clang-tools-extra/test/clang-tidy/checkers/cuda/unsafe-kernel-call-macro-handler.cu:80-82
+  b<<<1, 2>>>();; /* The semicolons are here because the
+    detection of the macro is done with a lexer */ ;
+  CUDA_CHECK_KERNEL();
----------------
tra wrote:
> What would happen with a single `;` as would be seen in the normal user code?
Nothing, it would work just fine; it's rather that all other kernel calls in this test use a single `;` so I want to check this case here

================
Comment at: clang-tools-extra/test/clang-tidy/checkers/cuda/unsafe-kernel-call-macro-handler.cu:112
+  do {
+    auto err2 = cudaGetLastError();
+  } while(0);
----------------
tra wrote:
> Why does this case produce no warning, while a very similar case above does? In both cases result of `cudaGetLastError()` is assigned to an unused variable within the loop body.
> 
> ```
>   b<<<1, 2>>>();
>   // CHECK-MESSAGES: :[[@LINE-1]]:{{[0-9]+}}: warning: Possible unchecked error after a kernel launch.
>   for(;;)
>     auto err2 = cudaGetLastError(); // Brackets omitted purposefully, since they create an additional AST node
> 
> ```
Because often a macro will wrap its error handling code in a do {...} while(0) loop and that's why we check this case and simmilar ones with CFG analysis

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D133956/new/

https://reviews.llvm.org/D133956