[llvm] RFC: [Offload] Design for async error handling (PR #155596)

Wed Oct 1 06:58:22 PDT 2025

RossBrunton wrote:

Going to close this, as I don't see myself working on this any time soon. I was originally going to work on implementing an example in the AMDGPU rtl, but never got around to starting it. I'll detail what I was planning on doing, although I can't say for sure this can actually be implemented.

Basically, every queue has an "error" signal. When a task completes, it decrements either its "success" signal (which is also the input trigger for the next task) or the "error" signal (of which only one exists per queue). `olSyncQueue` and friends wait on both the "error" signal and the "success" signal from the final task, and can use whichever of the two fired to determine whether the queue encountered an error or not.

This means that a task failing effectively "skips the queue" and causes the entire pipeline to stop, but also allows `olSyncQueue` to actually terminate rather than hang.
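
To make the idea concrete, here is a rough single-threaded model of the scheme described above. It is only a sketch and not the AMDGPU plugin code: `Signal`, `ModelQueue`, and `SyncResult` are hypothetical names invented for illustration, the "device" is simulated by a plain loop, and a signal is modelled as a counter that starts at 1 and fires when it reaches 0.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a hardware signal: starts at 1, fires at 0.
struct Signal {
  int Value = 1;
  void decrement() { --Value; }
  bool fired() const { return Value <= 0; }
};

enum class SyncResult { Success, Error };

// Hypothetical model of a queue: one success signal per task, plus a single
// error signal shared by the whole queue.
class ModelQueue {
public:
  // A task reports success by returning true and failure by returning false.
  void enqueue(std::function<bool()> Task) {
    Tasks.push_back(std::move(Task));
    SuccessSignals.emplace_back();
  }

  // Simulates the device draining the queue, then the olSyncQueue-style wait
  // on both the error signal and the final task's success signal.
  SyncResult sync() {
    for (std::size_t I = 0; I < Tasks.size(); ++I) {
      // A task is only triggered by its predecessor's success signal, so
      // nothing after a failed task ever starts.
      if (I > 0 && !SuccessSignals[I - 1].fired())
        break;
      if (Tasks[I]())
        SuccessSignals[I].decrement();
      else
        ErrorSignal.decrement();
    }
    // Whichever signal fired tells the caller how the queue ended.
    if (ErrorSignal.fired())
      return SyncResult::Error;
    assert(Tasks.empty() || SuccessSignals.back().fired());
    return SyncResult::Success;
  }

private:
  std::vector<std::function<bool()>> Tasks;
  std::vector<Signal> SuccessSignals;
  Signal ErrorSignal;
};
```

With three tasks enqueued where the second returns `false`, the third never runs and `sync()` returns `SyncResult::Error` instead of waiting forever on the last task's success signal. A real implementation would wait on both signals concurrently rather than checking them after a loop.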

The weird dependency system between queues means that this failure signal still has to be sent to the "wait"-ing queue if the queue it is waiting on happens to encounter an error. I'm not 100% sure on how to implement that part.

https://github.com/llvm/llvm-project/pull/155596