[llvm] [AMDGPU] Disable atomic optimization of fadd/fsub with result (PR #96479)

Wed Jul 3 08:22:36 PDT 2024

jayfoad wrote:

> We have very important supercomputer customers waiting for this who are going to be dissatisfied if it only works when the result is not used. We need an optimized implementation for the returned result case. If the CTS being broken makes this happen faster then I think it should remain broken.

The golden rule of compiler development is correctness trumps performance. You're welcome to have a fast but broken implementation downstream. Or we could work together on fixing the fast path so it is also correct :)

> Well, I'm not sure which implementation of the optimization we're talking about. Is it the WMM or the other or both?

The bug was in the uniform path, which does not need to generate any DPP or "Iterative" code. The fix was to treat uniform inputs the same as divergent inputs, so they *will* generate some DPP or "Iterative" code.
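
For readers unfamiliar with why the "with result" case is the hard one, here is a rough Python model of the general idea, not the actual AMDGPU pass. The function names and the list-of-lanes representation are made up for illustration. It compares one atomic fadd per lane against a single combined atomic plus an exclusive prefix sum that rebuilds each lane's returned value, and it sends uniform inputs down the same scan path as divergent ones.

```python
def reference_fadd(memory, lane_values):
    """Hypothetical baseline: one atomic fadd per active lane, in lane order.

    Returns the final memory value and the old value each lane got back.
    """
    results = []
    for v in lane_values:
        results.append(memory)  # the atomic returns the value before the add
        memory += v
    return memory, results


def scanned_fadd(memory, lane_values):
    """Hypothetical sketch of an iterative-style optimization.

    Walk the active lanes to build an exclusive prefix sum, issue a single
    atomic with the wave total, then rebuild each lane's returned value.
    """
    prefix = []
    running = 0.0
    for v in lane_values:
        prefix.append(running)  # sum of the values from all earlier lanes
        running += v
    old = memory
    memory += running  # the single combined atomic
    # Floating-point addition is not associative, so in general these can
    # round differently from the per-lane baseline. The inputs below are
    # exactly representable, so the two versions agree.
    results = [old + p for p in prefix]
    return memory, results


divergent = [1.0, 2.5, 0.5, 4.0]
uniform = [2.0, 2.0, 2.0, 2.0]  # same value in every lane, same code path

print(scanned_fadd(10.0, divergent))
print(scanned_fadd(10.0, uniform))
```

When the result is unused, only the combined total matters. The per-lane reconstruction step is what the returned-result case adds.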

https://github.com/llvm/llvm-project/pull/96479