[PATCH] D112760: Require 'contract' fast-math flag for FMA generation

Artem Belevich via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Nov 2 16:20:37 PDT 2021


tra added a subscriber: jdoerfert.
tra added inline comments.


================
Comment at: clang/test/CodeGenCUDA/fp-contract.cu:3-7
+// FIXME: This test fails. The comment below describes broken behavior.
+//        The front end should generate IR for the semantics it expects and
+//        backends should respect the IR. Backends should never "disregard"
+//        elements of the IR.
+
----------------
andrew.w.kaylor wrote:
> tra wrote:
> > andrew.w.kaylor wrote:
> > > tra wrote:
> > > > andrew.w.kaylor wrote:
> > > > > tra wrote:
> > > > > > We do need to have a way to preserve current behavior for CUDA compilation. There are many existing users that implicitly assume it.
> > > > > > 
> > > > > > 
> > > > > When you say you want to preserve "the current behavior", do you mean using "fast" as the default "fp-contract" setting, or also ignoring fp-contract-related pragmas when fp-contract=fast is used?
> > > > > 
> > > > > I certainly understand wanting fp-contract=fast to be the default behavior, but I'm puzzled by the difference in behavior that was introduced between HIP and CUDA, wherein HIP respects the pragmas but CUDA doesn't.
> > > > The idea is not to disturb the status quo. Major CUDA users are sort of used to clang being reasonably close to what NVCC does by default. What that is, exactly, is not always clear. The current state of affairs has been working well enough. Changing how FP gets compiled will likely trigger a noticeable number of test failures due to both numerical differences and performance regressions. We have somewhat decent coverage for the former in TensorFlow; performance regressions would be harder to spot.
> > > > 
> > > > I can test the patch on our tensorflow tests and see how it fares.
> > > > 
> > > > If there are nontrivial failures, we will need to consider how to phase in the changes w/o causing unnecessary trouble for the users and/or give them an escape hatch option to keep things working until they can fix their code or tests.
> > > > 
> > > > >  puzzled by the difference in behavior that was introduced between HIP and CUDA
> > > > 
> > > > The details on HIP's need for fp-honor-pragma is in https://github.com/llvm/llvm-project/commit/cb08558caa3bad69213b08e6361586491232c745
> > > > 
> > > > For CUDA things were still working well enough with -ffp-contract=fast, so there was no need to change things.
> > > > 
> > > What I'd like to understand is whether CUDA requires ignoring the pragma when fp-contract=fast is set or if it just needs to use fp-contract=fast by default and doesn't mind that the pragma is ignored. I understand why HIP would want to honor the pragma, and I'd like that to be the normal behavior of fp-contract=fast for all targets.
> > > 
> > > I see that CUDA does respect "#pragma clang fp contract(off)" as a way to disable contraction if the global setting is "fp-contract=on" (https://godbolt.org/z/4d7En36En), so I don't understand why we wouldn't want the pragma to also work with "fp-contract=fast".
> > > 
> > > Also, Zahira Ammarguellat is working on a patch to align the clang behavior and documentation (https://reviews.llvm.org/D107994). We're trying not to break the CUDA behavior in the process. Could you take a look at that patch and provide feedback? Thanks!
> > > What I'd like to understand is whether CUDA requires ignoring the pragma when fp-contract=fast is set 
> > 
> > I don't think so, but I'm not sure. I think ignoring the pragmas was the unfortunate side-effect of `backend with 'Fast' fp fuse option does not respect contract flag`, mentioned in D90174. Whether someone happens to rely on it is hard to tell. 
> > 
> > > or if it just needs to use fp-contract=fast by default and doesn't mind that the pragma is ignored.
> > 
> > I think this is roughly the case. We used `fp-contract=fast` because it matched what we get from nvcc, which does ignore clang pragmas: https://godbolt.org/z/fGW33fo4v
> > 
> > If that behavior changes, it *may* be visible to CUDA users. On the other hand, it should not be widespread and, arguably, clang pragmas that may come from clang headers or the standard library should be respected.
> > 
> > > I see that CUDA does respect "#pragma clang fp contract(off)" as a way to disable contraction if the global setting is "fp-contract=on" (https://godbolt.org/z/4d7En36En), so I don't understand why we wouldn't want the pragma to also work with "fp-contract=fast".
> > 
> > We want GPU-side code to behave the same as the host-side code. For GPUs, we rely on the standard library to provide some math functions, while others come from CUDA headers and libdevice bitcode. I suspect that may be one of the reasons for CUDA compilation to ignore clang pragmas, because they would not be present in the GPU-side variants.
> > 
> > AFAICT, it's not that we want `fp-contract=fast` to ignore pragmas, but that it just conveniently happened to enable fma everywhere.
> > We could use a separate option to disable fp contract pragmas for that and let `fp-contract=fast` honor the pragmas.
> It's not surprising that nvcc would ignore the clang-specific pragma. Also not surprisingly, if I use "#pragma STDC FP_CONTRACT OFF" instead, it doesn't respect that either (though in that case I get a warning saying "unrecognized STDC pragma").
> 
> It seems very likely that CUDA developers just won't use these pragmas. If for whatever reason they do, I think it makes sense to have the pragma behave the same as it would for other targets. I'd also be OK with having clang issue a diagnostic and always disregard the pragma when compiling CUDA code (though I don't think that's the best choice). The current situation where the pragma works with fp-contract=on but not with fp-contract=fast just seems like a bug to me. Basically what I'm saying is that when compiling CUDA code, we should either not allow the pragma or it should work.
> 
> One of the things I'm trying to accomplish with this patch is to fix the target-independent backend to have it generate instructions that implement the semantics of the IR produced by the front end. So if the IR says no contract, there would be no global option that would cause the backend to do otherwise. The upshot of this would be that the pragma could be used to disable contraction, even if fp-contract=fast were used as the base setting for the compilation unit.
> 
> I feel very strongly that this is the way the backend should work relative to the IR. It sounds like we can have the front end continue working as it does for all targets, including HIP and CUDA (which would be my preference). This would effectively eliminate the need for the fp-contract=fast-honor-pragmas setting, since all targets would always honor the pragmas for fp-contract=fast.
> 
> Are there other CUDA stakeholders that I should reach out to for feedback on this?
I agree with your argument in principle.
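
To restate the core of it in IR terms (just a sketch of my understanding, not lifted from the patch itself): the backend would only be allowed to form an fma when the IR itself carries the 'contract' flag, e.g.

  ; Both operations carry 'contract', so the backend may fuse them into an fma:
  %m0 = fmul contract float %a, %b
  %r0 = fadd contract float %m0, %c

  ; No 'contract' flags (e.g. the front end saw "#pragma clang fp contract(off)"),
  ; so the backend must not fuse these, even when it's driven with the 'Fast'
  ; fp fuse option:
  %m1 = fmul float %a, %b
  %r1 = fadd float %m1, %c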

I also think we need a special case to preserve this CUDA quirk. I'm fine with making the standard options behave the way you describe,
but we do need to give CUDA users an escape-hatch option to ignore the pragmas if they run into issues.
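
Concretely, the kind of CUDA code whose codegen would change is something like this (an illustrative sketch only; the exact output obviously depends on the clang version and flags):

  // Compiled with clang's CUDA default of -ffp-contract=fast.
  __device__ float fused(float a, float b, float c) {
    return a * b + c;   // fma today, and still an fma after the patch
  }

  __device__ float unfused(float a, float b, float c) {
  #pragma clang fp contract(off)  // currently ignored under -ffp-contract=fast
    return a * b + c;   // becomes a separate mul + add once the backend
                        // starts honoring the missing 'contract' flag
  }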

Clang is in an unfortunate position: we want host and GPU code to behave identically, but parts of the GPU-side implementation are provided by NVIDIA and we can't change them. The only practical option I see is to have a way to preserve the current behavior when we have to.

Without the escape-hatch option, the only recourse we'll have is to revert the patch.
FP compilation changes are known to bring surprises, so the question is what we're going to do about the issues, not whether we'll see them. IMO an escape hatch combined with incremental follow-up improvements is a better strategy than multiple patch/revert cycles.

> Are there other CUDA stakeholders that I should reach out to for feedback on this?

For the CUDA front end, it's probably myself and @yaxunl, since HIP shares most of the front-end functionality. The OpenMP folks' (@jdoerfert?) work overlaps some of these areas (they use the NVPTX back end). They would probably have more input on numerics, but they likely have different constraints, as OpenMP doesn't need to match NVCC.



CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D112760/new/

https://reviews.llvm.org/D112760


