[PATCH] D37616: [X86] PR34149 Suboptimal codegen for fast minnum and maxnum.

Sun Sep 10 11:23:47 PDT 2017

spatel added a subscriber: efriedma.
spatel added a comment.

In https://reviews.llvm.org/D37616#865983, @jbhateja wrote:

> In https://reviews.llvm.org/D37616#865981, @spatel wrote:
>
> > In https://reviews.llvm.org/D37616#865798, @jbhateja wrote:
> >
> > > In https://reviews.llvm.org/D37616#865418, @spatel wrote:
> > >
> > > > I might be missing some context here. If we have fast/nnan on these calls, then can't we simplify this in IR to fcmp+select and not have to deal with this in the backend? The intrinsics only exist to make sure that NaN behavior in IR meets the higher level standards, so if we have nnan, then we don't need the intrinsic?
> > >
> > >
> > > Intrinsic functions defer code generation/expansion to the backend; this gives the backend control over generating efficient code for the specific target.
> >
> >
> > It's incorrect that intrinsics are passed unaltered to the backend for expansion/optimization. See the optimizations for both generic and target-specific intrinsics in InstCombiner::visitCallInst().
> >
> > Again, I may be missing some context - who created this IR? Creating a 'call fast llvm.maxnum()' just doesn't make sense to me, so if we can fix that in IR, we should do that. The intrinsic inhibits the large number of potential optimizations for fcmp+select that we have in IR. No target should benefit from having extra NaN semantics requirements provided by the intrinsic that are then overridden by FMF.
> >
> > Please split the FlagsAcquirer diff into a separate patch.
>
>
> Your point that 'call fast @llvm.minnum' should not be generated in the first place, and that fcmp+select should be generated instead, is valid. But it is still syntactically and semantically valid IR that could be thrown at the backend.

I don't see the point of this argument. There are an infinite number of valid IR patterns that we can send to the backend, but the main point of the IR optimizer is to limit that set, so we don't have to increase the complexity of the backend. That's what I'm requesting here: solve this in IR, so the backend never has to worry about it. I'll ask again: who is producing this IR or these nodes? If I'm missing some scenario in which this pattern can be created in the backend, then I agree that we will have to handle it there (but still not the target-specific way you've proposed). If not, let's not add unnecessary code.

> Please consider the following case, compiled for ARM (-mcpu=cortex-r52 -march=arm test.ll -mattr=fp-armv8):
> 
> define <4 x double> @CASEA(<4 x double> %x, <4 x double> %y) {
> 
>   %z = call fast <4 x double> @llvm.minnum.v4f64(<4 x double> %x, <4 x double> %y) readnone
>   ret <4 x double> %z
> 
> }
> 
> define <4 x double> @CASEB(<4 x double> %x, <4 x double> %y) {
> 
>   %c = fcmp ule <4 x double> %x, %y
>   %z = select <4 x i1> %c, <4 x double> %x, <4 x double> %y
>   ret <4 x double> %z
> 
> }
> 
>      
> 
> The instruction selector does not generate vminnm for CASEB, whereas it does for CASEA.
>  As of now, SelectionDAGBuilder generates an fminnum (or fminnan) SDNode for the llvm.minnum intrinsic.
>  Thus, different targets lower fminnum (SDNode) differently.

I acknowledge there may be other bugs here, but I think the ARM backend is behaving correctly for the example as shown (cc'ing @efriedma for expertise).

If we want to create equivalent patterns for these 2 examples, then we must add 'fast' to the fcmp in the 2nd case:

  %c = fcmp fast ule <4 x double> %x, %y

If we do that, we see that the ARM backend is still behaving correctly and optimally to produce vminnm:

$ ./llc -o - minnum.ll -mcpu=cortex-r52 -march=arm  -mattr=fp-armv8 | egrep 'CASE|minnm'
_CASEA:
	vminnm.f64	d25, d22, d17
	vminnm.f64	d24, d23, d16
	vminnm.f64	d17, d21, d19
	vminnm.f64	d16, d20, d18
_CASEB:
	vminnm.f64	d25, d22, d17
	vminnm.f64	d24, d23, d16
	vminnm.f64	d17, d21, d19
	vminnm.f64	d16, d20, d18
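
To make the NaN difference between the two original examples concrete, here is a rough scalar model in Python. This is only a sketch: the function names (minnum, fcmp_ule, select_min) are mine, not LLVM APIs, and the model ignores signed zeros and vectors. It assumes that llvm.minnum returns the non-NaN operand when exactly one operand is NaN, and that 'ule' is true when either operand is NaN.

```python
import math

def minnum(x, y):
    # Model of llvm.minnum: if exactly one operand is NaN, return the other.
    if math.isnan(x):
        return y
    if math.isnan(y):
        return x
    return min(x, y)

def fcmp_ule(x, y):
    # Model of 'fcmp ule': unordered (either operand NaN) or x <= y.
    return math.isnan(x) or math.isnan(y) or x <= y

def select_min(x, y):
    # Model of CASEB without fast-math flags: select(fcmp ule x, y), x, y.
    return x if fcmp_ule(x, y) else y

nan = float("nan")
# With a NaN in the first operand the two patterns disagree:
# minnum picks the number, the select picks the NaN.
print(minnum(nan, 1.0), select_min(nan, 1.0))
# Without NaNs they agree.
print(minnum(2.0, 1.0), select_min(2.0, 1.0))
```

Under this model the two patterns only match when NaN inputs are excluded, which is what nnan (implied by 'fast') on the fcmp asserts.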

https://reviews.llvm.org/D37616