[cfe-dev] Complex arithmetic ignores -ffast-math after clang r219557, serious performance regressions

Sat Jul 4 17:58:02 PDT 2015

Thanks for the tip, it seems to be working just fine. I’ll leave it in my code until this gets fixed in Clang.

Along similar lines, couldn’t we define the -ffast-math/-freciprocal-math version of __divsc3 as:

__attribute__((always_inline)) static inline float _Complex
__divsc3(const float ar, const float ai, const float br, const float bi)
{
    const float one_over_denominator = 1.0f / (br * br + bi * bi);
    return (float _Complex){ (ar * br + ai * bi) * one_over_denominator,
                             (ai * br - ar * bi) * one_over_denominator };
}

As far as I have seen, both gcc and clang always emit two [v]divss instructions when dividing two complex numbers, or when taking the reciprocal of one, even though only a single real divide is necessary. In my performance-critical code, whenever a divide is unavoidable, I have to write this out by hand to get a single divss instruction.
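
For anyone who wants to try the reciprocal form without shadowing the runtime symbol, here is a minimal sketch under a made-up name. cdiv_recip is hypothetical and is not part of compiler-rt or libgcc. It builds the result with the standard I macro rather than the complex compound literal, on the assumption that not every compiler accepts that initializer form.

```c
#include <complex.h>

/* Hypothetical helper (name invented for illustration): complex divide
 * using one real divide followed by multiplies. It makes no attempt to
 * guard against overflow or underflow in br*br + bi*bi, and it does no
 * inf/NaN recovery, so it is only appropriate where -ffast-math style
 * assumptions are acceptable. */
static inline float _Complex cdiv_recip(float ar, float ai, float br, float bi)
{
    /* The only real division in the function. */
    const float inv = 1.0f / (br * br + bi * bi);

    /* (a / b) = a * conj(b) / |b|^2, with the division replaced by a
     * multiplication with the precomputed reciprocal. */
    const float re = (ar * br + ai * bi) * inv;
    const float im = (ai * br - ar * bi) * inv;

    return re + im * I;
}
```

As a sanity check, cdiv_recip(1.0f, 2.0f, 3.0f, 4.0f) should come out close to 0.44 + 0.08i, since (1+2i)/(3+4i) = (11+2i)/25. Whether the compiler really emits a single divss still has to be confirmed by looking at the generated assembly.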

R Campbell

> On Jul 3, 2015, at 7:44 PM, Matthijs van Duin <matthijsvanduin at gmail.com> wrote:
> 
> A temporary workaround is defining __mulsc3 in your own code... clang seems to pick up on it correctly, e.g.:
> __attribute__(( always_inline ))
> static inline  float _Complex
> __mulsc3( float ar, float ai, float br, float bi)
> {
> 	return (float _Complex){  ar * br - ai * bi,  ar * bi + ai * br  };
> }
> 
> 
> I've noticed it really needs to be static always_inline to get optimized properly, at least using the latest clang-3.7 from Debian sid with:
> -target arm-linux-gnueabihf -mfloat-abi=hard -mcpu=cortex-a8 -mfpu=neon -Ofast
> 
> Different storage class specifications give fascinating differences, even with a function as simple as return a * b; where a and b are its complex float arguments.
> 
> Two curious observations:
> * If my __mulsc3 is declared "extern inline", clang nevertheless emits code for it. I had expected any non-inlineable uses to become references to the standard one.
> * If it is declared static (inline or not) it acquires soft float ABI calling conventions (with associated terrible overhead), and it still gets called in places where __mulsc3 would normally get called. Using always_inline avoids this.
> 
> 
> (Since you're declaring complex mul, you can of course take the opportunity to see if there's any benefit in a different implementation of complex multiply, e.g.
> 	float t = ai * ( br - bi );
> 	return (float _Complex){  br * (ar - ai) + t,  bi * (ar + ai) + t  };
> 
> or one of its many variants. Probably not unless your target has a slow multiplier or the relevant sums/differences are needed already anyway, but who knows...)
> 
> Matthijs
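
The three-multiply variant quoted above can be checked against the four-multiply form with a small sketch like the following. The names cmul4 and cmul3 are hypothetical, chosen only for this comparison, and the results are built with the standard I macro instead of the complex compound literal.

```c
#include <complex.h>

/* Hypothetical name: textbook form, four multiplies and two adds. */
static inline float _Complex cmul4(float ar, float ai, float br, float bi)
{
    return (ar * br - ai * bi) + (ar * bi + ai * br) * I;
}

/* Hypothetical name: the quoted variant, three multiplies and five adds.
 * Expanding the terms gives the same real and imaginary parts:
 *   br*(ar - ai) + ai*(br - bi) = ar*br - ai*bi
 *   bi*(ar + ai) + ai*(br - bi) = ar*bi + ai*br */
static inline float _Complex cmul3(float ar, float ai, float br, float bi)
{
    const float t = ai * (br - bi);
    return (br * (ar - ai) + t) + (bi * (ar + ai) + t) * I;
}
```

For (1+2i)*(3+4i) both forms give -5 + 10i exactly, because every intermediate value is a small integer. For general inputs the two can round differently, so they should be compared with a tolerance rather than with ==.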