[cfe-dev] Clang ignoring --fast-math for complex division, serious performance hit

Mon Nov 6 00:20:16 PST 2017

> On Nov 6, 2017, at 2:47 AM, Richard Campbell <rlcamp.pdx at gmail.com> wrote:
>> On Nov 5, 2017, at 23:06, Steve Canon <scanon at apple.com> wrote:
>>> On Nov 5, 2017, at 22:48, John McCall <rjmccall at apple.com> wrote:
>>>>> On Nov 6, 2017, at 12:43 AM, Richard Campbell <rlcamp.pdx at gmail.com> wrote:
>>>>>> On Nov 5, 2017, at 9:28 PM, John McCall <rjmccall at apple.com> wrote:
>>>>>> 
>>>>>> On Nov 5, 2017, at 12:43 PM, Richard Campbell via cfe-dev <cfe-dev at lists.llvm.org> wrote:
>>>>>> Similarly to a problem that occurred two years ago with multiplication (http://lists.llvm.org/pipermail/cfe-dev/2015-July/043792.html), production Clang (Apple LLVM version 9.0.0 (clang-900.0.38)) is now issuing __divsc3 function calls anywhere complex division occurs, irrespective of the -ffast-math setting. This can cause a single division (which should be one or at most two divss instructions, with minimal performance impact) to slow down a performance-critical section of code (single-precision complex tridiagonal matrix solving) significantly. I’m not sure how long this has been the case - for some time now, in my critical loop, I instead call an inline function which explicitly does the single divss required. When I take out this hack, the speed of the entire application is cut in half.
>>>>>> 
>>>>>> Can we fix this (again) and maybe add a code comment wherever it needs to be such that it doesn’t keep getting broken after a couple years?
>>>>> 
>>>>> These don't sound like the same bug; they're just the same overall problem observed in different patterns of source code.  The best way to prevent regressions in cases like this is to add some tests that the source patterns generate the desired output.
>>>>> 
>>>>> I can't imagine how the general case of complex division in general could possibly be implemented in "one or two divss instructions", so I suspect you're talking about the special case of dividing a complex number by a scalar (either real or imaginary).  I wouldn't be that surprised if attempts got broken if we didn't have a good, exhaustive set of tests for it.
>>>> 
>>>> __attribute__((always_inline)) static inline float _Complex __divsc3(const float ar, const float ai, const float br, const float bi) {
>>>> const float one_over_denominator = 1.0f / (br * br + bi * bi);
>>>> return (float _Complex){ (ar * br + ai * bi) * one_over_denominator, (ai * br - ar * bi) * one_over_denominator };
>>>> }
>>>> 
>>>> This contains a single real divide, and is general. I proposed it two years ago for -ffast-math/-freciprocal-math but didn’t get any takers. Clang and gcc both normally (until this regression) implement a complex divide using two divide instructions.
>>> 
>>> I see.  You're saying that, under -ffast-math, we can implement a full complex division with one reciprocal plus a bunch of other operations, not that we can implement it with just one division.  I believe you're correct that that's acceptable under the usual assumptions of -ffast-math, as well as that it's likely to generally be faster to do that, but I'm not an expert; I've CC'ed a few people who are.  One of those people, Chandler, happens to be the person who last changed this (with r219557 in October 2014).  If they sign off on the idea, I'd be happy to accept a patch making Clang emit an inlined version under -ffast-math using a single reciprocal.
>> 
>> Yes, this is fine for fast-math.

> The much bigger issue is not on division or two, but rather zero function calls or one. The function call overhead, and the resulting inability to make any other refactoring optimisations, far outweighs the choice of instructions used.

By "refactoring optimizations", do you mean reordering and potentially CSE'ing the component arithmetic with operations outside of the division, or do you mean the compiler-barrier costs of emitting an opaque function call in the frontend instead of something that can be CSE'ed / reordered itself?  Because the latter is a problem that can be fixed for non-fast-math arithmetic as well.

My general impression is that there is a lot of low-hanging fruit in optimizing complex math in LLVM for one simple reason: it's not widely used, so it's an accordingly low priority for most of our current contributors.  If this is something that interests you, we'd be very open to contributions.

John.