[cfe-dev] Clang ignoring --fast-math for complex division, serious performance hit

Thu Nov 9 16:05:07 PST 2017

On 11/06/2017 12:29 PM, Hal Finkel via cfe-dev wrote:
> (actually cc'ing Alex this time)
>
> On 11/06/2017 12:18 PM, Hal Finkel via cfe-dev wrote:
>> [+Alex]
>>
>> On 11/06/2017 11:59 AM, John McCall wrote:
>>>> On Nov 6, 2017, at 12:21 PM, Richard Campbell 
>>>> <rlcamp.pdx at gmail.com> wrote:
>>>>> On Nov 6, 2017, at 12:20 AM, John McCall <rjmccall at apple.com> wrote:
>>>>>
>>>>>> On Nov 6, 2017, at 2:47 AM, Richard Campbell 
>>>>>> <rlcamp.pdx at gmail.com> wrote:
>>>>>> The much bigger issue is not one division or two, but rather zero 
>>>>>> function calls or one. The function call overhead, and the 
>>>>>> resulting inability to make any other refactoring optimisations, 
>>>>>> far outweighs the choice of instructions used.
>>>>> By "refactoring optimizations", do you mean reordering and 
>>>>> potentially CSE'ing the component arithmetic with operations 
>>>>> outside of the division, or do you mean the compiler-barrier costs 
>>>>> of emitting an opaque function call in the frontend instead of 
>>>>> something that can be CSE'ed / reordered itself? Because the 
>>>>> latter is a problem that can be fixed for non-fast-math arithmetic 
>>>>> as well.
>>>>>
>>>>> My general impression is that there is a lot of low-hanging fruit 
>>>>> in optimizing complex math in LLVM for one simple reason: it's not 
>>>>> widely used, so it's an accordingly low priority for most of our 
>>>>> current contributors.  If this is something that interests you, 
>>>>> we'd be very open to contributions.
>>>>>
>>>>>
>>>>> John.
>>>> I suppose I mean both of those optimisations, although I don’t know 
>>>> the actual breakdown of the performance hit of one vs the other vs 
>>>> just the fact of the function call. When one writes a critical 
>>>> inner loop that doesn’t contain any function calls, one should 
>>>> reasonably expect the compiler not to add them.
>>> Complex divide is a large, complicated operation when full precision 
>>> and infinity-correctness are required.  We appreciate that you have 
>>> performance constraints, but implementing it with an outlined 
>>> function is not an unreasonable choice.
>>>
>>>> While there may be more low hanging fruit, I don’t want it to get 
>>>> in the way of fixing this. My main concern is that there not be 
>>>> noticeable regressions. This particular regression has the 
>>>> potential to result in certain calculations taking HOURS longer 
>>>> than expected, if I hadn’t been hacking my way around it already. I 
>>>> would greatly prefer to write simple maintainable code and let the 
>>>> compiler do the right thing on the hardware of today and tomorrow.
>>> Richard, let me be clear about your options here.  If you're 
>>> interested in working on this, that would be great, and I'd be happy 
>>> to review your patches.  If you're not interested in working on 
>>> this, then you should file a bug and hope that someone else has the 
>>> motivation to pick it up.
>>
>> I'd like to add that Alex L. looked at this in some detail in 2013. 
>> For some relevant notes, see PR17248 (and how divide is handled in 
>> https://github.com/hyp/flang/blob/master/lib/CodeGen/CGExprComplex.cpp). 
>> There are indeed more- and less-numerically-stable ways to implement 
>> complex division. For an extended discussion, I recommend looking at 
>> https://arxiv.org/pdf/1210.4539v2.pdf -- There are certainly versions 
>> that are reasonable to inline, especially in the fast-math context, 
>> and I support doing so. Alex found that we had to use Smith's 
>> algorithm in order to pass LAPACK's regression tests.

One more thing: we can use the cheaper (but less numerically stable) 
formula when #pragma STDC CX_LIMITED_RANGE ON is in effect.
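
To make the trade-off concrete, here is a minimal sketch of the two 
formulas in plain C. The cplx struct and the names div_naive and 
div_smith are hypothetical, chosen for illustration only; this is not 
the code Clang or compiler-rt actually uses, and neither version 
handles the infinity/NaN cases that a fully conforming complex 
division has to get right.

```c
#include <math.h>

/* Hypothetical helper type for this sketch; real code would use
   double _Complex. */
typedef struct { double re, im; } cplx;

/* Textbook formula:
     (a+bi)/(c+di) = ((ac+bd) + (bc-ad)i) / (c^2 + d^2)
   Cheap and easy to inline, but c*c + d*d can overflow or underflow
   even when the true quotient is representable. */
static cplx div_naive(cplx x, cplx y) {
    double denom = y.re * y.re + y.im * y.im;
    cplx q = { (x.re * y.re + x.im * y.im) / denom,
               (x.im * y.re - x.re * y.im) / denom };
    return q;
}

/* Smith's algorithm: divide through by the larger-magnitude component
   of the divisor, so the intermediate terms stay near the scale of
   the operands instead of their squares. */
static cplx div_smith(cplx x, cplx y) {
    cplx q;
    if (fabs(y.re) >= fabs(y.im)) {
        double r = y.im / y.re;           /* |r| <= 1 */
        double denom = y.re + y.im * r;
        q.re = (x.re + x.im * r) / denom;
        q.im = (x.im - x.re * r) / denom;
    } else {
        double r = y.re / y.im;           /* |r| < 1 */
        double denom = y.re * r + y.im;
        q.re = (x.re * r + x.im) / denom;
        q.im = (x.im * r - x.re) / denom;
    }
    return q;
}
```

With ordinary IEEE 754 double arithmetic, dividing 1e200 + 1e200i by 
itself shows the difference: div_naive squares the divisor components, 
overflows, and ends up computing inf/inf, while div_smith returns 
1 + 0i. Both are short and branch-light enough to be reasonable 
inlining candidates; the cost of Smith's version is one extra division 
and a compare.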

  -Hal

>>
>>  -Hal
>>
>>>
>>> John.
>>
>

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory