[cfe-dev] Clang ignoring --fast-math for complex division, serious performance hit

Mon Nov 6 10:29:38 PST 2017

(actually cc'ing Alex this time)

On 11/06/2017 12:18 PM, Hal Finkel via cfe-dev wrote:
> [+Alex]
>
> On 11/06/2017 11:59 AM, John McCall wrote:
>>> On Nov 6, 2017, at 12:21 PM, Richard Campbell <rlcamp.pdx at gmail.com> 
>>> wrote:
>>>> On Nov 6, 2017, at 12:20 AM, John McCall <rjmccall at apple.com> wrote:
>>>>
>>>>> On Nov 6, 2017, at 2:47 AM, Richard Campbell 
>>>>> <rlcamp.pdx at gmail.com> wrote:
>>>>> The much bigger issue is not one division or two, but rather zero 
>>>>> function calls or one. The function call overhead, and the 
>>>>> resulting inability to make any other refactoring optimisations, 
>>>>> far outweighs the choice of instructions used.
>>>> By "refactoring optimizations", do you mean reordering and 
>>>> potentially CSE'ing the component arithmetic with operations 
>>>> outside of the division, or do you mean the compiler-barrier costs 
>>>> of emitting an opaque function call in the frontend instead of 
>>>> something that can be CSE'ed / reordered itself? Because the latter 
>>>> is a problem that can be fixed for non-fast-math arithmetic as well.
>>>>
>>>> My general impression is that there is a lot of low-hanging fruit 
>>>> in optimizing complex math in LLVM for one simple reason: it's not 
>>>> widely used, so it's an accordingly low priority for most of our 
>>>> current contributors.  If this is something that interests you, 
>>>> we'd be very open to contributions.
>>>>
>>>>
>>>> John.
>>> I suppose I mean both of those optimisations, although I don’t know 
>>> the actual breakdown of the performance hit of one vs the other vs 
>>> just the fact of the function call. When one writes a critical inner 
>>> loop that doesn’t contain any function calls, one should reasonably 
>>> expect the compiler not to add them.
>> Complex divide is a large, complicated operation when full precision 
>> and infinity-correctness are required.  We appreciate that you have 
>> performance constraints, but implementing it with an outlined 
>> function is not an unreasonable choice.
>>
>>> While there may be more low hanging fruit, I don’t want it to get in 
>>> the way of fixing this. My main concern is that there not be 
>>> noticeable regressions. This particular regression has the potential 
>>> to result in certain calculations taking HOURS longer than expected, 
>>> if I hadn’t been hacking my way around it already. I would greatly 
>>> prefer to write simple maintainable code and let the compiler do the 
>>> right thing on the hardware of today and tomorrow.
>> Richard, let me be clear about your options here.  If you're 
>> interested in working on this, that would be great, and I'd be happy 
>> to review your patches.  If you're not interested in working on this, 
>> then you should file a bug and hope that someone else has the 
>> motivation to pick it up.
>
> I'd like to add that Alex L. looked at this in some detail in 2013. 
> For some relevant notes, see PR17248 (and how divide is handled in 
> https://github.com/hyp/flang/blob/master/lib/CodeGen/CGExprComplex.cpp). 
> There are indeed more- and less-numerically-stable ways to implement 
> complex division. For an extended discussion, I recommend looking at 
> https://arxiv.org/pdf/1210.4539v2.pdf -- There are certainly versions 
> that are reasonable to inline, especially in the fast-math context, 
> and I support doing so. Alex found that we had to use Smith's 
> algorithm in order to pass LAPACK's regression tests.
>
>  -Hal
>
>>
>> John.
>
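
To make the trade-off above concrete, here is a minimal sketch in C of
the two approaches being compared: the textbook formula, which is cheap
to inline but can overflow in its intermediate products, and Smith's
algorithm, which scales by the larger component of the divisor first.
The function names div_naive and div_smith are hypothetical and chosen
only for illustration. This is not the code Clang emits, nor the code in
PR17248, and neither routine handles the infinity and NaN cases that the
full-precision library call covers.

```c
#include <complex.h>
#include <math.h>

/* Textbook formula:
 *   (a + bi) / (c + di) = ((ac + bd) + (bc - ad)i) / (c^2 + d^2)
 * Cheap and branch-free, but c*c + d*d can overflow or underflow even
 * when the true quotient is representable. */
static double complex div_naive(double complex x, double complex y) {
    double a = creal(x), b = cimag(x);
    double c = creal(y), d = cimag(y);
    double denom = c * c + d * d;
    double re = (a * c + b * d) / denom;
    double im = (b * c - a * d) / denom;
    /* C11's CMPLX(re, im) is preferable where available. */
    return re + im * I;
}

/* Smith's algorithm: divide through by the larger of |c| and |d| so
 * that the ratio r has magnitude at most 1, which avoids most of the
 * spurious overflow in the textbook formula. */
static double complex div_smith(double complex x, double complex y) {
    double a = creal(x), b = cimag(x);
    double c = creal(y), d = cimag(y);
    double re, im;
    if (fabs(c) >= fabs(d)) {
        double r = d / c;       /* |r| <= 1 */
        double t = c + d * r;   /* equals (c^2 + d^2) / c */
        re = (a + b * r) / t;
        im = (b - a * r) / t;
    } else {
        double r = c / d;       /* |r| < 1 */
        double t = c * r + d;   /* equals (c^2 + d^2) / d */
        re = (a * r + b) / t;
        im = (b * r - a) / t;
    }
    return re + im * I;
}
```

For ordinary inputs the two agree: dividing 1 + 2i by 3 + 4i gives
0.44 + 0.08i either way. Dividing 1e200 + 1e200i by itself shows the
difference: div_naive overflows in the denominator and returns a
non-finite result, while div_smith returns 1 + 0i. Smith's version costs
a branch and an extra division, which is the kind of price that may be
acceptable to inline in a fast-math context.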

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory