[cfe-dev] Clang ignoring --fast-math for complex division, serious performance hit

Mon Nov 6 10:18:04 PST 2017

[+Alex]

On 11/06/2017 11:59 AM, John McCall wrote:
>> On Nov 6, 2017, at 12:21 PM, Richard Campbell <rlcamp.pdx at gmail.com> wrote:
>>> On Nov 6, 2017, at 12:20 AM, John McCall <rjmccall at apple.com> wrote:
>>>
>>>> On Nov 6, 2017, at 2:47 AM, Richard Campbell <rlcamp.pdx at gmail.com> wrote:
>>>> The much bigger issue is not one division or two, but rather zero function calls or one. The function call overhead, and the resulting inability to make any other refactoring optimisations, far outweighs the choice of instructions used.
>>> By "refactoring optimizations", do you mean reordering and potentially CSE'ing the component arithmetic with operations outside of the division, or do you mean the compiler-barrier costs of emitting an opaque function call in the frontend instead of something that can be CSE'ed / reordered itself?  Because the latter is a problem that can be fixed for non-fast-math arithmetic as well.
>>>
>>> My general impression is that there is a lot of low-hanging fruit in optimizing complex math in LLVM for one simple reason: it's not widely used, so it's an accordingly low priority for most of our current contributors.  If this is something that interests you, we'd be very open to contributions.
>>>
>>>
>>> John.
>> I suppose I mean both of those optimisations, although I don’t know the actual breakdown of the performance hit of one vs the other vs just the fact of the function call. When one writes a critical inner loop that doesn’t contain any function calls, one should reasonably expect the compiler not to add them.
> Complex divide is a large, complicated operation when full precision and infinity-correctness are required.  We appreciate that you have performance constraints, but implementing it with an outlined function is not an unreasonable choice.
>
>> While there may be more low hanging fruit, I don’t want it to get in the way of fixing this. My main concern is that there not be noticeable regressions. This particular regression has the potential to result in certain calculations taking HOURS longer than expected, if I hadn’t been hacking my way around it already. I would greatly prefer to write simple maintainable code and let the compiler do the right thing on the hardware of today and tomorrow.
> Richard, let me be clear about your options here.  If you're interested in working on this, that would be great, and I'd be happy to review your patches.  If you're not interested in working on this, then you should file a bug and hope that someone else has the motivation to pick it up.

I'd like to add that Alex L. looked at this in some detail in 2013. For 
some relevant notes, see PR17248 (and how divide is handled in 
https://github.com/hyp/flang/blob/master/lib/CodeGen/CGExprComplex.cpp). 
There are indeed more- and less-numerically-stable ways to implement 
complex division. For an extended discussion, I recommend looking at 
https://arxiv.org/pdf/1210.4539v2.pdf -- There are certainly versions 
that are reasonable to inline, especially in the fast-math context, and 
I support doing so. Alex found that we had to use Smith's algorithm in 
order to pass LAPACK's regression tests.
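
To make the trade-off concrete, here is a minimal sketch in C of the
textbook formula next to Smith's algorithm. It is illustrative only and
is not the code from PR17248 or from flang. The names cplx, div_naive,
div_smith and div_self_check are hypothetical; a plain struct stands in
for the C complex type so that the sketch does not depend on
<complex.h>. Neither routine handles the infinity/NaN special cases
that full C semantics require, and the self-check assumes the file is
compiled without fast-math.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical stand-in for a C complex value, used so the sketch
   does not depend on <complex.h>. */
typedef struct { double re, im; } cplx;

/* Textbook formula:
   (a+bi)/(c+di) = ((ac+bd) + (bc-ad)i) / (c*c + d*d).
   Short and easy to inline, but c*c + d*d can overflow or underflow
   even when the true quotient is representable. */
static cplx div_naive(cplx x, cplx y)
{
    double denom = y.re * y.re + y.im * y.im;
    cplx q = { (x.re * y.re + x.im * y.im) / denom,
               (x.im * y.re - x.re * y.im) / denom };
    return q;
}

/* Smith's algorithm: scale by the larger-magnitude component of the
   divisor, so the ratio r satisfies |r| <= 1 and the squares of the
   divisor's components are never formed. */
static cplx div_smith(cplx x, cplx y)
{
    cplx q;
    if (fabs(y.re) >= fabs(y.im)) {
        double r = y.im / y.re;
        double denom = y.re + y.im * r;
        q.re = (x.re + x.im * r) / denom;
        q.im = (x.im - x.re * r) / denom;
    } else {
        double r = y.re / y.im;
        double denom = y.re * r + y.im;
        q.re = (x.re * r + x.im) / denom;
        q.im = (x.im * r - x.re) / denom;
    }
    return q;
}

/* Self-check: both routines agree on a benign input, but only Smith's
   algorithm survives a divisor whose squared magnitude overflows. */
static void div_self_check(void)
{
    cplx x = { 1.0, 2.0 }, y = { 3.0, 4.0 };
    cplx n = div_naive(x, y), s = div_smith(x, y);
    /* (1+2i)/(3+4i) = 0.44 + 0.08i */
    assert(fabs(n.re - 0.44) < 1e-12 && fabs(n.im - 0.08) < 1e-12);
    assert(fabs(s.re - 0.44) < 1e-12 && fabs(s.im - 0.08) < 1e-12);

    cplx big = { 1e200, 1e200 };
    s = div_smith(big, big);
    assert(s.re == 1.0 && s.im == 0.0);
    n = div_naive(big, big);
    assert(n.re != 1.0);  /* the denominator overflowed to infinity */
}
```

Both routines are branch-light and free of function calls, so either
could plausibly be emitted inline; the difference is that Smith's
version costs one extra division and a compare in exchange for a much
wider range of safe inputs.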

  -Hal

>
> John.

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory