[cfe-dev] Clang ignoring --fast-math for complex division, serious performance hit
Hal Finkel via cfe-dev
cfe-dev at lists.llvm.org
Thu Nov 9 16:05:07 PST 2017
On 11/06/2017 12:29 PM, Hal Finkel via cfe-dev wrote:
> (actually cc'ing Alex this time)
>
> On 11/06/2017 12:18 PM, Hal Finkel via cfe-dev wrote:
>> [+Alex]
>>
>> On 11/06/2017 11:59 AM, John McCall wrote:
>>>> On Nov 6, 2017, at 12:21 PM, Richard Campbell
>>>> <rlcamp.pdx at gmail.com> wrote:
>>>>> On Nov 6, 2017, at 12:20 AM, John McCall <rjmccall at apple.com> wrote:
>>>>>
>>>>>> On Nov 6, 2017, at 2:47 AM, Richard Campbell
>>>>>> <rlcamp.pdx at gmail.com> wrote:
>>>>>> The much bigger issue is not one division or two, but rather zero
>>>>>> function calls or one. The function call overhead, and the
>>>>>> resulting inability to make any other refactoring optimisations,
>>>>>> far outweighs the choice of instructions used.
>>>>> By "refactoring optimizations", do you mean reordering and
>>>>> potentially CSE'ing the component arithmetic with operations
>>>>> outside of the division, or do you mean the compiler-barrier costs
>>>>> of emitting an opaque function call in the frontend instead of
>>>>> something that can be CSE'ed / reordered itself? Because the
>>>>> latter is a problem that can be fixed for non-fast-math arithmetic
>>>>> as well.
>>>>>
>>>>> My general impression is that there is a lot of low-hanging fruit
>>>>> in optimizing complex math in LLVM for one simple reason: it's not
>>>>> widely used, so it's an accordingly low priority for most of our
>>>>> current contributors. If this is something that interests you,
>>>>> we'd be very open to contributions.
>>>>>
>>>>>
>>>>> John.
>>>> I suppose I mean both of those optimisations, although I don’t know
>>>> the actual breakdown of the performance hit of one vs the other vs
>>>> just the fact of the function call. When one writes a critical
>>>> inner loop that doesn’t contain any function calls, one should
>>>> reasonably expect the compiler not to add them.
>>> Complex divide is a large, complicated operation when full precision
>>> and infinity-correctness are required. We appreciate that you have
>>> performance constraints, but implementing it with an outlined
>>> function is not an unreasonable choice.
>>>
>>>> While there may be more low hanging fruit, I don’t want it to get
>>>> in the way of fixing this. My main concern is that there not be
>>>> noticeable regressions. This particular regression has the
>>>> potential to result in certain calculations taking HOURS longer
>>>> than expected, if I hadn’t been hacking my way around it already. I
>>>> would greatly prefer to write simple maintainable code and let the
>>>> compiler do the right thing on the hardware of today and tomorrow.
>>> Richard, let me be clear about your options here. If you're
>>> interested in working on this, that would be great, and I'd be happy
>>> to review your patches. If you're not interested in working on
>>> this, then you should file a bug and hope that someone else has the
>>> motivation to pick it up.
>>
>> I'd like to add that Alex L. looked at this in some detail in 2013.
>> For some relevant notes, see PR17248 (and how divide is handled in
>> https://github.com/hyp/flang/blob/master/lib/CodeGen/CGExprComplex.cpp).
>> There are indeed more- and less-numerically-stable ways to implement
>> complex division. For an extended discussion, I recommend looking at
>> https://arxiv.org/pdf/1210.4539v2.pdf -- There are certainly versions
>> that are reasonable to inline, especially in the fast-math context,
>> and I support doing so. Alex found that we had to use Smith's
>> algorithm in order to pass LAPACK's regression tests.
One more thing: we can use the cheaper (but less numerically stable)
formula when #pragma STDC CX_LIMITED_RANGE ON is in effect.
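
To make the trade-off concrete, here is a rough sketch of the two
formulas under discussion. It is not Clang's actual codegen: the cplx
struct and the div_naive / div_smith / check_division_sketch names are
made up for illustration, and it ignores the NaN/infinity recovery that
a fully conforming implementation also has to perform. The naive form
is the kind of "usual mathematical formula" that CX_LIMITED_RANGE
permits; Smith's form avoids squaring the divisor's components.

```c
#include <assert.h>  /* only for the self-check at the bottom */
#include <math.h>

/* Hypothetical helper type for this sketch; real code would use
   C99 `double _Complex` from <complex.h>. */
typedef struct { double re, im; } cplx;

/* Textbook formula: (a+bi)/(c+di) = ((ac+bd) + (bc-ad)i) / (c^2+d^2).
   Cheap and easy to inline, but c*c + d*d can overflow or underflow
   even when the true quotient is representable. */
static cplx div_naive(cplx x, cplx y) {
    double denom = y.re * y.re + y.im * y.im;
    cplx q = { (x.re * y.re + x.im * y.im) / denom,
               (x.im * y.re - x.re * y.im) / denom };
    return q;
}

/* Smith's algorithm: scale by the ratio of the smaller to the larger
   component of the divisor, so c^2 + d^2 is never formed directly. */
static cplx div_smith(cplx x, cplx y) {
    cplx q;
    if (fabs(y.re) >= fabs(y.im)) {
        double r = y.im / y.re;
        double denom = y.re + y.im * r;
        q.re = (x.re + x.im * r) / denom;
        q.im = (x.im - x.re * r) / denom;
    } else {
        double r = y.re / y.im;
        double denom = y.re * r + y.im;
        q.re = (x.re * r + x.im) / denom;
        q.im = (x.im * r - x.re) / denom;
    }
    return q;
}

/* Self-check on a well-scaled input where both forms should agree. */
static void check_division_sketch(void) {
    cplx x = {1.0, 2.0}, y = {3.0, 4.0};
    cplx n = div_naive(x, y), s = div_smith(x, y);
    assert(fabs(n.re - s.re) < 1e-12 && fabs(n.im - s.im) < 1e-12);
}
```

Both versions are branch-light and call-free, so either could be
emitted inline. The difference shows up with operands near the ends of
the exponent range, where the naive denominator can overflow while
Smith's version still returns a finite quotient.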
-Hal
>>
>> -Hal
>>
>>>
>>> John.
>>
>
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory