[llvm-bugs] [Bug 31602] New: [X86] float/double -> unsigned long conversion slow when inputs are predictable

Tue Jan 10 18:54:44 PST 2017

https://llvm.org/bugs/show_bug.cgi?id=31602

            Bug ID: 31602
           Summary: [X86] float/double -> unsigned long conversion slow
                    when inputs are predictable
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P
         Component: Backend: X86
          Assignee: unassignedbugs at nondot.org
          Reporter: mkuper at google.com
                CC: llvm-bugs at lists.llvm.org
    Classification: Unclassified

SSE and AVX (up until AVX512) don't have instructions for converting from FP
(either float or double) to unsigned long. So these conversions have to be
emulated using FP -> signed long conversions.

GCC lowers this:
unsigned long foo(double x) {
  return x;
}

as:
foo(double):
        movsd   .LC0(%rip), %xmm1
        ucomisd %xmm1, %xmm0
        jnb     .L2
        cvttsd2siq      %xmm0, %rax
        ret
.L2:
        subsd   %xmm1, %xmm0
        movabsq $-9223372036854775808, %rdx
        cvttsd2siq      %xmm0, %rax
        xorq    %rdx, %rax
        ret
.LC0:
        .long   0
        .long   1138753536

That is - check whether the value fits in the signed range (below 2^63), and if
not, force it into range by subtracting 2^63, convert, and correct the result by
flipping the top bit.
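
For readers who prefer C to assembly, the branchy sequence can be sketched
roughly as follows. This is only an illustration of the logic, not what either
compiler emits; the function name fptoui_branchy is made up for this note, and
the sketch assumes the input is a valid value in [0, 2^64) (anything else is
undefined in C anyway).

```c
#include <stdint.h>

/* Hypothetical name. Mirrors the branchy lowering: only one
   FP -> signed conversion is executed on either path. */
static uint64_t fptoui_branchy(double x) {
    const double two63 = 9223372036854775808.0; /* 2^63, exact in double */
    if (x >= two63) {
        /* Out of signed range: shift down by 2^63, convert, then
           restore the top bit with an xor (the .L2 path). */
        return (uint64_t)(int64_t)(x - two63) ^ 0x8000000000000000ULL;
    }
    /* Fits in the signed range: a single direct conversion. */
    return (uint64_t)(int64_t)x;
}
```

When every input is "small", only the direct-conversion path ever runs, which
is the case the branch predictor handles well.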

What we do, on the other hand, is:

.LCPI0_0:
        .quad   4890909195324358656     # double 9.2233720368547758E+18
foo(double):
        movsd   .LCPI0_0(%rip), %xmm1
        movapd  %xmm0, %xmm2
        subsd   %xmm1, %xmm2
        cvttsd2si       %xmm2, %rax
        movabsq $-9223372036854775808, %rcx # imm = 0x8000000000000000
        xorq    %rax, %rcx
        cvttsd2si       %xmm0, %rax
        ucomisd %xmm1, %xmm0
        cmovaeq %rcx, %rax
        retq

Which is basically an if-converted version of the GCC code.
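
A rough C model of the if-converted sequence, for comparison. Converting an
out-of-range double to a signed integer is undefined in portable C, so the
sketch uses a made-up helper, cvttsd2si_model, that imitates the instruction's
documented behaviour of returning 0x8000000000000000 for out-of-range or NaN
inputs. Both names are hypothetical, and inputs are assumed to be in
[0, 2^64).

```c
#include <stdint.h>

/* Hypothetical model of cvttsd2si: truncating conversion that yields the
   "integer indefinite" value 0x8000000000000000 when the input is NaN or
   does not fit in a signed 64-bit integer. */
static int64_t cvttsd2si_model(double v) {
    if (!(v >= -9223372036854775808.0 && v < 9223372036854775808.0))
        return INT64_MIN;
    return (int64_t)v;
}

/* Hypothetical name. Both conversions always run; the compare only
   selects which result to keep (the cmovae in the assembly). */
static uint64_t fptoui_select(double x) {
    const double two63 = 9223372036854775808.0; /* 2^63, exact in double */
    uint64_t adjusted =
        (uint64_t)cvttsd2si_model(x - two63) ^ 0x8000000000000000ULL;
    uint64_t direct = (uint64_t)cvttsd2si_model(x);
    return (x >= two63) ? adjusted : direct;
}
```

The two conversions are independent of each other but both sit on the path to
the result, so their latency is paid even when the input is small.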

Since cvttsd2si has a fairly long latency, the GCC version is much faster when
the branch is well-predicted, and slower when it's not.
But it seems like in most cases this branch should be well-predicted - e.g. if
all inputs are "small", and actually fit into the signed range.

A few additional notes:

1) Our current version is problematic in the presence of FP exceptions, see
PR17686.

2) I tried playing around with selecting on the input instead of the output,
but that doesn't really improve the situation, since we then need to adjust the
sign bit of the output of one of the converts.
There are two options here - (1) adjusting and selecting again between the
original and the adjusted version, or (2) fudging the adjustment so that it's a
nop for the right convert. ICC generates code which is basically (2). This
avoids the problem in PR17686, but both options appear to be even slower than
what we have now.
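
One possible reading of option (2), sketched in C. This is not ICC's actual
output, only an illustration of the idea: the selection is done on the input
by choosing a bias of either 2^63 or 0, so the subtract and the xor become
nops for values that already fit. The name fptoui_masked is made up, and
inputs are assumed to be in [0, 2^64).

```c
#include <stdint.h>

/* Hypothetical name. A single conversion whose input is always in the
   signed range, so no out-of-range convert is ever executed. */
static uint64_t fptoui_masked(double x) {
    const double two63 = 9223372036854775808.0; /* 2^63, exact in double */
    int big = (x >= two63);                     /* 1 if out of signed range */
    double bias = (double)big * two63;          /* 2^63 or 0.0 */
    uint64_t fix = (uint64_t)big << 63;         /* top bit or 0 */
    return (uint64_t)(int64_t)(x - bias) ^ fix;
}
```

The compare, the bias selection, the subtract and the convert all form one
dependency chain here, which is consistent with this variant not being faster
than the current code.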
