[llvm-bugs] [Bug 25277] New: Very high overhead when calling sin() with int64_t
via llvm-bugs
llvm-bugs at lists.llvm.org
Wed Oct 21 16:19:52 PDT 2015
https://llvm.org/bugs/show_bug.cgi?id=25277
Bug ID: 25277
Summary: Very high overhead when calling sin() with int64_t
Product: clang
Version: 3.7
Hardware: PC
OS: Linux
Status: NEW
Severity: normal
Priority: P
Component: LLVM Codegen
Assignee: unassignedclangbugs at nondot.org
Reporter: yyc1992 at gmail.com
CC: llvm-bugs at lists.llvm.org
Classification: Unclassified
In the following code, when compiled with `clang++ -O2` or `-Ofast`, test_i64
takes almost twice as long as test_f64 with similar arguments. (The `extern
"C"` etc. are there so it can be benchmarked from Julia. The inline asm makes
sure the compiler doesn't optimize the whole thing away at `-Ofast`; it doesn't
seem to have a performance impact otherwise.)
```
#include <math.h>
#include <stddef.h>
#include <stdint.h>

double __attribute__((noinline))
g(int64_t a)
{
    asm volatile("" ::: "memory");
    return sin(a);
}

double __attribute__((noinline))
g(double a)
{
    asm volatile("" ::: "memory");
    return sin(a);
}

template<typename T>
static inline void
f(T a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        g(a);
    }
}

extern "C" void
test_i64(int64_t a, size_t n)
{
    f(a, n);
}

extern "C" void
test_f64(double a, size_t n)
{
    f(a, n);
}
```
On my machine, each loop iteration takes ~9 ns for double and ~14 ns for
int64_t. GCC doesn't seem to have this issue.
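(For reference, the numbers above were measured from Julia. A minimal
standalone C++ driver along these lines should show the same gap; it is only a
sketch, assuming the reproducer above is compiled separately into e.g.
`repro.o` and linked in. The file names and the iteration count are made up.)
```
// Hypothetical standalone driver (not part of the original report): link it
// against the reproducer above, e.g.
//   clang++ -O2 -c repro.cpp && clang++ -O2 driver.cpp repro.o
#include <chrono>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

extern "C" void test_i64(int64_t a, size_t n);
extern "C" void test_f64(double a, size_t n);

int main()
{
    const size_t n = 100000000; // 1e8 calls per variant
    using clock = std::chrono::steady_clock;
    using ns = std::chrono::duration<double, std::nano>;

    auto t0 = clock::now();
    test_f64(1.0, n);
    auto t1 = clock::now();
    test_i64(1, n);
    auto t2 = clock::now();

    printf("double:  %.2f ns/call\n", ns(t1 - t0).count() / n);
    printf("int64_t: %.2f ns/call\n", ns(t2 - t1).count() / n);
    return 0;
}
```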
GCC asm:
```
00000000000007a0 <_Z1gl>:
7a0: 66 0f ef c0 pxor %xmm0,%xmm0
7a4: f2 48 0f 2a c7 cvtsi2sd %rdi,%xmm0
7a9: e9 e2 fe ff ff jmpq 690 <sin at plt>
7ae: 66 90 xchg %ax,%ax
00000000000007b0 <_Z1gd>:
7b0: e9 db fe ff ff jmpq 690 <sin at plt>
7b5: 90 nop
7b6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
7bd: 00 00 00
00000000000007c0 <test_i64>:
7c0: 48 85 f6 test %rsi,%rsi
7c3: 74 28 je 7ed <test_i64+0x2d>
7c5: 41 54 push %r12
7c7: 49 89 fc mov %rdi,%r12
7ca: 55 push %rbp
7cb: 48 89 f5 mov %rsi,%rbp
7ce: 53 push %rbx
7cf: 31 db xor %ebx,%ebx
7d1: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
7d8: 4c 89 e7 mov %r12,%rdi
7db: 48 83 c3 01 add $0x1,%rbx
7df: e8 9c fe ff ff callq 680 <_Z1gl at plt>
7e4: 48 39 dd cmp %rbx,%rbp
7e7: 75 ef jne 7d8 <test_i64+0x18>
7e9: 5b pop %rbx
7ea: 5d pop %rbp
7eb: 41 5c pop %r12
7ed: f3 c3 repz retq
7ef: 90 nop
00000000000007f0 <test_f64>:
7f0: 55 push %rbp
7f1: 53 push %rbx
7f2: 48 83 ec 18 sub $0x18,%rsp
7f6: 48 85 ff test %rdi,%rdi
7f9: f2 0f 11 44 24 08 movsd %xmm0,0x8(%rsp)
7ff: 74 23 je 824 <test_f64+0x34>
801: 48 89 fd mov %rdi,%rbp
804: 31 db xor %ebx,%ebx
806: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
80d: 00 00 00
810: f2 0f 10 44 24 08 movsd 0x8(%rsp),%xmm0
816: 48 83 c3 01 add $0x1,%rbx
81a: e8 51 fe ff ff callq 670 <_Z1gd at plt>
81f: 48 39 dd cmp %rbx,%rbp
822: 75 ec jne 810 <test_f64+0x20>
824: 48 83 c4 18 add $0x18,%rsp
828: 5b pop %rbx
829: 5d pop %rbp
82a: c3 retq
```
Clang asm:
```
0000000000000750 <_Z1gl>:
750: f2 48 0f 2a c7 cvtsi2sd %rdi,%xmm0
755: e9 e6 fe ff ff jmpq 640 <sin at plt>
75a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
0000000000000760 <_Z1gd>:
760: e9 db fe ff ff jmpq 640 <sin at plt>
765: 66 66 2e 0f 1f 84 00 data16 nopw %cs:0x0(%rax,%rax,1)
76c: 00 00 00 00
0000000000000770 <test_i64>:
770: 41 56 push %r14
772: 53 push %rbx
773: 50 push %rax
774: 48 89 f3 mov %rsi,%rbx
777: 49 89 fe mov %rdi,%r14
77a: 48 85 db test %rbx,%rbx
77d: 74 0e je 78d <test_i64+0x1d>
77f: 90 nop
780: 4c 89 f7 mov %r14,%rdi
783: e8 a8 fe ff ff callq 630 <_Z1gl at plt>
788: 48 ff cb dec %rbx
78b: 75 f3 jne 780 <test_i64+0x10>
78d: 48 83 c4 08 add $0x8,%rsp
791: 5b pop %rbx
792: 41 5e pop %r14
794: c3 retq
795: 66 66 2e 0f 1f 84 00 data16 nopw %cs:0x0(%rax,%rax,1)
79c: 00 00 00 00
00000000000007a0 <test_f64>:
7a0: 53 push %rbx
7a1: 48 83 ec 10 sub $0x10,%rsp
7a5: 48 89 fb mov %rdi,%rbx
7a8: f2 0f 11 44 24 08 movsd %xmm0,0x8(%rsp)
7ae: 48 85 db test %rbx,%rbx
7b1: 74 1d je 7d0 <test_f64+0x30>
7b3: 66 66 66 66 2e 0f 1f data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
7ba: 84 00 00 00 00 00
7c0: f2 0f 10 44 24 08 movsd 0x8(%rsp),%xmm0
7c6: e8 55 fe ff ff callq 620 <_Z1gd at plt>
7cb: 48 ff cb dec %rbx
7ce: 75 f0 jne 7c0 <test_f64+0x20>
7d0: 48 83 c4 10 add $0x10,%rsp
7d4: 5b pop %rbx
7d5: c3 retq
```
One noticeable difference is that GCC clears the xmm0 register (the `pxor`)
before doing the i64->f64 conversion; I'm not sure if that's the important
difference. A hypothetical micro-benchmark isolating the two conversion
sequences is sketched below.
(The same happens in a JIT environment as well, but it seems to be x86 (or
x86-64) specific...)