[cfe-dev] food for optimizer developers
Chris Lattner
clattner at apple.com
Tue Aug 10 08:42:37 PDT 2010
On Aug 10, 2010, at 3:59 AM, Robert Purves wrote:
>
>> I wrote a Fortran to C++ conversion program that I used to convert selected
>> LAPACK sources. Comparing runtimes with different compilers I get:
>>
>>                                absolute  relative
>> ifort 11.1.072                   1.790s      1.00
>> gfortran 4.4.4                   2.470s      1.38
>> g++ 4.4.4                        2.922s      1.63
>> clang++ 2.8 (trunk 108205)       6.487s      3.62
>
>> - Why is the code generated by clang++ so much slower than the g++ code?
>
> A "hot spot" in your benchmark dsyev_test.cpp is this loop in dlasr()
>
> FEM_DO(i, 1, m) {
> temp = a(i, j + 1);
> a(i, j + 1) = ctemp * temp - stemp * a(i, j);
> a(i, j) = stemp * temp + ctemp * a(i, j);
> }
Please file a bug with the reduced .cpp testcase. My wild guess is that this is a failure because we don't have TBAA yet, which isn't being worked on. What flags are you passing to the compiler? Anything like -ffast-math? Note that ifort defaults to "fast and loose" numerics iirc.
-Chris
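
To make the TBAA guess concrete, here is a minimal sketch of the kind of reload that type-based alias analysis is meant to remove. The struct `Extent` and both function names are hypothetical and are not taken from the benchmark; the point is only that a loop bound stored behind a pointer may have to be re-read after every `double` store unless the compiler can assume that a `double` store cannot modify a `long`.

```cpp
#include <cassert>

// Hypothetical bookkeeping struct, standing in for the bounds that an array
// wrapper keeps in memory. Not taken from the benchmark.
struct Extent {
  long count;  // number of elements to process
};

// The loop bound lives behind a pointer. Unless the compiler can prove that
// the stores to out[i] never modify ext->count (type-based alias analysis
// permits that assumption, because double and long are different types),
// it may have to re-read ext->count on every iteration.
void scale_reloading(double* out, const Extent* ext, double factor) {
  assert(out != nullptr && ext != nullptr);
  for (long i = 0; i < ext->count; ++i) {
    out[i] *= factor;
  }
}

// Manual workaround: copy the bound into a local before the loop, so no
// alias reasoning is needed to keep it in a register.
void scale_hoisted(double* out, const Extent* ext, double factor) {
  assert(out != nullptr && ext != nullptr);
  const long n = ext->count;
  for (long i = 0; i < n; ++i) {
    out[i] *= factor;
  }
}
```

Both functions compute the same result; they differ only in how much the optimizer has to prove before it can keep the bound out of the loop body.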
>
> For the loop body, g++ (4.2) emits unsurprising code.
> loop:
> movsd (%rcx), %xmm2
> movapd %xmm3, %xmm0
> mulsd %xmm2, %xmm0
> movapd %xmm4, %xmm1
> mulsd (%rax), %xmm1
> subsd %xmm1, %xmm0
> movsd %xmm0, (%rcx)
> movapd %xmm3, %xmm0
> mulsd (%rax), %xmm0
> mulsd %xmm4, %xmm2
> addsd %xmm2, %xmm0
> movsd %xmm0, (%rax)
> incl %esi
> addq $8, %rcx
> addq $8, %rax
> cmpl %esi, +0(%r13)
> jge loop
>
> clang++ (2.8) misses major optimizations when accessing the 'a' array, and makes no fewer than three laborious address calculations.
> loop:
> movq %rax, %rdi
> subq %rdx, %rdi
> imulq %r14, %rdi
> subq %rcx, %rdi
> addq %rsi, %rdi
> movq +0(%r13), %r8
> movsd (%r8, %rdi, 8), %xmm3
> mulsd %xmm1, %xmm3
> movq %rbx, %rdi
> subq %rdx, %rdi
> imulq %r14, %rdi
> subq %rcx, %rdi
> addq %rsi, %rdi
> movsd (%r8, %rdi, 8), %xmm4
> movapd %xmm2, %xmm5
> mulsd %xmm4, %xmm5
> subsd %xmm3, %xmm5
> movsd %xmm5, (%r8, %rdi, 8)
> movq +32(%r13), %rdx
> movq %rax, %rdi
> subq %rdx, %rdi
> movq +0(%r13), %r8
> movq +8(%r13), %r14
> imulq %r14, %rdi
> movq +24(%r13), %rcx
> subq %rcx, %rdi
> addq %rsi, %rdi
> movsd (%r8, %rdi, 8), %xmm3
> mulsd %xmm2, %xmm3
> mulsd %xmm1, %xmm4
> addsd %xmm3, %xmm4
> movsd %xmm4, (%r8, %rdi, 8)
> incq %rsi
> cmpl (%r15), %esi
> jle loop
>
> Presumably clang++, in its present state of development, is not smart enough to notice the underlying simple sequential access pattern when the array is declared as
> arr_ref<double, 2> a
>
> I think clang has no trouble optimizing properly for arrays like this:
> double a[800][800];
>
> Robert P.
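
The sequential access pattern described above can be sketched by hand. The `View2` type below is a hypothetical stand-in for a column-major, Fortran-style array reference; it is not the actual `arr_ref` from the conversion tool, whose layout is not shown in this thread. `rotate_indexed` mirrors the quoted loop with a full index calculation per access, while `rotate_hoisted` computes the two column pointers once and then walks them, which is roughly the form the g++ listing corresponds to.

```cpp
#include <cassert>
#include <vector>

// Hypothetical column-major 2-D view with Fortran-style lower bounds.
// An assumption for illustration only, not the real arr_ref.
struct View2 {
  double* data;  // base pointer to the storage
  long rows;     // leading dimension (elements per column)
  long lo1;      // lower bound of the first index
  long lo2;      // lower bound of the second index

  double& operator()(long i, long j) const {
    return data[(j - lo2) * rows + (i - lo1)];
  }
};

// Same shape as the quoted loop: every a(i, j) repeats the index arithmetic
// and reads the view's fields from memory.
void rotate_indexed(const View2& a, long j, long m, double ctemp, double stemp) {
  for (long i = 1; i <= m; ++i) {
    const double temp = a(i, j + 1);
    a(i, j + 1) = ctemp * temp - stemp * a(i, j);
    a(i, j) = stemp * temp + ctemp * a(i, j);
  }
}

// Hand-hoisted form: the addresses of a(1, j) and a(1, j + 1) are computed
// once, and the loop only steps through two contiguous columns.
void rotate_hoisted(const View2& a, long j, long m, double ctemp, double stemp) {
  if (m < 1) return;  // nothing to rotate; avoids addressing an empty range
  double* col_j = &a(1, j);
  double* col_j1 = &a(1, j + 1);
  for (long k = 0; k < m; ++k) {
    const double temp = col_j1[k];
    const double old = col_j[k];
    col_j1[k] = ctemp * temp - stemp * old;
    col_j[k] = stemp * temp + ctemp * old;
  }
}
```

Comparing the code a compiler generates for the two functions is one way to check whether it recovers the hoisted form on its own.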