[cfe-dev] food for optimizer developers

Tue Aug 10 03:59:07 PDT 2010

> I wrote a Fortran to C++ conversion program that I used to convert selected
> LAPACK sources. Comparing runtimes with different compilers I get:
> 
>                          absolute  relative
> ifort 11.1.072             1.790s     1.00
> gfortran 4.4.4             2.470s     1.38
> g++ 4.4.4                  2.922s     1.63
> clang++ 2.8 (trunk 108205) 6.487s     3.62

> - Why is the code generated by clang++ so much slower than the g++ code?

A "hot spot" in your benchmark dsyev_test.cpp is this loop in dlasr():

  FEM_DO(i, 1, m) {
    temp = a(i, j + 1);
    a(i, j + 1) = ctemp * temp - stemp * a(i, j);
    a(i, j) = stemp * temp + ctemp * a(i, j);
  }

For the loop body, g++ (4.2) emits unsurprising code: it keeps one pointer per column (%rcx and %rax) and advances each by 8 bytes per iteration.
loop:				
movsd    (%rcx), %xmm2
movapd   %xmm3, %xmm0			
mulsd    %xmm2, %xmm0			
movapd   %xmm4, %xmm1			
mulsd    (%rax), %xmm1			
subsd    %xmm1, %xmm0			
movsd    %xmm0, (%rcx)			
movapd   %xmm3, %xmm0			
mulsd    (%rax), %xmm0			
mulsd    %xmm4, %xmm2			
addsd    %xmm2, %xmm0			
movsd    %xmm0, (%rax)			
incl     %esi			
addq     $8, %rcx			
addq     $8, %rax			
cmpl     %esi, +0(%r13)			
jge      loop	
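
In source terms, the g++ loop behaves roughly as if the two columns were walked with plain pointers. A hand-written sketch of that shape follows. The function name rotate_columns and its parameters are made up for illustration; they are not taken from dlasr() or from the converter's headers, and the sketch assumes each column is contiguous in memory.

```cpp
#include <cstddef>

// Hypothetical helper, not part of the benchmark: applies the same rotation
// to two contiguous columns of length m. col_j and col_j1 stand in for the
// addresses of a(1, j) and a(1, j + 1).
void rotate_columns(double* col_j, double* col_j1, std::size_t m,
                    double ctemp, double stemp)
{
  for (std::size_t i = 0; i < m; i++) {
    double temp = col_j1[i];
    col_j1[i] = ctemp * temp - stemp * col_j[i];
    col_j[i] = stemp * temp + ctemp * col_j[i];
  }
}
```

With the columns reduced to two pointers, the only per-iteration address work left is the pair of 8-byte increments seen in the listing above.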

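clang++ (2.8) misses major optimizations when accessing the 'a' array: it performs no fewer than three laborious address calculations per iteration, each with its own integer multiply, and it loads the array's descriptor fields (the offsets from %r13) inside the loop rather than once before it.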
loop:	
movq     %rax, %rdi
subq     %rdx, %rdi			
imulq    %r14, %rdi
subq     %rcx, %rdi			
addq     %rsi, %rdi			
movq     +0(%r13), %r8			
movsd    (%r8, %rdi, 8), %xmm3			
mulsd    %xmm1, %xmm3			
movq     %rbx, %rdi			
subq     %rdx, %rdi			
imulq    %r14, %rdi
subq     %rcx, %rdi			
addq     %rsi, %rdi			
movsd    (%r8, %rdi, 8), %xmm4			
movapd   %xmm2, %xmm5			
mulsd    %xmm4, %xmm5			
subsd    %xmm3, %xmm5			
movsd    %xmm5, (%r8, %rdi, 8)			
movq     +32(%r13), %rdx			
movq     %rax, %rdi			
subq     %rdx, %rdi			
movq     +0(%r13), %r8			
movq     +8(%r13), %r14			
imulq    %r14, %rdi
movq     +24(%r13), %rcx			
subq     %rcx, %rdi			
addq     %rsi, %rdi			
movsd    (%r8, %rdi, 8), %xmm3			
mulsd    %xmm2, %xmm3			
mulsd    %xmm1, %xmm4			
addsd    %xmm3, %xmm4			
movsd    %xmm4, (%r8, %rdi, 8)			
incq     %rsi			
cmpl     (%r15), %esi			
jle      loop
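
Reading the listing, each access appears to compute (column - lower bound) * column length - lower bound + i and then index off a base pointer. The sketch below spells out that arithmetic in C++. The type view2d, its field names, and the function rotate_view are invented for illustration; the real arr_ref layout may well differ, and the mapping of fields to the offsets in the listing is my guess.

```cpp
#include <cstddef>

// Hypothetical stand-in for a Fortran-style 2-D array reference.
// Field names are invented; this is not the real arr_ref.
struct view2d {
  double* data;            // base pointer
  std::ptrdiff_t ld;       // column length (leading dimension)
  std::ptrdiff_t first_i;  // lower bound of the row index
  std::ptrdiff_t first_j;  // lower bound of the column index

  // Column-major lookup: subtract, multiply, subtract, add on every call
  // unless the compiler manages to hoist the invariant part.
  double& operator()(std::ptrdiff_t i, std::ptrdiff_t j) const {
    return data[(j - first_j) * ld - first_i + i];
  }
};

// The loop from dlasr(), written against the stand-in type.
void rotate_view(view2d const& a, std::ptrdiff_t j, std::ptrdiff_t m,
                 double ctemp, double stemp)
{
  for (std::ptrdiff_t i = 1; i <= m; i++) {
    double temp = a(i, j + 1);
    a(i, j + 1) = ctemp * temp - stemp * a(i, j);
    a(i, j) = stemp * temp + ctemp * a(i, j);
  }
}
```

Everything in the index expression except i is constant across the loop, so an optimizer that sees through the wrapper could reduce each access to a pointer increment.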

Presumably clang++, in its present state of development, is not smart enough to recognize the simple sequential access pattern underneath when the array is declared as
arr_ref<double, 2> a

I think clang has no trouble optimizing properly for arrays like this:
double a[800][800];

Robert P.