[cfe-dev] food for optimizer developers

Chris Lattner clattner at apple.com
Tue Aug 10 08:42:37 PDT 2010


On Aug 10, 2010, at 3:59 AM, Robert Purves wrote:

> 
>> I wrote a Fortran to C++ conversion program that I used to convert selected
>> LAPACK sources. Comparing runtimes with different compilers I get:
>> 
>>                         absolute  relative
>> ifort 11.1.072             1.790s     1.00
>> gfortran 4.4.4             2.470s     1.38
>> g++ 4.4.4                  2.922s     1.63
>> clang++ 2.8 (trunk 108205) 6.487s     3.62
> 
>> - Why is the code generated by clang++ so much slower than the g++ code?
> 
> A "hot spot" in your benchmark dsyev_test.cpp is this loop in dlasr()
> 
>  FEM_DO(i, 1, m) {
>    temp = a(i, j + 1);
>    a(i, j + 1) = ctemp * temp - stemp * a(i, j);
>    a(i, j) = stemp * temp + ctemp * a(i, j);
>  }

Please file a bug with the reduced .cpp testcase.  My wild guess is that this is a failure because we don't have type-based alias analysis (TBAA) yet, and it isn't being worked on.  What flags are you passing to the compiler?  Anything like -ffast-math?  Note that ifort defaults to "fast and loose" numerics, iirc.
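
To make the TBAA guess concrete, here is a minimal sketch of the pattern. It is not the converter's actual arr_ref<double, 2>: the struct ArrRefSketch, its field names, and both function names are hypothetical stand-ins I made up for illustration. The idea is that without type-based alias information, every store through a double& may be assumed to overwrite the view's pointer and integer fields, so they get reloaded and the index arithmetic is redone on each access. Copying those fields into locals by hand removes the question.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in for a 2-D array view with Fortran-style
// column-major layout and 1-based indexing. All names are assumptions.
struct ArrRefSketch {
    double* data;      // base pointer
    long    rows;      // leading dimension (elements per column)
    long    origin_i;  // lower bound of the first index
    long    origin_j;  // lower bound of the second index

    double& operator()(long i, long j) const {
        return data[(j - origin_j) * rows + (i - origin_i)];
    }
};

// Straightforward form of the loop quoted above. Absent type-based alias
// information, a store to a double may be assumed to modify data, rows,
// origin_i, origin_j or m, so the compiler may reload them after each store.
void rotate_naive(const ArrRefSketch& a, const int& m, long j,
                  double ctemp, double stemp) {
    for (int i = 1; i <= m; ++i) {
        double temp = a(i, j + 1);
        a(i, j + 1) = ctemp * temp - stemp * a(i, j);
        a(i, j)     = stemp * temp + ctemp * a(i, j);
    }
}

// Hand-hoisted form: the trip count and the two column base pointers are
// computed once, so the loop body only walks two sequential pointers.
void rotate_hoisted(const ArrRefSketch& a, const int& m, long j,
                    double ctemp, double stemp) {
    const int n = m;                 // local copy of the trip count
    double* col_j  = &a(1, j);       // start of column j
    double* col_j1 = &a(1, j + 1);   // start of column j + 1
    for (int k = 0; k < n; ++k) {
        const double temp = col_j1[k];
        const double aij  = col_j[k];
        col_j1[k] = ctemp * temp - stemp * aij;
        col_j[k]  = stemp * temp + ctemp * aij;
    }
}
```

Both functions should compute the same values. If the hoisted form compiles to something close to the g++ loop while the naive one does not, that would point toward alias analysis rather than the array class as such.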

-Chris

> 
> For the loop body, g++ (4.2) emits unsurprising code.
> loop:				
> movsd    (%rcx), %xmm2
> movapd   %xmm3, %xmm0			
> mulsd    %xmm2, %xmm0			
> movapd   %xmm4, %xmm1			
> mulsd    (%rax), %xmm1			
> subsd    %xmm1, %xmm0			
> movsd    %xmm0, (%rcx)			
> movapd   %xmm3, %xmm0			
> mulsd    (%rax), %xmm0			
> mulsd    %xmm4, %xmm2			
> addsd    %xmm2, %xmm0			
> movsd    %xmm0, (%rax)			
> incl     %esi			
> addq     $8, %rcx			
> addq     $8, %rax			
> cmpl     %esi, +0(%r13)			
> jge      loop	
> 
> clang++ (2.8) misses major optimizations when accessing the 'a' array, and performs no fewer than three laborious address calculations per iteration.
> loop:	
> movq     %rax, %rdi
> subq     %rdx, %rdi			
> imulq    %r14, %rdi
> subq     %rcx, %rdi			
> addq     %rsi, %rdi			
> movq     +0(%r13), %r8			
> movsd    (%r8, %rdi, 8), %xmm3			
> mulsd    %xmm1, %xmm3			
> movq     %rbx, %rdi			
> subq     %rdx, %rdi			
> imulq    %r14, %rdi
> subq     %rcx, %rdi			
> addq     %rsi, %rdi			
> movsd    (%r8, %rdi, 8), %xmm4			
> movapd   %xmm2, %xmm5			
> mulsd    %xmm4, %xmm5			
> subsd    %xmm3, %xmm5			
> movsd    %xmm5, (%r8, %rdi, 8)			
> movq     +32(%r13), %rdx			
> movq     %rax, %rdi			
> subq     %rdx, %rdi			
> movq     +0(%r13), %r8			
> movq     +8(%r13), %r14			
> imulq    %r14, %rdi
> movq     +24(%r13), %rcx			
> subq     %rcx, %rdi			
> addq     %rsi, %rdi			
> movsd    (%r8, %rdi, 8), %xmm3			
> mulsd    %xmm2, %xmm3			
> mulsd    %xmm1, %xmm4			
> addsd    %xmm3, %xmm4			
> movsd    %xmm4, (%r8, %rdi, 8)			
> incq     %rsi			
> cmpl     (%r15), %esi			
> jle      loop
> 
> Presumably clang++, in its present state of development, is not smart enough to notice the underlying simple sequential access pattern, when the array is declared
> arr_ref<double, 2> a
> 
> I think clang has no trouble optimizing properly for arrays like this:
> double  a[800][800];
> 
> Robert P.
> 
> 




