[cfe-dev] food for optimizer developers

Wed Aug 11 03:18:20 PDT 2010

Douglas Gregor wrote:

>>> I wrote a Fortran to C++ conversion program that I used to convert selected
>>> LAPACK sources. Comparing runtimes with different compilers I get:
>>> 
>>>                        absolute  relative
>>> ifort 11.1.072             1.790s     1.00
>>> gfortran 4.4.4             2.470s     1.38
>>> g++ 4.4.4                  2.922s     1.63
>>> clang++ 2.8 (trunk 108205) 6.487s     3.62
>> 
>>> - Why is the code generated by clang++ so much slower than the g++ code?
>> 
>> A "hot spot" in your benchmark dsyev_test.cpp is this loop in dlasr()
>> 
>> FEM_DO(i, 1, m) {
>>   temp = a(i, j + 1);
>>   a(i, j + 1) = ctemp * temp - stemp * a(i, j);
>>   a(i, j) = stemp * temp + ctemp * a(i, j);
>> }
>> 
>> For the loop body, g++ (4.2) emits unsurprising code.
>> 

>> clang++ (2.8) misses major optimizations when accessing the 'a' array and performs no fewer than three laborious address calculations.
>> 

>> Presumably clang++, in its present state of development, is not smart enough to notice the underlying simple sequential access pattern, when the array is declared
>> arr_ref<double, 2> a
> 
> This would make a *wonderful* bug report against the LLVM optimizer... http://llvm.org/bugs/ :)

I believe that would require the cooperation of the OP, because it is his Fortran -> C++ converter. Are you interested, Ralf?
I've started the ball rolling with a much reduced test case.

cat test.cpp
/*
 Background:
 <http://lists.cs.uiuc.edu/pipermail/cfe-dev/2010-August/010258.html>

 Relevant files, including benchmark dsyev_test.cpp:
 <http://cci.lbl.gov/lapack_fem/>

 This file (test.cpp) is a reduced case of dsyev_test.cpp.
 It sheds light on the performance issue with clang++.

 $ clang++ -c -I. -O3 test.cpp -save-temps

 Examine test.s, in which the two inner loops of interest
 are easily identified by their 'subsd' instruction.
 Contrary to expectation, assembly code for loops A and B 
 is different. Loop B contains laborious and redundant 
 address calculations.

 clang --version
 clang version 2.8 (trunk 110653)

 By contrast, g++ (4.2) emits identical assembler for loops A and B.
 */

#include <fem/major_types.hpp>

namespace lapack_dsyev_fem {

  using namespace fem::major_types;

  void
  test(
     int x,
     int const& m,
     int const& n,
     arr_cref<double> c,
     arr_cref<double> s,
     arr_ref<double, 2> a,
     int const& lda)
  {
    c(dimension(star));
    s(dimension(star));
    a(dimension(lda, star));

    int i, j;
    double ctemp, stemp, temp;

    if ( x ) {
      for ( j = m - 1; j >= 1; j-- ) {
        ctemp = c(j);
        stemp = s(j);
        // loop A, identical with loop B below
        for ( i = 1; i <= n; i++ ) {
          temp = a(j + 1, i);
          a(j + 1, i) = ctemp * temp - stemp * a(j, i);
          a(j, i) = stemp * temp + ctemp * a(j, i);
        }
      }
    }
    else {
      for ( j = m - 1; j >= 1; j-- ) {
        ctemp = c(j);
        stemp = s(j);
        // loop B, identical with loop A above
        for ( i = 1; i <= n; i++ ) {
          temp = a(j + 1, i);
          a(j + 1, i) = ctemp * temp - stemp * a(j, i);
          a(j, i) = stemp * temp + ctemp * a(j, i);
        }
      }
    }   
  }

} // namespace lapack_dsyev_fem
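
To make the "simple sequential access pattern" concrete, here is a minimal sketch of the inner loop written two ways. It does not use the real fem headers: ref2d is a hypothetical stand-in for arr_ref<double, 2>, and the 1-based, column-major (Fortran-style) layout is my assumption about what a(dimension(lda, star)) sets up. rotate_indexed computes a full address for every a(...) access, as the source is written; rotate_strided is the hand-reduced form one would hope an optimizer derives, with a single offset advanced by lda per iteration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a 1-based, column-major 2-D array reference.
// This is NOT the real fem::arr_ref; the layout is an assumption.
struct ref2d {
  double* data;
  std::ptrdiff_t lda;  // leading dimension (number of rows)

  double& operator()(std::ptrdiff_t i, std::ptrdiff_t j) const {
    return data[(i - 1) + (j - 1) * lda];
  }
};

// Inner loop as written in test.cpp: every access recomputes
// (row - 1) + (col - 1) * lda from scratch.
void rotate_indexed(ref2d a, int j, int n, double ctemp, double stemp) {
  assert(j >= 1 && j + 1 <= a.lda);
  for (int i = 1; i <= n; i++) {
    double temp = a(j + 1, i);
    a(j + 1, i) = ctemp * temp - stemp * a(j, i);
    a(j, i) = stemp * temp + ctemp * a(j, i);
  }
}

// Same arithmetic with the address computation hoisted: rows j and j + 1
// are adjacent in memory, and moving to the next column adds lda.
void rotate_strided(ref2d a, int j, int n, double ctemp, double stemp) {
  assert(j >= 1 && j + 1 <= a.lda);
  std::ptrdiff_t off = j - 1;  // offset of a(j, 1)
  for (int i = 1; i <= n; i++, off += a.lda) {
    double* p = a.data + off;  // p[0] is a(j, i), p[1] is a(j + 1, i)
    double temp = p[1];
    p[1] = ctemp * temp - stemp * p[0];
    p[0] = stemp * temp + ctemp * p[0];
  }
}
```

Both functions should leave the array in the same state for the same arguments; the only difference is how much index arithmetic is spelled out per iteration. Comparing the assembly for the two is another quick way to see whether a given compiler recovers the strided form on its own.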

Robert P.