[cfe-dev] Clang build of ATLAS (and speed comparison)

Thu Sep 8 02:22:01 PDT 2011

Hi there,

Together with Clint Whaley, the author and maintainer of the ATLAS suite, I am currently evaluating clang/LLVM performance, with a view to supporting clang as a compiler for building ATLAS alongside gcc. On my side, the goal is to add this option to the MacPorts port.

As a preliminary report, here is the output of the ATLAS built-in benchmark (‘make time’) after compiling with the -Oz flag. "Reference" refers to an installation presumably built with GCC 4.x on Linux (Clint, could you elaborate on this?). The machine is a three-year-old MacBook (late 2008, Core 2 Duo), and the compiler is the version of clang shipped with the latest Xcode 4.2:

> Apple clang version 3.0 (tags/Apple/clang-211.9) (based on LLVM 3.0svn)

Dragonegg with gcc 4.5 is used for Fortran compilation, but I don’t think it is relevant here (again, Clint, could you confirm this?). Either way, all the generated code comes out of LLVM.

---
The times labeled Reference are for ATLAS as installed by the authors.
NAMING ABBREVIATIONS:
   kSelMM : selected matmul kernel (may be hand-tuned)
   kGenMM : generated matmul kernel
   kMM_NT : worst no-copy kernel
   kMM_TN : best no-copy kernel
   BIG_MM : large GEMM timing (usually N=1600); estimate of asymptotic peak
   kMV_N  : NoTranspose matvec kernel
   kMV_T  : Transpose matvec kernel
   kGER   : GER (rank-1 update) kernel
Kernel routines are not called by the user directly, and their
performance is often somewhat different than the total
algorithm (eg, dGER perf may differ from dkGER)

Reference clock rate=2394Mhz, new rate=2400Mhz
   Refrenc : % of clock rate achieved by reference install
   Present : % of clock rate achieved by present ATLAS install

                    single precision                  double precision
            ********************************   *******************************
                  real           complex           real           complex
            ---------------  ---------------  ---------------  ---------------
Benchmark   Refrenc Present  Refrenc Present  Refrenc Present  Refrenc Present
=========   ======= =======  ======= =======  ======= =======  ======= =======
  kSelMM      646.0   608.2    611.2   571.5    369.1   368.5    355.4   356.5
  kGenMM      185.1   154.5    183.0   153.1    173.7   147.6    176.1   151.1
  kMM_NT      172.7   145.6    166.6   140.9    164.0   132.5    174.6   140.0
  kMM_TN      185.8   150.8    178.6   148.5    174.1   114.3    167.9   132.5
  BIG_MM      603.3   582.4    600.5   590.2    351.2   354.8    341.8   348.9
   kMV_N       74.5   100.5    147.7   161.1     43.8    54.3     88.1    98.5
   kMV_T       70.6   101.4    156.2   176.5     48.7    53.7     90.5   104.4
    kGER       39.5    46.6     81.7    86.8     20.4    26.9     40.4    52.8
---
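
For readers who prefer absolute numbers, the percentages can be converted back to MFLOPS. This is only a sketch: it assumes that "X % of clock rate" means MFLOPS = X/100 × clock rate in MHz, which is one reading of the legend and is not stated explicitly in the output. The helper name `mflops` is made up for illustration; the inputs are the double-precision real BIG_MM figures and the two clock rates from the table.

```python
def mflops(percent_of_clock, clock_mhz):
    """Convert an ATLAS '% of clock rate' figure to MFLOPS.

    Assumption (not confirmed by the output above): 100 % corresponds to
    one floating-point operation per clock cycle.
    """
    return percent_of_clock / 100.0 * clock_mhz


# Double precision, real, BIG_MM row; clock rates taken from the header.
ref_big_mm = mflops(351.2, 2394)  # reference install, 2394 MHz
new_big_mm = mflops(354.8, 2400)  # present (clang) install, 2400 MHz

print(f"reference BIG_MM: {ref_big_mm:.0f} MFLOPS")
print(f"present   BIG_MM: {new_big_mm:.0f} MFLOPS")
```

Normalizing by clock rate is what makes the two columns comparable even though the machines run at slightly different frequencies.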

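What we see is that, while clang seems to outperform GCC on level 2 BLAS ops (matrix • vector: kMV_N, kMV_T and kGER), it is consistently slower on the level 3 kernels in rows 2, 3 and 4 of the table (kGenMM, kMM_NT and kMM_TN): roughly 15 to 20 % in most columns, and about 34 % for double-precision real kMM_TN.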
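
The gap can be checked directly from the table. The sketch below uses only the single-precision real columns, copied from the output above; the function name `relative_change` is illustrative and not part of ATLAS.

```python
# Single precision, real: (Refrenc, Present) pairs copied from the table.
single_real = {
    "kGenMM": (185.1, 154.5),
    "kMM_NT": (172.7, 145.6),
    "kMM_TN": (185.8, 150.8),
    "kMV_N": (74.5, 100.5),
    "kMV_T": (70.6, 101.4),
    "kGER": (39.5, 46.6),
}


def relative_change(ref, new):
    """Percent change of `new` relative to `ref` (negative means slower)."""
    return 100.0 * (new - ref) / ref


changes = {
    name: relative_change(ref, new) for name, (ref, new) in single_real.items()
}

for name, pct in changes.items():
    print(f"{name:7s} {pct:+6.1f} %")
```

The three matmul kernels come out negative and the three level 2 kernels positive; the other precision columns can be checked the same way.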

The -O0 build produces an oddity we are still investigating, so I will not report those results yet. The -O3 figures are about the same as with -Oz. This is summarized in the following table, with both the clang and gcc builds done on the MacBook mentioned above (-O2 was used for gcc):

                    single precision                  double precision
            ********************************   *******************************
                  real           complex           real           complex
            ---------------  ---------------  ---------------  ---------------
Benchmark     Clang  GCC4.5    Clang  GCC4.5    Clang  GCC4.5    Clang  GCC4.5
=========   ======= =======  ======= =======  ======= =======  ======= =======
  kSelMM      585.5   592.9    556.9   630.0    368.4   368.6    356.4   359.2
  kGenMM      154.4   180.4    153.1   183.4    147.7   165.6    159.4   172.5
  kMM_NT      146.3   165.7    146.0   165.4    132.3   138.9    139.9   161.7
  kMM_TN      150.9   181.2    149.9   180.8    116.3   168.1    134.2   165.8
  BIG_MM      572.7   589.4    581.7   550.2    354.7   353.9    347.8   347.1
   kMV_N       96.9    83.1    164.2   160.3     53.7    48.3    102.7    97.7
   kMV_T       96.3    75.2    173.1   166.2     53.2    55.1     98.2    97.9
    kGER       46.9    41.8     86.3    85.5     26.9    21.7     53.0    42.0
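
To condense this table into one number per kernel, one option is to average the Clang/GCC ratio over the four precision columns. This is just one possible summary and not something ATLAS reports; the variable names are illustrative, and the values are copied from the table above.

```python
from statistics import mean

# (Clang, GCC4.5) pairs in column order:
# single real, single complex, double real, double complex.
table = {
    "kSelMM": [(585.5, 592.9), (556.9, 630.0), (368.4, 368.6), (356.4, 359.2)],
    "kGenMM": [(154.4, 180.4), (153.1, 183.4), (147.7, 165.6), (159.4, 172.5)],
    "kMM_NT": [(146.3, 165.7), (146.0, 165.4), (132.3, 138.9), (139.9, 161.7)],
    "kMM_TN": [(150.9, 181.2), (149.9, 180.8), (116.3, 168.1), (134.2, 165.8)],
    "BIG_MM": [(572.7, 589.4), (581.7, 550.2), (354.7, 353.9), (347.8, 347.1)],
    "kMV_N": [(96.9, 83.1), (164.2, 160.3), (53.7, 48.3), (102.7, 97.7)],
    "kMV_T": [(96.3, 75.2), (173.1, 166.2), (53.2, 55.1), (98.2, 97.9)],
    "kGER": [(46.9, 41.8), (86.3, 85.5), (26.9, 21.7), (53.0, 42.0)],
}

# A ratio above 1.0 means clang is faster than gcc for that kernel on average.
mean_ratio = {
    name: mean(clang / gcc for clang, gcc in pairs)
    for name, pairs in table.items()
}

for name, ratio in mean_ratio.items():
    print(f"{name:7s} clang/gcc = {ratio:.2f}")
```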

Please note, and this is also important, that clang does not seem to produce correct code at either -O0 or -O3: both builds fail the ATLAS sanity checks, with wrong results at -O0 and crashes at -O3. The -Oz build does pass, though.

Vincent