[LLVMdev] Polyhedron 2005 results for dragonegg 3.3svn

Duncan Sands duncan.sands at gmail.com
Mon Jun 3 13:00:04 PDT 2013


[Resending without the bitcode attached, which was too big for the mailing
list].

Hi Nadav,

On 02/06/13 19:08, Nadav Rotem wrote:
> Jack,
>
> Can you please file a bug report and attach the BC files for the major loops
> that we miss ?

I took a look, and it's not clear what vectorization has to do with it; it seems
to be a missed fast-math optimization.  I've attached bitcode where only LLVM
optimizations are run (fatigue0.ll) and where GCC optimizations are run before
the LLVM optimizations (fatigue1.ll).  The hottest instruction is the same in both:

fatigue0.ll:
    %329 = fsub fast double %327, %328, !dbg !1077

fatigue1.ll:
    %1504 = fsub fast double %1501, %1503, !dbg !1148

However, in the GCC version it is twice as hot as in the LLVM-only version,
i.e. in the LLVM-only version instructions elsewhere are consuming a lot of
time.  In the LLVM-only version there are 9 fdiv instructions in that basic
block, while the GCC version has only one.  From the profile it looks like each
of them consumes quite some time, and together they chew up a lot of it.  I
think this explains the speed difference.

All of the fdiv's have the same denominator:
    %260 = fdiv fast double %253, %259
...
    %262 = fdiv fast double %219, %259
...
    %264 = fdiv fast double %224, %259
...
    %266 = fdiv fast double %230, %259
and so on.  It looks like GCC takes the reciprocal
    %1445 = fdiv fast double 1.000000e+00, %1439
and then turns the fdiv's into fmul's.
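
To make that concrete, here is a rough sketch of what the LLVM-only block could
look like after the same rewrite.  The value numbers are kept from fatigue0.ll
purely for readability, %recip is a made-up name, and the elided instructions
are marked with "..." as above:

    ; sketch only: value numbers are illustrative, not real SSA numbering
    %recip = fdiv fast double 1.000000e+00, %259
    %260 = fmul fast double %253, %recip
...
    %262 = fmul fast double %219, %recip
...
    %264 = fmul fast double %224, %recip
...
    %266 = fmul fast double %230, %recip

That is one fdiv plus a bunch of fmul's instead of nine fdiv's.  Multiplying by
the rounded reciprocal doesn't give bit-identical results to dividing, which is
why the transformation is only valid under fast-math.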

I'm not sure what the best way to implement this optimization in LLVM is.  Maybe
Shuxin has some ideas.

So it looks like a missed fast-math optimization rather than anything to do with
vectorization, which is strange, as GCC only gets the big speedup when its
vectorizer is turned on.

Ciao, Duncan.

>
> Thanks,
> Nadav
>
>
> On Jun 2, 2013, at 1:27, Duncan Sands <duncan.sands at gmail.com> wrote:
>
>> Hi Jack, thanks for splitting out what the effects of LLVM's / GCC's vectorizers
>> are.
>>
>> On 01/06/13 21:34, Jack Howarth wrote:
>>> On Sat, Jun 01, 2013 at 06:45:48AM +0200, Duncan Sands wrote:
>>>>
>>>> These results are very disappointing, I was hoping to see a big improvement
>>>> somewhere instead of no real improvement anywhere (except for gas_dyn) or a
>>>> regression (eg: mdbx).  I think LLVM now has a reasonable array of fast-math
>>>> optimizations.  I will try to find time to poke at gas_dyn and induct: since
>>>> turning on gcc's optimizations there halves the run-time, LLVM's IR optimizers
>>>> are clearly missing something important.
>>>>
>>>> Ciao, Duncan.
>>>
>>> Duncan,
>>>    Appended are another set of benchmark runs where I attempted to decouple the
>>> fast math optimizations from the vectorization by passing -fno-tree-vectorize.
>>> I am unclear whether dragonegg really honors -fno-tree-vectorize to disable the
>>> LLVM vectorization.
>>
>> Yes, it does disable LLVM vectorization.
>>
>>>
>>> Tested on x86_apple-darwin12
>>>
>>> Compile Flags: -ffast-math -funroll-loops -O3 -fno-tree-vectorize
>>
>> Maybe -march=native would be a good addition.
>>
>>>
>>> de-gfc48: /sw/lib/gcc4.8/bin/gfortran
>>> -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so
>>> -specs=/sw/lib/gcc4.8/lib/integrated-as.specs
>>> de-gfc48+optzns: /sw/lib/gcc4.8/bin/gfortran
>>> -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so
>>> -specs=/sw/lib/gcc4.8/lib/integrated-as.specs
>>> -fplugin-arg-dragonegg-enable-gcc-optzns
>>> gfortran48: /sw/bin/gfortran-fsf-4.8
>>>
>>> Run time (secs)
>>
>> What is the standard deviation for each benchmark?  If each run varies by +-5%,
>> then the runtime changes of around 3% measured below don't mean anything.
>>
>>
>> Comparing with your previous benchmarks, I see:
>>
>>>
>>> Benchmark     de-gfc48  de-gfc48   gfortran48
>>>                         +optzns
>>>
>>> ac             11.33      8.10       8.02
>>
>> Turning on LLVM's vectorizer gives a 2% slowdown.
>>
>>> aermod         16.03     14.45      16.13
>>
>> Turning on LLVM's vectorizer gives a 2.5% slowdown.
>>
>>> air             6.80      5.28       5.73
>>> capacita       39.89     35.21      34.96
>>
>> Turning on LLVM's vectorizer gives a 5% speedup.  GCC gets a 5.5% speedup from
>> its vectorizer.
>>
>>> channel         2.06      2.29       2.69
>>
>> GCC gets a 30% speedup from its vectorizer which LLVM doesn't get.  On the
>> other hand, without vectorization LLVM's version runs 23% faster than GCC's, so
>> while GCC's vectorizer puts GCC in the lead, the final speed difference is
>> more on the order of GCC being 10% faster.
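>> (Working that out from the numbers above: channel takes 2.06s with dragonegg
>> and 2.69s with gfortran when neither vectorizes; 2.69/2.06 is about 1.31, i.e.
>> the dragonegg binary needs roughly 23% less time.)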
>>
>>> doduc          27.35     26.13      25.74
>>> fatigue         8.83      4.82       4.67
>>
>> GCC gets a 17% speedup from its vectorizer which LLVM doesn't get.
>> This is a good one to look at, because all the difference between GCC
>> and LLVM is coming from the mid-level optimizers: turning on GCC optzns
>> in dragonegg speeds up the program to GCC levels, so it is possible to
>> get LLVM IR with and without the effect of GCC optimizations, which should
>> make it fairly easy to understand what GCC is doing right here.
>>
>>> gas_dyn        11.41      9.79       9.60
>>
>> Turning on LLVM's vectorizer gives a 30% speedup.  GCC gets a comparable
>> speedup from its vectorizer.
>>
>>> induct         23.95     21.75      21.14
>>
>> GCC gets a 40% speedup from its vectorizer which LLVM doesn't get.  Like
>> fatigue, this is a case where we can get IR showing all the improvements that
>> the GCC optimizers made.
>>
>>> linpk          15.49     15.48      15.69
>>> mdbx           11.91     11.28      11.39
>>
>> Turning on LLVM's vectorizer gives a 2% slowdown.
>>
>>> nf             29.92     29.57      27.99
>>> protein        36.34     33.94      31.91
>>
>> Turning on LLVM's vectorizer gives a 3% speedup.
>>
>>> rnflow         25.97     25.27      22.78
>>
>> GCC gets a 7% speedup from its vectorizer which LLVM doesn't get.
>>
>>> test_fpu       11.48     10.91       9.64
>>
>> GCC gets a 17% speedup from its vectorizer which LLVM doesn't get.
>>
>>> tfft            1.92      1.91       1.91
>>>
>>> Geom. Mean     13.12     11.70      11.64
>>
>> Ciao, Duncan.
>>
>>>
>>> Assuming that the de-gfc48+optzns run really has disabled the LLVM vectorization,
>>> I am hoping that additional benchmarking of de-gfc48+optzns with individual
>>> -ffast-math optimizations disabled (such as passing -fno-unsafe-math-optimizations)
>>> may give us a clue as to the origin of the performance delta between the stock
>>> dragonegg results with -ffast-math and those with
>>> -fplugin-arg-dragonegg-enable-gcc-optzns.
>>>       Jack
>>>
>>




