[LLVMdev] Polyhedron 2005 results for dragonegg 3.3svn
Shuxin Yang
shuxin.llvm at gmail.com
Mon Jun 3 10:12:59 PDT 2013
Actually, this kind of opportunity, as outlined below, was one of my
contrived motivating examples for fast-math. But last year we didn't see
such opportunities in the real applications we care about.
t1 = x1/y
...
t2 = x2/y
I think this is better taken care of by GVN/PRE -- blindly converting
x/y => x * (1/y) is not necessarily
beneficial. Or maybe we can blindly perform such a transformation at an
early stage, and later
convert it back if the reciprocals are not CSEed away.
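The trade-off can be sketched in a few lines of Python (values and function names are invented for illustration; they are not from the benchmark). The rewritten form does one division and n multiplications instead of n divisions, but it rounds twice per element, which is why it is only legal under fast-math:

```python
def divide_each(xs, y):
    # Original form: one (expensive) division per element.
    return [x / y for x in xs]

def divide_each_recip(xs, y):
    # Rewritten form: a single division for the reciprocal,
    # then one cheap multiplication per element.
    r = 1.0 / y
    return [x * r for x in xs]

xs = [0.3, 1.7, 2.5, 9.1]
y = 3.0

a = divide_each(xs, y)
b = divide_each_recip(xs, y)

# The two forms are close but not guaranteed bit-identical, since
# x * (1/y) rounds twice; hence the rewrite needs fast-math.
print(max(abs(p - q) for p, q in zip(a, b)))
```

Note that the rewrite only pays off when the reciprocal is reused; for a single division it just adds a multiply, which is the reason for the "convert it back if not CSEed" idea above.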
On 6/3/13 8:53 AM, Duncan Sands wrote:
> Hi Nadav,
>
> On 02/06/13 19:08, Nadav Rotem wrote:
>> Jack,
>>
>> Can you please file a bug report and attach the BC files for the
>> major loops
>> that we miss ?
>
> I took a look and it's not clear what vectorization has to do with it;
> it seems
> to be a missed fast-math optimization. I've attached bitcode where
> only LLVM
> optimizations are run (fatigue0.ll) and where GCC optimizations are
> run before
> LLVM optimizations (fatigue1.ll). The hottest instruction is the same
> in both:
>
> fatigue0.ll:
> %329 = fsub fast double %327, %328, !dbg !1077
>
> fatigue1.ll:
> %1504 = fsub fast double %1501, %1503, !dbg !1148
>
> However in the GCC version it is twice as hot as in the LLVM only
> version,
> i.e. in the LLVM only version instructions elsewhere are consuming a
> lot of
> time. In the LLVM only version there are 9 fdiv instructions in that
> basic
> block while GCC has only one. From the profile it looks like each of
> them is
> consuming quite some time, and all together they chew up a lot of
> time. I
> think this explains the speed difference.
>
> All of the fdiv's have the same denominator:
> %260 = fdiv fast double %253, %259
> ...
> %262 = fdiv fast double %219, %259
> ...
> %264 = fdiv fast double %224, %259
> ...
> %266 = fdiv fast double %230, %259
> and so on. It looks like GCC takes the reciprocal
> %1445 = fdiv fast double 1.000000e+00, %1439
> and then turns the fdiv's into fmul's.
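In IR terms, the rewrite being described looks roughly like the following (a sketch only: register names are invented and do not come from the actual bitcode):

```llvm
; Before (as in fatigue0.ll): one fdiv per use of the common denominator %d.
%q1 = fdiv fast double %a, %d
%q2 = fdiv fast double %b, %d
%q3 = fdiv fast double %c, %d

; After (what GCC effectively produces): one reciprocal, then fmuls.
%r    = fdiv fast double 1.000000e+00, %d
%q1.r = fmul fast double %a, %r
%q2.r = fmul fast double %b, %r
%q3.r = fmul fast double %c, %r
```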
>
> I'm not sure what the best way to implement this optimization in LLVM
> is. Maybe
> Shuxin has some ideas.
>
> So it looks like a missed fast-math optimization rather than anything
> to do with
> vectorization, which is strange as GCC only gets the big speedup when
> vectorization is turned on.
>
> Ciao, Duncan.
>
>>
>> Thanks,
>> Nadav
>>
>>
>> On Jun 2, 2013, at 1:27, Duncan Sands <duncan.sands at gmail.com
>> <mailto:duncan.sands at gmail.com>> wrote:
>>
>>> Hi Jack, thanks for splitting out what the effects of LLVM's / GCC's
>>> vectorizers
>>> are.
>>>
>>> On 01/06/13 21:34, Jack Howarth wrote:
>>>> On Sat, Jun 01, 2013 at 06:45:48AM +0200, Duncan Sands wrote:
>>>>>
>>>>> These results are very disappointing, I was hoping to see a big
>>>>> improvement
>>>>> somewhere instead of no real improvement anywhere (except for
>>>>> gas_dyn) or a
>>>>> regression (eg: mdbx). I think LLVM now has a reasonable array of
>>>>> fast-math
>>>>> optimizations. I will try to find time to poke at gas_dyn and
>>>>> induct: since
>>>>> turning on gcc's optimizations there halves the run-time, LLVM's IR
>>>>> optimizers
>>>>> are clearly missing something important.
>>>>>
>>>>> Ciao, Duncan.
>>>>
>>>> Duncan,
>>>> Appended are another set of benchmark runs where I attempted to
>>>> decouple the
>>>> fast math optimizations from the vectorization by passing
>>>> -fno-tree-vectorize.
>>>> I am unclear whether dragonegg really honors -fno-tree-vectorize to
>>>> disable the llvm
>>>> vectorization.
>>>
>>> Yes, it does disable LLVM vectorization.
>>>
>>>>
>>>> Tested on x86_apple-darwin12
>>>>
>>>> Compile Flags: -ffast-math -funroll-loops -O3 -fno-tree-vectorize
>>>
>>> Maybe -march=native would be a good addition.
>>>
>>>>
>>>> de-gfc48: /sw/lib/gcc4.8/bin/gfortran
>>>> -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so
>>>> -specs=/sw/lib/gcc4.8/lib/integrated-as.specs
>>>> de-gfc48+optzns: /sw/lib/gcc4.8/bin/gfortran
>>>> -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so
>>>> -specs=/sw/lib/gcc4.8/lib/integrated-as.specs
>>>> -fplugin-arg-dragonegg-enable-gcc-optzns
>>>> gfortran48: /sw/bin/gfortran-fsf-4.8
>>>>
>>>> Run time (secs)
>>>
>>> What is the standard deviation for each benchmark? If each run
>>> varies by +-5%
>>> then that means that the changes in runtime of around 3% measured
>>> below don't
>>> mean anything.
>>>
>>>
>>> Comparing with your previous benchmarks, I see:
>>>
>>>>
>>>> Benchmark de-gfc48 de-gfc48 gfortran48
>>>> +optzns
>>>>
>>>> ac 11.33 8.10 8.02
>>>
>>> Turning on LLVM's vectorizer gives a 2% slowdown.
>>>
>>>> aermod 16.03 14.45 16.13
>>>
>>> Turning on LLVM's vectorizer gives a 2.5% slowdown.
>>>
>>>> air 6.80 5.28 5.73
>>>> capacita 39.89 35.21 34.96
>>>
>>> Turning on LLVM's vectorizer gives a 5% speedup. GCC gets a 5.5%
>>> speedup from
>>> its vectorizer.
>>>
>>>> channel 2.06 2.29 2.69
>>>
>>> GCC gets a 30% speedup from its vectorizer which LLVM doesn't
>>> get. On the
>>> other hand, without vectorization LLVM's version runs 23% faster
>>> than GCC's, so
>>> while GCC's vectorizer puts GCC into the lead, the final speed
>>> difference is
>>> more on the order of GCC being 10% faster.
>>>
>>>> doduc 27.35 26.13 25.74
>>>> fatigue 8.83 4.82 4.67
>>>
>>> GCC gets a 17% speedup from its vectorizer which LLVM doesn't get.
>>> This is a good one to look at, because all the difference between GCC
>>> and LLVM is coming from the mid-level optimizers: turning on GCC optzns
>>> in dragonegg speeds up the program to GCC levels, so it is possible to
>>> get LLVM IR with and without the effect of GCC optimizations, which
>>> should
>>> make it fairly easy to understand what GCC is doing right here.
>>>
>>>> gas_dyn 11.41 9.79 9.60
>>>
>>> Turning on LLVM's vectorizer gives a 30% speedup. GCC gets a
>>> comparable
>>> speedup from its vectorizer.
>>>
>>>> induct 23.95 21.75 21.14
>>>
>>> GCC gets a 40% speedup from its vectorizer which LLVM doesn't
>>> get. Like
>>> fatigue, this is a case where we can get IR showing all the
>>> improvements that
>>> the GCC optimizers made.
>>>
>>>> linpk 15.49 15.48 15.69
>>>> mdbx 11.91 11.28 11.39
>>>
>>> Turning on LLVM's vectorizer gives a 2% slowdown.
>>>
>>>> nf 29.92 29.57 27.99
>>>> protein 36.34 33.94 31.91
>>>
>>> Turning on LLVM's vectorizer gives a 3% speedup.
>>>
>>>> rnflow 25.97 25.27 22.78
>>>
>>> GCC gets a 7% speedup from its vectorizer which LLVM doesn't get.
>>>
>>>> test_fpu 11.48 10.91 9.64
>>>
>>> GCC gets a 17% speedup from its vectorizer which LLVM doesn't get.
>>>
>>>> tfft 1.92 1.91 1.91
>>>>
>>>> Geom. Mean 13.12 11.70 11.64
>>>
>>> Ciao, Duncan.
>>>
>>>>
>>>> Assuming that the de-gfc48+optzns run really has disabled the llvm
>>>> vectorization,
>>>> I am hoping that additional benchmarking of de-gfc48+optzns with
>>>> individual
>>>> -ffast-math optimizations disabled (such as passing
>>>> -fno-unsafe-math-optimizations)
>>>> may give us a clue as to the origin of the performance delta
>>>> between the stock
>>>> dragonegg results with -ffast-math and those with
>>>> -fplugin-arg-dragonegg-enable-gcc-optzns.
>>>> Jack
>>>>
>>>
>