[LLVMdev] Polyhedron 2005 results for dragonegg 3.3svn
Duncan Sands
duncan.sands at gmail.com
Tue Jun 4 08:34:04 PDT 2013
Hi Shuxin,
On 03/06/13 19:12, Shuxin Yang wrote:
> Actually this kind of opportunity, as outlined below, was one of my contrived
> motivating examples for fast-math. But last year we didn't see such
> opportunities in the real applications we care about.
>
> t1 = x1/y
> ...
> t2 = x2/y.
>
> I think it is better taken care of by GVN/PRE -- blindly converting x/y =>
> x * (1/y) is not necessarily beneficial. Or maybe we can blindly perform such a
> transformation at an early stage, and later convert it back if the reciprocals
> are not CSEed away.
I've opened PR16218 to track this.
Ciao, Duncan.
>
>
> On 6/3/13 8:53 AM, Duncan Sands wrote:
>> Hi Nadav,
>>
>> On 02/06/13 19:08, Nadav Rotem wrote:
>>> Jack,
>>>
>>> Can you please file a bug report and attach the BC files for the major loops
>>> that we miss ?
>>
>> I took a look, and it's not clear what vectorization has to do with it; it
>> seems to be a missed fast-math optimization. I've attached bitcode where only
>> LLVM optimizations are run (fatigue0.ll) and where GCC optimizations are run
>> before LLVM optimizations (fatigue1.ll). The hottest instruction is the same
>> in both:
>>
>> fatigue0.ll:
>> %329 = fsub fast double %327, %328, !dbg !1077
>>
>> fatigue1.ll:
>> %1504 = fsub fast double %1501, %1503, !dbg !1148
>>
>> However, in the GCC version it is twice as hot as in the LLVM-only version,
>> i.e. in the LLVM-only version instructions elsewhere are consuming a lot of
>> time. In the LLVM-only version there are 9 fdiv instructions in that basic
>> block, while GCC has only one. From the profile it looks like each of them
>> consumes quite some time, and together they chew up a lot of it. I think
>> this explains the speed difference.
>>
>> All of the fdiv's have the same denominator:
>> %260 = fdiv fast double %253, %259
>> ...
>> %262 = fdiv fast double %219, %259
>> ...
>> %264 = fdiv fast double %224, %259
>> ...
>> %266 = fdiv fast double %230, %259
>> and so on. It looks like GCC takes the reciprocal
>> %1445 = fdiv fast double 1.000000e+00, %1439
>> and then turns the fdiv's into fmul's.
>>
>> I'm not sure what the best way to implement this optimization in LLVM is. Maybe
>> Shuxin has some ideas.
>>
>> So it looks like a missed fast-math optimization rather than anything to do with
>> vectorization, which is strange as GCC only gets the big speedup when
>> vectorization is turned on.
>>
>> Ciao, Duncan.
>>
>>>
>>> Thanks,
>>> Nadav
>>>
>>>
>>> On Jun 2, 2013, at 1:27, Duncan Sands <duncan.sands at gmail.com
>>> <mailto:duncan.sands at gmail.com>> wrote:
>>>
>>>> Hi Jack, thanks for splitting out what the effect of LLVM's / GCC's
>>>> vectorizers is.
>>>>
>>>> On 01/06/13 21:34, Jack Howarth wrote:
>>>>> On Sat, Jun 01, 2013 at 06:45:48AM +0200, Duncan Sands wrote:
>>>>>>
>>>>>> These results are very disappointing; I was hoping to see a big improvement
>>>>>> somewhere instead of no real improvement anywhere (except for gas_dyn) or a
>>>>>> regression (e.g. mdbx). I think LLVM now has a reasonable array of fast-math
>>>>>> optimizations. I will try to find time to poke at gas_dyn and induct: since
>>>>>> turning on gcc's optimizations there halves the run-time, LLVM's IR
>>>>>> optimizers are clearly missing something important.
>>>>>>
>>>>>> Ciao, Duncan.
>>>>>
>>>>> Duncan,
>>>>> Appended is another set of benchmark runs where I attempted to decouple the
>>>>> fast-math optimizations from the vectorization by passing -fno-tree-vectorize.
>>>>> I am unclear if dragonegg really honors -fno-tree-vectorize to disable the
>>>>> llvm vectorization.
>>>>
>>>> Yes, it does disable LLVM vectorization.
>>>>
>>>>>
>>>>> Tested on x86_apple-darwin12
>>>>>
>>>>> Compile Flags: -ffast-math -funroll-loops -O3 -fno-tree-vectorize
>>>>
>>>> Maybe -march=native would be a good addition.
>>>>
>>>>>
>>>>> de-gfc48: /sw/lib/gcc4.8/bin/gfortran
>>>>> -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so
>>>>> -specs=/sw/lib/gcc4.8/lib/integrated-as.specs
>>>>> de-gfc48+optzns: /sw/lib/gcc4.8/bin/gfortran
>>>>> -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so
>>>>> -specs=/sw/lib/gcc4.8/lib/integrated-as.specs
>>>>> -fplugin-arg-dragonegg-enable-gcc-optzns
>>>>> gfortran48: /sw/bin/gfortran-fsf-4.8
>>>>>
>>>>> Run time (secs)
>>>>
>>>> What is the standard deviation for each benchmark? If each run varies by
>>>> +-5%, then the changes in runtime of around 3% measured below don't mean
>>>> anything.
>>>>
>>>>
>>>> Comparing with your previous benchmarks, I see:
>>>>
>>>>>
>>>>> Benchmark de-gfc48 de-gfc48 gfortran48
>>>>> +optzns
>>>>>
>>>>> ac 11.33 8.10 8.02
>>>>
>>>> Turning on LLVM's vectorizer gives a 2% slowdown.
>>>>
>>>>> aermod 16.03 14.45 16.13
>>>>
>>>> Turning on LLVM's vectorizer gives a 2.5% slowdown.
>>>>
>>>>> air 6.80 5.28 5.73
>>>>> capacita 39.89 35.21 34.96
>>>>
>>>> Turning on LLVM's vectorizer gives a 5% speedup. GCC gets a 5.5% speedup from
>>>> its vectorizer.
>>>>
>>>>> channel 2.06 2.29 2.69
>>>>
>>>> GCC gets a 30% speedup from its vectorizer which LLVM doesn't get. On the
>>>> other hand, without vectorization LLVM's version runs 23% faster than GCC's,
>>>> so while GCC's vectorizer leaps GCC into the lead, the final speed difference
>>>> is more on the order of GCC being 10% faster.
>>>>
>>>>> doduc 27.35 26.13 25.74
>>>>> fatigue 8.83 4.82 4.67
>>>>
>>>> GCC gets a 17% speedup from its vectorizer which LLVM doesn't get.
>>>> This is a good one to look at, because all the difference between GCC
>>>> and LLVM is coming from the mid-level optimizers: turning on GCC optzns
>>>> in dragonegg speeds up the program to GCC levels, so it is possible to
>>>> get LLVM IR with and without the effect of GCC optimizations, which should
>>>> make it fairly easy to understand what GCC is doing right here.
>>>>
>>>>> gas_dyn 11.41 9.79 9.60
>>>>
>>>> Turning on LLVM's vectorizer gives a 30% speedup. GCC gets a comparable
>>>> speedup from its vectorizer.
>>>>
>>>>> induct 23.95 21.75 21.14
>>>>
>>>> GCC gets a 40% speedup from its vectorizer which LLVM doesn't get. Like
>>>> fatigue, this is a case where we can get IR showing all the improvements that
>>>> the GCC optimizers made.
>>>>
>>>>> linpk 15.49 15.48 15.69
>>>>> mdbx 11.91 11.28 11.39
>>>>
>>>> Turning on LLVM's vectorizer gives a 2% slowdown.
>>>>
>>>>> nf 29.92 29.57 27.99
>>>>> protein 36.34 33.94 31.91
>>>>
>>>> Turning on LLVM's vectorizer gives a 3% speedup.
>>>>
>>>>> rnflow 25.97 25.27 22.78
>>>>
>>>> GCC gets a 7% speedup from its vectorizer which LLVM doesn't get.
>>>>
>>>>> test_fpu 11.48 10.91 9.64
>>>>
>>>> GCC gets a 17% speedup from its vectorizer which LLVM doesn't get.
>>>>
>>>>> tfft 1.92 1.91 1.91
>>>>>
>>>>> Geom. Mean 13.12 11.70 11.64
>>>>
>>>> Ciao, Duncan.
>>>>
>>>>>
>>>>> Assuming that the de-gfc48+optzns run really has disabled the llvm
>>>>> vectorization, I am hoping that additional benchmarking of de-gfc48+optzns
>>>>> with individual -ffast-math optimizations disabled (such as passing
>>>>> -fno-unsafe-math-optimizations) may give us a clue as to the origin of the
>>>>> performance delta between the stock dragonegg results with -ffast-math and
>>>>> those with -fplugin-arg-dragonegg-enable-gcc-optzns.
>>>>> Jack
>>>>>
>>>>
>>
>
>