[LLVMdev] Polyhedron 2005 results for dragonegg 3.3svn
Duncan Sands
duncan.sands at gmail.com
Tue Jun 4 08:34:04 PDT 2013
Hi Shuxin,
On 03/06/13 19:12, Shuxin Yang wrote:
> Actually this kind of opportunity, as outlined below, was one of my contrived
> motivating examples for fast-math. But last year we didn't see such
> opportunities in the real applications we care about.
>
> t1 = x1/y
> ...
> t2 = x2/y.
>
> I think it is better taken care of by GVN/PRE -- blindly converting x/y =>
> x * (1/y) is not necessarily beneficial. Or maybe we can blindly perform such a
> transformation at an early stage, and later convert it back if the reciprocals
> are not CSEed away.
I've opened PR16218 to track this.
Ciao, Duncan.
>
>
> On 6/3/13 8:53 AM, Duncan Sands wrote:
>> Hi Nadav,
>>
>> On 02/06/13 19:08, Nadav Rotem wrote:
>>> Jack,
>>>
>>> Can you please file a bug report and attach the BC files for the major loops
>>> that we miss ?
>>
>> I took a look, and it's not clear what vectorization has to do with it; it
>> seems to be a missed fast-math optimization. I've attached bitcode where only
>> LLVM optimizations are run (fatigue0.ll) and where GCC optimizations are run
>> before LLVM optimizations (fatigue1.ll). The hottest instruction is the same
>> in both:
>>
>> fatigue0.ll:
>> %329 = fsub fast double %327, %328, !dbg !1077
>>
>> fatigue1.ll:
>> %1504 = fsub fast double %1501, %1503, !dbg !1148
>>
>> However, in the GCC version it is twice as hot as in the LLVM-only version,
>> i.e. in the LLVM-only version instructions elsewhere are consuming a lot of
>> time. In the LLVM-only version there are 9 fdiv instructions in that basic
>> block, while GCC has only one. From the profile it looks like each of them
>> consumes quite some time, and together they chew up a lot of it. I think
>> this explains the speed difference.
>>
>> All of the fdiv's have the same denominator:
>> %260 = fdiv fast double %253, %259
>> ...
>> %262 = fdiv fast double %219, %259
>> ...
>> %264 = fdiv fast double %224, %259
>> ...
>> %266 = fdiv fast double %230, %259
>> and so on. It looks like GCC takes the reciprocal
>> %1445 = fdiv fast double 1.000000e+00, %1439
>> and then turns the fdiv's into fmul's.
>>
>> I'm not sure what the best way to implement this optimization in LLVM is. Maybe
>> Shuxin has some ideas.
>>
>> So it looks like a missed fast-math optimization rather than anything to do with
>> vectorization, which is strange as GCC only gets the big speedup when
>> vectorization is turned on.
>>
>> Ciao, Duncan.
>>
>>>
>>> Thanks,
>>> Nadav
>>>
>>>
>>> On Jun 2, 2013, at 1:27, Duncan Sands <duncan.sands at gmail.com
>>> <mailto:duncan.sands at gmail.com>> wrote:
>>>
>>>> Hi Jack, thanks for splitting out what the effect of LLVM's / GCC's
>>>> vectorizers is.
>>>>
>>>> On 01/06/13 21:34, Jack Howarth wrote:
>>>>> On Sat, Jun 01, 2013 at 06:45:48AM +0200, Duncan Sands wrote:
>>>>>>
>>>>>> These results are very disappointing; I was hoping to see a big improvement
>>>>>> somewhere instead of no real improvement anywhere (except for gas_dyn) or a
>>>>>> regression (e.g. mdbx). I think LLVM now has a reasonable array of fast-math
>>>>>> optimizations. I will try to find time to poke at gas_dyn and induct: since
>>>>>> turning on gcc's optimizations there halves the run-time, LLVM's IR
>>>>>> optimizers are clearly missing something important.
>>>>>>
>>>>>> Ciao, Duncan.
>>>>>
>>>>> Duncan,
>>>>> Appended is another set of benchmark runs where I attempted to decouple the
>>>>> fast-math optimizations from the vectorization by passing -fno-tree-vectorize.
>>>>> I am unclear if dragonegg really honors -fno-tree-vectorize to disable the
>>>>> llvm vectorization.
>>>>
>>>> Yes, it does disable LLVM vectorization.
>>>>
>>>>>
>>>>> Tested on x86_apple-darwin12
>>>>>
>>>>> Compile Flags: -ffast-math -funroll-loops -O3 -fno-tree-vectorize
>>>>
>>>> Maybe -march=native would be a good addition.
>>>>
>>>>>
>>>>> de-gfc48: /sw/lib/gcc4.8/bin/gfortran
>>>>> -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so
>>>>> -specs=/sw/lib/gcc4.8/lib/integrated-as.specs
>>>>> de-gfc48+optzns: /sw/lib/gcc4.8/bin/gfortran
>>>>> -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so
>>>>> -specs=/sw/lib/gcc4.8/lib/integrated-as.specs
>>>>> -fplugin-arg-dragonegg-enable-gcc-optzns
>>>>> gfortran48: /sw/bin/gfortran-fsf-4.8
>>>>>
>>>>> Run time (secs)
>>>>
>>>> What is the standard deviation for each benchmark? If each run varies by
>>>> +-5%, then the changes in runtime of around 3% measured below don't mean
>>>> anything.
>>>>
>>>>
>>>> Comparing with your previous benchmarks, I see:
>>>>
>>>>>
>>>>> Benchmark de-gfc48 de-gfc48 gfortran48
>>>>> +optzns
>>>>>
>>>>> ac 11.33 8.10 8.02
>>>>
>>>> Turning on LLVM's vectorizer gives a 2% slowdown.
>>>>
>>>>> aermod 16.03 14.45 16.13
>>>>
>>>> Turning on LLVM's vectorizer gives a 2.5% slowdown.
>>>>
>>>>> air 6.80 5.28 5.73
>>>>> capacita 39.89 35.21 34.96
>>>>
>>>> Turning on LLVM's vectorizer gives a 5% speedup. GCC gets a 5.5% speedup from
>>>> its vectorizer.
>>>>
>>>>> channel 2.06 2.29 2.69
>>>>
>>>> GCC gets a 30% speedup from its vectorizer which LLVM doesn't get. On the
>>>> other hand, without vectorization LLVM's version runs 23% faster than GCC's,
>>>> so while GCC's vectorizer leaps GCC into the lead, the final speed difference
>>>> is more on the order of GCC being 10% faster.
>>>>
>>>>> doduc 27.35 26.13 25.74
>>>>> fatigue 8.83 4.82 4.67
>>>>
>>>> GCC gets a 17% speedup from its vectorizer which LLVM doesn't get.
>>>> This is a good one to look at, because all the difference between GCC
>>>> and LLVM is coming from the mid-level optimizers: turning on GCC optzns
>>>> in dragonegg speeds up the program to GCC levels, so it is possible to
>>>> get LLVM IR with and without the effect of GCC optimizations, which should
>>>> make it fairly easy to understand what GCC is doing right here.
>>>>
>>>>> gas_dyn 11.41 9.79 9.60
>>>>
>>>> Turning on LLVM's vectorizer gives a 30% speedup. GCC gets a comparable
>>>> speedup from its vectorizer.
>>>>
>>>>> induct 23.95 21.75 21.14
>>>>
>>>> GCC gets a 40% speedup from its vectorizer which LLVM doesn't get. Like
>>>> fatigue, this is a case where we can get IR showing all the improvements that
>>>> the GCC optimizers made.
>>>>
>>>>> linpk 15.49 15.48 15.69
>>>>> mdbx 11.91 11.28 11.39
>>>>
>>>> Turning on LLVM's vectorizer gives a 2% slowdown.
>>>>
>>>>> nf 29.92 29.57 27.99
>>>>> protein 36.34 33.94 31.91
>>>>
>>>> Turning on LLVM's vectorizer gives a 3% speedup.
>>>>
>>>>> rnflow 25.97 25.27 22.78
>>>>
>>>> GCC gets a 7% speedup from its vectorizer which LLVM doesn't get.
>>>>
>>>>> test_fpu 11.48 10.91 9.64
>>>>
>>>> GCC gets a 17% speedup from its vectorizer which LLVM doesn't get.
>>>>
>>>>> tfft 1.92 1.91 1.91
>>>>>
>>>>> Geom. Mean 13.12 11.70 11.64
>>>>
>>>> Ciao, Duncan.
>>>>
>>>>>
>>>>> Assuming that the de-gfc48+optzns run really has disabled the llvm
>>>>> vectorization, I am hoping that additional benchmarking of de-gfc48+optzns
>>>>> with individual -ffast-math optimizations disabled (such as passing
>>>>> -fno-unsafe-math-optimizations) may give us a clue as to the origin of the
>>>>> performance delta between the stock dragonegg results with -ffast-math and
>>>>> those with -fplugin-arg-dragonegg-enable-gcc-optzns.
>>>>> Jack
>>>>>
>>>>
>>
>
>