[LLVMdev] Polyhedron 2005 results for dragonegg 3.3svn
Shuxin Yang
shuxin.llvm at gmail.com
Mon Jun 3 10:12:59 PDT 2013
Actually, this kind of opportunity, as outlined below, was one of my
contrived motivating examples for fast-math. But last year we didn't see
such opportunities in the real applications we care about.
t1 = x1/y
...
t2 = x2/y
I think this is better taken care of by GVN/PRE -- blindly converting
x/y => x * (1/y) is not necessarily
beneficial. Or maybe we can blindly perform such a transformation at an
early stage, and later
convert it back if the reciprocals are not CSEed away.
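The trade-off can be sketched in a few lines of Python (values and function names are invented for illustration; they are not from the benchmark). The rewritten form does one division and n multiplications instead of n divisions, but it rounds twice per element, which is why it is only legal under fast-math:

```python
def divide_each(xs, y):
    # Original form: one (expensive) division per element.
    return [x / y for x in xs]

def divide_each_recip(xs, y):
    # Rewritten form: a single division for the reciprocal,
    # then one cheap multiplication per element.
    r = 1.0 / y
    return [x * r for x in xs]

xs = [0.3, 1.7, 2.5, 9.1]
y = 3.0

a = divide_each(xs, y)
b = divide_each_recip(xs, y)

# The two forms are close but not guaranteed bit-identical, since
# x * (1/y) rounds twice; hence the rewrite needs fast-math.
print(max(abs(p - q) for p, q in zip(a, b)))
```

Note that the rewrite only pays off when the reciprocal is reused; for a single division it just adds a multiply, which is the reason for the "convert it back if not CSEed" idea above.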
On 6/3/13 8:53 AM, Duncan Sands wrote:
> Hi Nadav,
>
> On 02/06/13 19:08, Nadav Rotem wrote:
>> Jack,
>>
>> Can you please file a bug report and attach the BC files for the
>> major loops
>> that we miss ?
>
> I took a look and it's not clear what vectorization has to do with it;
> it seems
> to be a missed fast-math optimization. I've attached bitcode where
> only LLVM
> optimizations are run (fatigue0.ll) and where GCC optimizations are
> run before
> LLVM optimizations (fatigue1.ll). The hottest instruction is the same
> in both:
>
> fatigue0.ll:
> %329 = fsub fast double %327, %328, !dbg !1077
>
> fatigue1.ll:
> %1504 = fsub fast double %1501, %1503, !dbg !1148
>
> However in the GCC version it is twice as hot as in the LLVM only
> version,
> i.e. in the LLVM only version instructions elsewhere are consuming a
> lot of
> time. In the LLVM only version there are 9 fdiv instructions in that
> basic
> block while GCC has only one. From the profile it looks like each of
> them is
> consuming quite some time, and all together they chew up a lot of
> time. I
> think this explains the speed difference.
>
> All of the fdiv's have the same denominator:
> %260 = fdiv fast double %253, %259
> ...
> %262 = fdiv fast double %219, %259
> ...
> %264 = fdiv fast double %224, %259
> ...
> %266 = fdiv fast double %230, %259
> and so on. It looks like GCC takes the reciprocal
> %1445 = fdiv fast double 1.000000e+00, %1439
> and then turns the fdiv's into fmul's.
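In IR terms, the rewrite being described looks roughly like the following (a sketch only: register names are invented and do not come from the actual bitcode):

```llvm
; Before (as in fatigue0.ll): one fdiv per use of the common denominator %d.
%q1 = fdiv fast double %a, %d
%q2 = fdiv fast double %b, %d
%q3 = fdiv fast double %c, %d

; After (what GCC effectively produces): one reciprocal, then fmuls.
%r    = fdiv fast double 1.000000e+00, %d
%q1.r = fmul fast double %a, %r
%q2.r = fmul fast double %b, %r
%q3.r = fmul fast double %c, %r
```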
>
> I'm not sure what the best way to implement this optimization in LLVM
> is. Maybe
> Shuxin has some ideas.
>
> So it looks like a missed fast-math optimization rather than anything
> to do with
> vectorization, which is strange as GCC only gets the big speedup when
> vectorization is turned on.
>
> Ciao, Duncan.
>
>>
>> Thanks,
>> Nadav
>>
>>
>> On Jun 2, 2013, at 1:27, Duncan Sands <duncan.sands at gmail.com
>> <mailto:duncan.sands at gmail.com>> wrote:
>>
>>> Hi Jack, thanks for splitting out what the effects of LLVM's / GCC's
>>> vectorizers
>>> are.
>>>
>>> On 01/06/13 21:34, Jack Howarth wrote:
>>>> On Sat, Jun 01, 2013 at 06:45:48AM +0200, Duncan Sands wrote:
>>>>>
>>>>> These results are very disappointing, I was hoping to see a big
>>>>> improvement
>>>>> somewhere instead of no real improvement anywhere (except for
>>>>> gas_dyn) or a
>>>>> regression (eg: mdbx). I think LLVM now has a reasonable array of
>>>>> fast-math
>>>>> optimizations. I will try to find time to poke at gas_dyn and
>>>>> induct: since
>>>>> turning on gcc's optimizations there halves the run-time, LLVM's IR
>>>>> optimizers
>>>>> are clearly missing something important.
>>>>>
>>>>> Ciao, Duncan.
>>>>
>>>> Duncan,
>>>> Appended are another set of benchmark runs where I attempted to
>>>> decouple the
>>>> fast math optimizations from the vectorization by passing
>>>> -fno-tree-vectorize.
>>>> I am unclear whether dragonegg really honors -fno-tree-vectorize to
>>>> disable the llvm
>>>> vectorization.
>>>
>>> Yes, it does disable LLVM vectorization.
>>>
>>>>
>>>> Tested on x86_apple-darwin12
>>>>
>>>> Compile Flags: -ffast-math -funroll-loops -O3 -fno-tree-vectorize
>>>
>>> Maybe -march=native would be a good addition.
>>>
>>>>
>>>> de-gfc48: /sw/lib/gcc4.8/bin/gfortran
>>>> -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so
>>>> -specs=/sw/lib/gcc4.8/lib/integrated-as.specs
>>>> de-gfc48+optzns: /sw/lib/gcc4.8/bin/gfortran
>>>> -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so
>>>> -specs=/sw/lib/gcc4.8/lib/integrated-as.specs
>>>> -fplugin-arg-dragonegg-enable-gcc-optzns
>>>> gfortran48: /sw/bin/gfortran-fsf-4.8
>>>>
>>>> Run time (secs)
>>>
>>> What is the standard deviation for each benchmark? If each run
>>> varies by +-5%
>>> then that means that the changes in runtime of around 3% measured
>>> below don't
>>> mean anything.
>>>
>>>
>>> Comparing with your previous benchmarks, I see:
>>>
>>>>
>>>> Benchmark de-gfc48 de-gfc48 gfortran48
>>>> +optzns
>>>>
>>>> ac 11.33 8.10 8.02
>>>
>>> Turning on LLVM's vectorizer gives a 2% slowdown.
>>>
>>>> aermod 16.03 14.45 16.13
>>>
>>> Turning on LLVM's vectorizer gives a 2.5% slowdown.
>>>
>>>> air 6.80 5.28 5.73
>>>> capacita 39.89 35.21 34.96
>>>
>>> Turning on LLVM's vectorizer gives a 5% speedup. GCC gets a 5.5%
>>> speedup from
>>> its vectorizer.
>>>
>>>> channel 2.06 2.29 2.69
>>>
>>> GCC gets a 30% speedup from its vectorizer which LLVM doesn't
>>> get. On the
>>> other hand, without vectorization LLVM's version runs 23% faster
>>> than GCC's, so
>>> while GCC's vectorizer puts GCC into the lead, the final speed
>>> difference is
>>> more on the order of GCC being 10% faster.
>>>
>>>> doduc 27.35 26.13 25.74
>>>> fatigue 8.83 4.82 4.67
>>>
>>> GCC gets a 17% speedup from its vectorizer which LLVM doesn't get.
>>> This is a good one to look at, because all the difference between GCC
>>> and LLVM is coming from the mid-level optimizers: turning on GCC optzns
>>> in dragonegg speeds up the program to GCC levels, so it is possible to
>>> get LLVM IR with and without the effect of GCC optimizations, which
>>> should
>>> make it fairly easy to understand what GCC is doing right here.
>>>
>>>> gas_dyn 11.41 9.79 9.60
>>>
>>> Turning on LLVM's vectorizer gives a 30% speedup. GCC gets a
>>> comparable
>>> speedup from its vectorizer.
>>>
>>>> induct 23.95 21.75 21.14
>>>
>>> GCC gets a 40% speedup from its vectorizer which LLVM doesn't
>>> get. Like
>>> fatigue, this is a case where we can get IR showing all the
>>> improvements that
>>> the GCC optimizers made.
>>>
>>>> linpk 15.49 15.48 15.69
>>>> mdbx 11.91 11.28 11.39
>>>
>>> Turning on LLVM's vectorizer gives a 2% slowdown.
>>>
>>>> nf 29.92 29.57 27.99
>>>> protein 36.34 33.94 31.91
>>>
>>> Turning on LLVM's vectorizer gives a 3% speedup.
>>>
>>>> rnflow 25.97 25.27 22.78
>>>
>>> GCC gets a 7% speedup from its vectorizer which LLVM doesn't get.
>>>
>>>> test_fpu 11.48 10.91 9.64
>>>
>>> GCC gets a 17% speedup from its vectorizer which LLVM doesn't get.
>>>
>>>> tfft 1.92 1.91 1.91
>>>>
>>>> Geom. Mean 13.12 11.70 11.64
>>>
>>> Ciao, Duncan.
>>>
>>>>
>>>> Assuming that the de-gfc48+optzns run really has disabled the llvm
>>>> vectorization,
>>>> I am hoping that additional benchmarking of de-gfc48+optzns with
>>>> individual
>>>> -ffast-math optimizations disabled (such as passing
>>>> -fno-unsafe-math-optimizations)
>>>> may give us a clue as to the origin of the performance delta
>>>> between the stock
>>>> dragonegg results with -ffast-math and those with
>>>> -fplugin-arg-dragonegg-enable-gcc-optzns.
>>>> Jack
>>>>
>>>
>