<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body dir="auto"><div><div style="text-align: left;direction: ltr; "><span style="-webkit-text-size-adjust: auto;">Jack,</span></div><div style="text-align: left;direction: ltr; "><span style="-webkit-text-size-adjust: auto;"><br></span></div><div style="text-align: left;direction: ltr; "><span style="-webkit-text-size-adjust: auto;">Can you please file a bug report and attach the BC files for the major loops that we miss?</span></div><div style="text-align: left;direction: ltr; "><span style="-webkit-text-size-adjust: auto;"><br></span></div><div style="text-align: left;direction: ltr; "><span style="-webkit-text-size-adjust: auto;">Thanks,</span></div><div style="text-align: left;direction: ltr; "><span style="-webkit-text-size-adjust: auto;">Nadav</span></div><br><div style="text-align: right;direction: rtl; "><span style="-webkit-text-size-adjust: auto;"><br></span></div></div><div><span style="-webkit-text-size-adjust: auto;">On Jun 2, 2013, at 1:27, Duncan Sands &lt;<a href="mailto:duncan.sands@gmail.com">duncan.sands@gmail.com</a>&gt; wrote:</span><br><br></div><blockquote type="cite" style="-webkit-text-size-adjust: auto; "><div><span>Hi Jack, thanks for splitting out what the effects of LLVM's / GCC's vectorizers</span><br><span>are.</span><br><span></span><br><span>On 01/06/13 21:34, Jack Howarth wrote:</span><br><blockquote type="cite"><span>On Sat, Jun 01, 2013 at 06:45:48AM +0200, Duncan Sands wrote:</span><br></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>These results are very disappointing; I was hoping to see a big improvement</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>somewhere instead of no real improvement anywhere (except for gas_dyn) or a</span><br></blockquote></blockquote><blockquote type="cite"><blockquote 
type="cite"><span>regression (e.g. mdbx).  I think LLVM now has a reasonable array of fast-math</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>optimizations.  I will try to find time to poke at gas_dyn and induct: since</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>turning on gcc's optimizations there halves the run-time, LLVM's IR optimizers</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>are clearly missing something important.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Ciao, Duncan.</span><br></blockquote></blockquote><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span>Duncan,</span><br></blockquote><blockquote type="cite"><span>    Appended is another set of benchmark runs where I attempted to decouple the</span><br></blockquote><blockquote type="cite"><span>fast math optimizations from the vectorization by passing -fno-tree-vectorize.</span><br></blockquote><blockquote type="cite"><span>I am unclear if dragonegg really honors -fno-tree-vectorize to disable the LLVM</span><br></blockquote><blockquote type="cite"><span>vectorization.</span><br></blockquote><span></span><br><span>Yes, it does disable LLVM vectorization.</span><br><span></span><br><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span>Tested on x86_apple-darwin12</span><br></blockquote><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span>Compile Flags: -ffast-math -funroll-loops -O3 -fno-tree-vectorize</span><br></blockquote><span></span><br><span>Maybe -march=native would be a good addition.</span><br><span></span><br><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span>de-gfc48: /sw/lib/gcc4.8/bin/gfortran 
-fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs</span><br></blockquote><blockquote type="cite"><span>de-gfc48+optzns: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs -fplugin-arg-dragonegg-enable-gcc-optzns</span><br></blockquote><blockquote type="cite"><span>gfortran48: /sw/bin/gfortran-fsf-4.8</span><br></blockquote><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span>Run time (secs)</span><br></blockquote><span></span><br><span>What is the standard deviation for each benchmark?  If each run varies by ±5%</span><br><span>then that means that the changes in runtime of around 3% measured below don't</span><br><span>mean anything.</span><br><span></span><br><span></span><br><span>Comparing with your previous benchmarks, I see:</span><br><span></span><br><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span>Benchmark     de-gfc48  de-gfc48   gfortran48</span><br></blockquote><blockquote type="cite"><span>                         +optzns</span><br></blockquote><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span>ac             11.33      8.10       8.02</span><br></blockquote><span></span><br><span>Turning on LLVM's vectorizer gives a 2% slowdown.</span><br><span></span><br><blockquote type="cite"><span>aermod         16.03     14.45      16.13</span><br></blockquote><span></span><br><span>Turning on LLVM's vectorizer gives a 2.5% slowdown.</span><br><span></span><br><blockquote type="cite"><span>air             6.80      5.28       5.73</span><br></blockquote><blockquote type="cite"><span>capacita       39.89     35.21      34.96</span><br></blockquote><span></span><br><span>Turning on LLVM's vectorizer gives a 5% speedup.  
GCC gets a 5.5% speedup from</span><br><span>its vectorizer.</span><br><span></span><br><blockquote type="cite"><span>channel         2.06      2.29       2.69</span><br></blockquote><span></span><br><span>GCC gets a 30% speedup from its vectorizer which LLVM doesn't get.  On the</span><br><span>other hand, without vectorization LLVM's version runs 23% faster than GCC's, so</span><br><span>while GCC's vectorizer leaps GCC into the lead, the final speed difference is</span><br><span>more on the order of GCC being 10% faster.</span><br><span></span><br><blockquote type="cite"><span>doduc          27.35     26.13      25.74</span><br></blockquote><blockquote type="cite"><span>fatigue         8.83      4.82       4.67</span><br></blockquote><span></span><br><span>GCC gets a 17% speedup from its vectorizer which LLVM doesn't get.</span><br><span>This is a good one to look at, because all the difference between GCC</span><br><span>and LLVM is coming from the mid-level optimizers: turning on GCC optzns</span><br><span>in dragonegg speeds up the program to GCC levels, so it is possible to</span><br><span>get LLVM IR with and without the effect of GCC optimizations, which should</span><br><span>make it fairly easy to understand what GCC is doing right here.</span><br><span></span><br><blockquote type="cite"><span>gas_dyn        11.41      9.79       9.60</span><br></blockquote><span></span><br><span>Turning on LLVM's vectorizer gives a 30% speedup.  GCC gets a comparable</span><br><span>speedup from its vectorizer.</span><br><span></span><br><blockquote type="cite"><span>induct         23.95     21.75      21.14</span><br></blockquote><span></span><br><span>GCC gets a 40% speedup from its vectorizer which LLVM doesn't get.  
Like</span><br><span>fatigue, this is a case where we can get IR showing all the improvements that</span><br><span>the GCC optimizers made.</span><br><span></span><br><blockquote type="cite"><span>linpk          15.49     15.48      15.69</span><br></blockquote><blockquote type="cite"><span>mdbx           11.91     11.28      11.39</span><br></blockquote><span></span><br><span>Turning on LLVM's vectorizer gives a 2% slowdown.</span><br><span></span><br><blockquote type="cite"><span>nf             29.92     29.57      27.99</span><br></blockquote><blockquote type="cite"><span>protein        36.34     33.94      31.91</span><br></blockquote><span></span><br><span>Turning on LLVM's vectorizer gives a 3% speedup.</span><br><span></span><br><blockquote type="cite"><span>rnflow         25.97     25.27      22.78</span><br></blockquote><span></span><br><span>GCC gets a 7% speedup from its vectorizer which LLVM doesn't get.</span><br><span></span><br><blockquote type="cite"><span>test_fpu       11.48     10.91       9.64</span><br></blockquote><span></span><br><span>GCC gets a 17% speedup from its vectorizer which LLVM doesn't get.</span><br><span></span><br><blockquote type="cite"><span>tfft            1.92      1.91       1.91</span><br></blockquote><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span>Geom. 
Mean     13.12     11.70      11.64</span><br></blockquote><span></span><br><span>Ciao, Duncan.</span><br><span></span><br><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span>Assuming that the de-gfc48+optzns run really has disabled the LLVM vectorization,</span><br></blockquote><blockquote type="cite"><span>I am hoping that additional benchmarking of de-gfc48+optzns with individual</span><br></blockquote><blockquote type="cite"><span>-ffast-math optimizations disabled (such as passing -fno-unsafe-math-optimizations)</span><br></blockquote><blockquote type="cite"><span>may give us a clue as to the origin of the performance delta between the stock</span><br></blockquote><blockquote type="cite"><span>dragonegg results with -ffast-math and those with -fplugin-arg-dragonegg-enable-gcc-optzns.</span><br></blockquote><blockquote type="cite"><span>       Jack</span><br></blockquote><blockquote type="cite"><span></span><br></blockquote><span></span><br></div></blockquote></body></html>