<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body dir="auto"><div><div style="text-align: left;direction: ltr; "><span style="-webkit-text-size-adjust: auto;">Jack,</span></div><div style="text-align: left;direction: ltr; "><span style="-webkit-text-size-adjust: auto;"><br></span></div><div style="text-align: left;direction: ltr; "><span style="-webkit-text-size-adjust: auto;">Can you please file a bug report and attach the BC files for the major loops that we miss?</span></div><div style="text-align: left;direction: ltr; "><span style="-webkit-text-size-adjust: auto;"><br></span></div><div style="text-align: left;direction: ltr; "><span style="-webkit-text-size-adjust: auto;">Thanks,</span></div><div style="text-align: left;direction: ltr; "><span style="-webkit-text-size-adjust: auto;">Nadav</span></div><br><div style="text-align: right;direction: rtl; "><span style="-webkit-text-size-adjust: auto;"><br></span></div></div><div><span style="-webkit-text-size-adjust: auto;">On Jun 2, 2013, at 1:27, Duncan Sands &lt;<a href="mailto:duncan.sands@gmail.com">duncan.sands@gmail.com</a>&gt; wrote:</span><br><br></div><blockquote type="cite" style="-webkit-text-size-adjust: auto; "><div><span>Hi Jack, thanks for splitting out what the effects of LLVM's / GCC's vectorizers</span><br><span>are.</span><br><span></span><br><span>On 01/06/13 21:34, Jack Howarth wrote:</span><br><blockquote type="cite"><span>On Sat, Jun 01, 2013 at 06:45:48AM +0200, Duncan Sands wrote:</span><br></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>These results are very disappointing, I was hoping to see a big improvement</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>somewhere instead of no real improvement anywhere (except for gas_dyn) or a</span><br></blockquote></blockquote><blockquote type="cite"><blockquote 
type="cite"><span>regression (eg: mdbx). I think LLVM now has a reasonable array of fast-math</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>optimizations. I will try to find time to poke at gas_dyn and induct: since</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>turning on gcc's optimizations there halves the run-time, LLVM's IR optimizers</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>are clearly missing something important.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Ciao, Duncan.</span><br></blockquote></blockquote><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span>Duncan,</span><br></blockquote><blockquote type="cite"><span> Appended is another set of benchmark runs where I attempted to decouple the</span><br></blockquote><blockquote type="cite"><span>fast math optimizations from the vectorization by passing -fno-tree-vectorize.</span><br></blockquote><blockquote type="cite"><span>I am unsure whether dragonegg really honors -fno-tree-vectorize to disable the llvm</span><br></blockquote><blockquote type="cite"><span>vectorization.</span><br></blockquote><span></span><br><span>Yes, it does disable LLVM vectorization.</span><br><span></span><br><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span>Tested on x86_apple-darwin12</span><br></blockquote><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span>Compile Flags: -ffast-math -funroll-loops -O3 -fno-tree-vectorize</span><br></blockquote><span></span><br><span>Maybe -march=native would be a good addition.</span><br><span></span><br><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span>de-gfc48: /sw/lib/gcc4.8/bin/gfortran 
-fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs</span><br></blockquote><blockquote type="cite"><span>de-gfc48+optzns: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.spec</span><br></blockquote><blockquote type="cite"><span>s -fplugin-arg-dragonegg-enable-gcc-optzns</span><br></blockquote><blockquote type="cite"><span>gfortran48: /sw/bin/gfortran-fsf-4.8</span><br></blockquote><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span>Run time (secs)</span><br></blockquote><span></span><br><span>What is the standard deviation for each benchmark? If each run varies by +-5%</span><br><span>then that means that the changes in runtime of around 3% measured below don't</span><br><span>mean anything.</span><br><span></span><br><span></span><br><span>Comparing with your previous benchmarks, I see:</span><br><span></span><br><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span>Benchmark de-gfc48 de-gfc48 gfortran48</span><br></blockquote><blockquote type="cite"><span> +optzns</span><br></blockquote><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span>ac 11.33 8.10 8.02</span><br></blockquote><span></span><br><span>Turning on LLVM's vectorizer gives a 2% slowdown.</span><br><span></span><br><blockquote type="cite"><span>aermod 16.03 14.45 16.13</span><br></blockquote><span></span><br><span>Turning on LLVM's vectorizer gives a 2.5% slowdown.</span><br><span></span><br><blockquote type="cite"><span>air 6.80 5.28 5.73</span><br></blockquote><blockquote type="cite"><span>capacita 39.89 35.21 34.96</span><br></blockquote><span></span><br><span>Turning on LLVM's vectorizer gives a 5% speedup. 
GCC gets a 5.5% speedup from</span><br><span>its vectorizer.</span><br><span></span><br><blockquote type="cite"><span>channel 2.06 2.29 2.69</span><br></blockquote><span></span><br><span>GCC gets a 30% speedup from its vectorizer, which LLVM doesn't get. On the</span><br><span>other hand, without vectorization LLVM's version runs 23% faster than GCC's, so</span><br><span>while GCC's vectorizer puts GCC in the lead, the final speed difference is</span><br><span>more on the order of GCC being 10% faster.</span><br><span></span><br><blockquote type="cite"><span>doduc 27.35 26.13 25.74</span><br></blockquote><blockquote type="cite"><span>fatigue 8.83 4.82 4.67</span><br></blockquote><span></span><br><span>GCC gets a 17% speedup from its vectorizer, which LLVM doesn't get.</span><br><span>This is a good one to look at, because all of the difference between GCC</span><br><span>and LLVM is coming from the mid-level optimizers: turning on GCC optzns</span><br><span>in dragonegg speeds up the program to GCC levels, so it is possible to</span><br><span>get LLVM IR with and without the effect of GCC optimizations, which should</span><br><span>make it fairly easy to understand what GCC is doing right here.</span><br><span></span><br><blockquote type="cite"><span>gas_dyn 11.41 9.79 9.60</span><br></blockquote><span></span><br><span>Turning on LLVM's vectorizer gives a 30% speedup. GCC gets a comparable</span><br><span>speedup from its vectorizer.</span><br><span></span><br><blockquote type="cite"><span>induct 23.95 21.75 21.14</span><br></blockquote><span></span><br><span>GCC gets a 40% speedup from its vectorizer, which LLVM doesn't get. 
Like</span><br><span>fatigue, this is a case where we can get IR showing all the improvements that</span><br><span>the GCC optimizers made.</span><br><span></span><br><blockquote type="cite"><span>linpk 15.49 15.48 15.69</span><br></blockquote><blockquote type="cite"><span>mdbx 11.91 11.28 11.39</span><br></blockquote><span></span><br><span>Turning on LLVM's vectorizer gives a 2% slowdown.</span><br><span></span><br><blockquote type="cite"><span>nf 29.92 29.57 27.99</span><br></blockquote><blockquote type="cite"><span>protein 36.34 33.94 31.91</span><br></blockquote><span></span><br><span>Turning on LLVM's vectorizer gives a 3% speedup.</span><br><span></span><br><blockquote type="cite"><span>rnflow 25.97 25.27 22.78</span><br></blockquote><span></span><br><span>GCC gets a 7% speedup from its vectorizer, which LLVM doesn't get.</span><br><span></span><br><blockquote type="cite"><span>test_fpu 11.48 10.91 9.64</span><br></blockquote><span></span><br><span>GCC gets a 17% speedup from its vectorizer, which LLVM doesn't get.</span><br><span></span><br><blockquote type="cite"><span>tfft 1.92 1.91 1.91</span><br></blockquote><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span>Geom. 
Mean 13.12 11.70 11.64</span><br></blockquote><span></span><br><span>Ciao, Duncan.</span><br><span></span><br><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span>Assuming that the de-gfc48+optzns run really has disabled the LLVM vectorization,</span><br></blockquote><blockquote type="cite"><span>I am hoping that additional benchmarking of de-gfc48+optzns with individual</span><br></blockquote><blockquote type="cite"><span>-ffast-math optimizations disabled (such as passing -fno-unsafe-math-optimizations)</span><br></blockquote><blockquote type="cite"><span>may give us a clue as to the origin of the performance delta between the stock</span><br></blockquote><blockquote type="cite"><span>dragonegg results with -ffast-math and those with -fplugin-arg-dragonegg-enable-gcc-optzns.</span><br></blockquote><blockquote type="cite"><span> Jack</span><br></blockquote><blockquote type="cite"><span></span><br></blockquote><span></span><br></div></blockquote></body></html>