<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">On CINT2006 ARM64/ref input/lto+pgo I measure practically no performance difference for the 7 benchmarks that compile. This includes bzip2 (although with a different source base than in CINT2000), mcf, hmmer, sjeng, h264ref, astar, and xalancbmk.<div><br><div><div>On Sep 15, 2014, at 11:59 AM, Hal Finkel <<a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>> wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div style="font-size: 18px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">----- Original Message -----<br><blockquote type="cite">From: "Gerolf Hoflehner" <<a href="mailto:ghoflehner@apple.com">ghoflehner@apple.com</a>><br>To: "Jiangning Liu" <<a href="mailto:liujiangning1@gmail.com">liujiangning1@gmail.com</a>>, "George Burgess IV" <<a href="mailto:george.burgess.iv@gmail.com">george.burgess.iv@gmail.com</a>>, "Hal Finkel"<br><<a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>><br>Cc: "LLVM Dev" <<a href="mailto:llvmdev@cs.uiuc.edu">llvmdev@cs.uiuc.edu</a>><br>Sent: Sunday, September 14, 2014 12:15:02 AM<br>Subject: Re: [LLVMdev] Testing the new CFL alias analysis<br><br>In lto+pgo, some of the CINT2006 benchmarks (5 out of 12, with the usual<br>suspects like perlbench and gcc among them, using -flto -Wl,-mllvm,-use-cfl-aa<br>-Wl,-mllvm,-use-cfl-aa-in-codegen) don’t<br>compile.<br></blockquote><br>On what platform? 
Could you bugpoint it and file a report?<br></div></blockquote>Ok, I’ll see if I can get a small test case.<br><blockquote type="cite"><div style="font-size: 18px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;"><br><blockquote type="cite">Has the implementation been tested with lto?<br></blockquote><br>I've not tested it with LTO.<br><br><blockquote type="cite">If not, please<br>stress the implementation more.<br>Do we know reasons for gains? Where did you expect the biggest gains?<br></blockquote><br>I don't want to make a global statement here. My expectation is that we'll see wins from increasing register pressure ;) -- hoisting more loads out of loops (there are certainly cases involving multiple levels of dereferencing and insert/extract instructions where CFL can provide a NoAlias answer where BasicAA gives up). Obviously, we'll also have problems if we increase pressure too much.<br></div></blockquote><div>Maybe, but I'd prefer the OoO HW to handle hoisting; it is hard to tune in the compiler. 
</div><div>I’m also curious about the impact on loop transformations.</div><blockquote type="cite"><div style="font-size: 18px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;"><br><blockquote type="cite">Some of the losses will likely boil down to increased register<br>pressure.<br></blockquote><br>Agreed.<br><br><blockquote type="cite"><br><br>Looks like the current performance numbers pose a good challenge for<br>gaining new and refreshing insights into our heuristics (and for<br>smoothing out the implementation along the way).<br></blockquote><br>It certainly seems that way.<br><br>Thanks again,<br>Hal<br><br><blockquote type="cite"><br><br>Cheers<br>Gerolf<br><br><br>On Sep 12, 2014, at 1:27 AM, Jiangning Liu < <a href="mailto:liujiangning1@gmail.com">liujiangning1@gmail.com</a><br><blockquote type="cite">wrote:<br></blockquote><br><br><br>Hi Hal,<br><br>I ran SPEC2000 on cortex-a57 (AArch64), and got the following<br>results<br><br>(the numbers are the run-time change, so negative is better<br>performance):<br><br>spec.cpu2000.ref.183_equake 33.77%<br>spec.cpu2000.ref.179_art 13.44%<br>spec.cpu2000.ref.256_bzip2 7.80%<br>spec.cpu2000.ref.186_crafty 3.69%<br>spec.cpu2000.ref.175_vpr 2.96%<br>spec.cpu2000.ref.176_gcc 1.77%<br>spec.cpu2000.ref.252_eon 1.77%<br>spec.cpu2000.ref.254_gap 1.19%<br>spec.cpu2000.ref.197_parser 1.15%<br>spec.cpu2000.ref.253_perlbmk 1.11%<br>spec.cpu2000.ref.300_twolf -1.04%<br><br>So we can see that almost all of them got worse performance.<br><br>The command-line options I'm using are "-O3 -std=gnu89 -ffast-math<br>-fslp-vectorize -fvectorize -mcpu=cortex-a57 -mllvm -use-cfl-aa<br>-mllvm -use-cfl-aa-in-codegen".<br><br>I didn't try compile time, and I think your test on a POWER7 native<br>build should 
already mean something for other hosts. Also, I don't<br>have a good benchmark suite for compile-time testing. My past<br>experience showed that neither the llvm-test-suite (single/multiple) nor the SPEC<br>benchmarks are good benchmarks for compile-time testing.<br><br>Thanks,<br>-Jiangning<br><br><br>2014-09-04 1:11 GMT+08:00 Hal Finkel < <a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a> > :<br><br><br>Hello everyone,<br><br>One of Google's summer interns, George Burgess IV, created an<br>implementation of the CFL pointer-aliasing analysis algorithm, and<br>this has now been added to LLVM trunk. Now we should determine<br>whether it is worthwhile adding this to the default optimization<br>pipeline. For ease of testing, I've added the command-line option<br>-use-cfl-aa, which will cause the CFL analysis to be added to the<br>optimization pipeline. This can be used with the opt program, and<br>also via Clang by passing: -mllvm -use-cfl-aa.<br><br>For the purpose of testing with those targets that make use of<br>aliasing analysis during code generation, there is also a<br>corresponding -use-cfl-aa-in-codegen option.<br><br>Running the test suite on one of our IBM POWER7 systems (comparing<br>-O3 -mcpu=native to -O3 -mcpu=native -mllvm -use-cfl-aa -mllvm<br>-use-cfl-aa-in-codegen [results without use in code generation were<br>essentially the same]), I see no significant compile-time changes,<br>and the following performance results:<br>speedup:<br>MultiSource/Benchmarks/mafft/pairlocalalign: -11.5862% +/- 5.9257%<br><br>slowdown:<br>MultiSource/Benchmarks/FreeBench/neural/neural: 158.679% +/- 22.3212%<br>MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset:<br>0.627176% +/- 0.290698%<br>MultiSource/Benchmarks/Ptrdist/ks/ks: 57.5457% +/- 21.8869%<br><br>I ran the test suite 20 times in each configuration, using make -j48<br>each time, so I'll only pick up large changes. 
I've not yet<br>investigated the cause of the slowdowns (or the speedup), and I<br>really need people to try this on x86, ARM, etc. It appears, however,<br>that the better aliasing-analysis results might have some negative<br>unintended consequences, and we'll need to look at those closely.<br><br>Please let me know how this fares on your systems!<br><br>Thanks again,<br>Hal<br><br>--<br>Hal Finkel<br>Assistant Computational Scientist<br>Leadership Computing Facility<br>Argonne National Laboratory<br>_______________________________________________<br>LLVM Developers mailing list<br><a href="mailto:LLVMdev@cs.uiuc.edu">LLVMdev@cs.uiuc.edu</a> <a href="http://llvm.cs.uiuc.edu">http://llvm.cs.uiuc.edu</a><br><a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a><br><br><br></blockquote><br>--<span class="Apple-converted-space"> </span><br>Hal Finkel<br>Assistant Computational Scientist<br>Leadership Computing Facility<br>Argonne National Laboratory</div></blockquote></div><br></div></body></html>