<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">I see. I think it’d help with the upstreaming effort to have some more concrete details about performance measurements, so that potential adopters can get a rough understanding of the expected impact. In particular, if you could share:<div class=""><br class=""><div class="">- a run-time performance comparison over a representative subset of benchmarks from LNT (aarch64/-Oz), taken from a stabilized device</div><div class="">- some explanation for any performance differences seen in ^</div><div class="">- ditto for a code size comparison over LNT</div><div class="">- some brief explanation of the methodology used to measure app startup time and the # of page faults before app startup completes</div><div class=""><br class=""></div><div class="">That would be very valuable.</div><div class=""><br class=""></div><div class="">best,</div><div class="">vedant</div><div class=""><div><br class=""><blockquote type="cite" class=""><div class="">On Mar 24, 2020, at 2:04 PM, Kyungwoo Lee <<a href="mailto:kyulee.llvm@gmail.com" class="">kyulee.llvm@gmail.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class="">Hi <span style="color:rgb(32,33,36);font-size:0.875rem;letter-spacing:0.2px;font-family:Roboto,RobotoDraft,Helvetica,Arial,sans-serif;white-space:nowrap" class="">Vedant,</span><div class=""><span style="color:rgb(32,33,36);font-size:0.875rem;letter-spacing:0.2px;font-family:Roboto,RobotoDraft,Helvetica,Arial,sans-serif;white-space:nowrap" class=""><br class=""></span></div><div class=""><span style="color:rgb(32,33,36);font-size:0.875rem;letter-spacing:0.2px;font-family:Roboto,RobotoDraft,Helvetica,Arial,sans-serif;white-space:nowrap" class="">Thanks for your interest and comment.</span></div><div class=""><span 
style="color:rgb(32,33,36);font-size:0.875rem;letter-spacing:0.2px;font-family:Roboto,RobotoDraft,Helvetica,Arial,sans-serif;white-space:nowrap" class="">Size optimization generally reduces page faults and start-up time in a large application, and enabling this feature followed the same pattern. </span></div><div class=""><span style="color:rgb(32,33,36);font-size:0.875rem;letter-spacing:0.2px;font-family:Roboto,RobotoDraft,Helvetica,Arial,sans-serif;white-space:nowrap" class="">Even though I didn't see a large regression or complaint in CPU-bound cases, which are not typical of mobile workloads, I wanted to be cautious about enabling it by default.</span></div><div class=""><span style="color:rgb(32,33,36);font-size:0.875rem;letter-spacing:0.2px;font-family:Roboto,RobotoDraft,Helvetica,Arial,sans-serif;white-space:nowrap" class="">However, as with the default outlining case, I don't mind enabling this under -Oz (for minimizing code size) with an opt-out option.</span></div><div class=""><span style="color:rgb(32,33,36);font-size:0.875rem;letter-spacing:0.2px;font-family:Roboto,RobotoDraft,Helvetica,Arial,sans-serif;white-space:nowrap" class=""><br class=""></span></div><div class=""><span style="color:rgb(32,33,36);font-size:0.875rem;letter-spacing:0.2px;font-family:Roboto,RobotoDraft,Helvetica,Arial,sans-serif;white-space:nowrap" class="">Regards,</span></div><div class=""><span style="color:rgb(32,33,36);font-size:0.875rem;letter-spacing:0.2px;font-family:Roboto,RobotoDraft,Helvetica,Arial,sans-serif;white-space:nowrap" class="">Kyungwoo</span></div></div><br class=""><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Mar 24, 2020 at 12:01 PM Vedant Kumar <<a href="mailto:vedant_kumar@apple.com" class="">vedant_kumar@apple.com</a>> wrote:<br class=""></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="overflow-wrap: break-word;" class="">This looks really interesting. 
In the slides, it’s mentioned that the combination of tuning the MachineOutliner for ThinLTO and of optimizing function prolog/epilogs improved measured run-time performance.<div class=""><br class=""></div><div class="">What kind of performance impact do you see from simply homogenizing prolog/epilogs? (If, say, across LNT/aarch64/-Oz the performance impact is not large, it may make sense to have homogenization enabled by default.)</div><div class=""><br class=""></div><div class="">best,</div><div class="">vedant<br class=""><div class=""><br class=""><blockquote type="cite" class=""><div class="">On Mar 23, 2020, at 11:32 PM, Kyungwoo Lee via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank" class="">llvm-dev@lists.llvm.org</a>> wrote:</div><br class=""><div class=""><div dir="ltr" class="">Hello,<br class=""><br class="">I'd like to upstream work we have done over time that the community would benefit from.<br class="">This is part of an effort toward minimizing code size, presented <a href="https://llvm.org/devmtg/2020-02-23/slides/Kyungwoo-GlobalMachineOutlinerForThinLTO.pdf" target="_blank" class="">here</a>. In particular, this RFC is about optimizing prolog and epilog for size.<br class=""><br class=""><b class="">Homogeneous Prolog and Epilog for Size Optimization, <a href="https://reviews.llvm.org/D76570" target="_blank" class="">D76570</a>:</b><br class=""><br class="">The prolog and epilog code that handles callee-saved registers tends to be irregular, with different immediate offsets, so it is not often outlined (by the machine outliner) when optimizing for size. Since D18619, combining stack operations has increased the irregularity further.<br class="">This patch tries to emit homogeneous stores and loads with the same offset for the prolog and epilog, respectively. We have observed that homogeneous prologs and epilogs significantly increase the chance of outlining, resulting in a code size reduction. 
However, a great deal of outlining opportunities were still left, because the current outliner has to conservatively handle instructions that use the return address register, x30.<br class="">Instead, this patch also forms custom-outlined helper functions on demand for the prolog and epilog when lowering the frame code.<br class=""><br class="">- Inject HOM_Prolog and HOM_Epilog pseudo instructions in the Prolog and Epilog Injection pass.<br class="">- Lower and optimize them in the AArch64LowerHomogeneousPrologEpilog pass.<br class="">- Outlined helpers are created on demand. Identical helpers are merged by the linker.<br class="">- An opt-in flag is introduced to enable this feature. A threshold flag is also introduced to control the aggressiveness of outlining according to the application's needs.<br class=""><br class="">This reduced code size by 4% on average for LLVM-TestSuite/CTMark targeting arm64/-Oz. In a large mobile application, the size benefit was even larger, and page faults were reduced as well.<br class=""> <br class=""><b class="">Design Alternatives:</b><br class=""><br class="">1. Expand helpers eagerly by permuting all cases in an earlier module pass. Even though this is rather simple and less invasive, it creates many redundant helpers that need to be elided by the linker.<br class="">2. Turn Prolog-Epilog-Injection into a module pass. This requires plumbing the module through the pass and the architecture-specific frame lowering. I am not sure how other architectures would interact with this module pass.<br class="">3. Provide all helpers in a runtime/compiler-rt. There are many combinations of helpers, and this approach is certainly not flexible.<br class=""><br class="">Regards,<br class="">Kyungwoo<br class=""></div>

</div></blockquote></div><br class=""></div></div></blockquote></div>

</div></blockquote></div><br class=""></div></div></body></html>