[llvm-dev] [RFC][AArch64] Homogeneous Prolog and Epilog for Size Optimization

Wed Mar 25 17:46:32 PDT 2020

I understand it would be interesting to see the performance impact on a set
of benchmarks, even under -Oz.
However, I'm not familiar with LNT and its process. I assume it does not
require running tests on local (arm64) devices, right? If it does, I have no
resources to measure them locally. The large benchmark and the rough
performance implications I mentioned come from internal automated tests
that I simply submitted to, and unfortunately I cannot share the details.
If running LNT does not require a local device, can you point me to how I
can submit to or access such infrastructure to test a new compiler?

Regards,
Kyungwoo

On Wed, Mar 25, 2020 at 11:17 AM Vedant Kumar <vedant_kumar at apple.com>
wrote:

> I see. I think it’d help with the upstreaming effort to have some more
> concrete details about performance measurements, so that potential adopters
> can get a rough understanding of the expected impact. In particular, if you
> could share:
>
> - a run-time performance comparison over a representative subset of
> benchmarks from LNT (aarch64/-Oz), taken from a stabilized device
> - some explanation for any performance differences seen in ^
> - ditto for a code size comparison over LNT
> - some brief explanation of the methodology used to measure app startup
> time and the # of page faults before app startup completes
>
> That would be very valuable.
>
> best,
> vedant
>
> On Mar 24, 2020, at 2:04 PM, Kyungwoo Lee <kyulee.llvm at gmail.com> wrote:
>
> Hi Vedant,
>
> Thanks for your interest and comments.
> Size optimization reduces page faults and start-up time for a large
> application, and enabling this feature followed the same trend.
> Even though I didn't see a large regression/complaint in CPU-bound cases,
> which are not typical for mobile workloads, I wanted to be cautious about
> enabling it by default.
> However, as with the default outlining case, I don't mind enabling this
> under -Oz (for minimizing code size) with an opt-out option.
>
> Regards,
> Kyungwoo
>
> On Tue, Mar 24, 2020 at 12:01 PM Vedant Kumar <vedant_kumar at apple.com>
> wrote:
>
>> This looks really interesting. In the slides, it’s mentioned that the
>> combination of tuning the MachineOutliner for ThinLTO and of optimizing
>> function prolog/epilogs improved measured run-time performance.
>>
>> What kind of performance impact do you see from simply homogenizing
>> prolog/epilogs? (If, say across LNT/aarch64/-Oz the performance impact is
>> not large, it may make sense to have homogenization enabled by default.)
>>
>> best,
>> vedant
>>
>> On Mar 23, 2020, at 11:32 PM, Kyungwoo Lee via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>> Hello,
>>
>> I'd like to upstream work we have done over time, which the community
>> could benefit from.
>> This is part of the effort toward minimizing code size presented here
>> <https://llvm.org/devmtg/2020-02-23/slides/Kyungwoo-GlobalMachineOutlinerForThinLTO.pdf>.
>> In particular, this RFC is about optimizing prolog and epilog for size.
>>
>> *Homogeneous Prolog and Epilog for Size Optimization, D76570
>> <https://reviews.llvm.org/D76570>:*
>>
>> The prolog and epilog that handle callee-saved registers tend to be
>> irregular, with different immediate offsets, so they are not often
>> outlined (by the machine outliner) when optimizing for size. Since D18619,
>> combining stack operations has increased this irregularity further.
>> This patch tries to emit homogeneous stores and loads with the same
>> offset for prolog and epilog respectively. We have observed that this
>> homogeneous prolog and epilog significantly increased the chance of
>> outlining, resulting in a code size reduction. However, there was still a
>> great deal of outlining opportunity left because the current outliner had
>> to conservatively handle instructions with the return register, x30.
>> Therefore, this patch also forms a custom-outlined helper function on
>> demand for prolog and epilog when lowering the frame code.
>>
>> - Injects HOM_Prolog and HOM_Epilog pseudo instructions in the Prolog and
>> Epilog Injection Pass
>> - Lowers and optimizes them in the AArch64LowerHomogeneousPrologEpilog Pass
>> - Outlined helpers are created on demand. Identical helpers are merged by
>> the linker.
>> - An opt-in flag is introduced to enable this feature. Another threshold
>> flag is also introduced to control the aggressiveness of outlining to suit
>> the application's needs.
>>
>> This reduced code size by 4% on average for LLVM-TestSuite/CTMark
>> targeting arm64/-Oz. In a large mobile application, the size benefit was
>> even larger, and page faults were reduced as well.
>>
>> *Design Alternatives:*
>>
>> 1. Expand helpers eagerly by permuting all cases in an earlier module
>> pass. Even though this is rather simple and less invasive, it creates many
>> redundant helpers which need to be elided by the linker.
>> 2. Turn Prolog-Epilog-Injection into a module pass. This needs to plumb
>> the module through the pass and the architecture-specific frame lowering.
>> It is unclear how other architectures would interact with this module pass.
>> 3. Runtime/compiler-rt for all helpers. There are many combinations of
>> helpers, and this approach is certainly not flexible.
>>
>> Regards,
>> Kyungwoo
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>>
>
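
A toy sketch of the motivation in the RFC quoted above (this is not code from
D76570). It models prolog stores as strings and counts how many distinct
sequences arise across functions that save the same register pairs but have
different local frame sizes. The function names, the example register pairs,
and the exact offsets are assumptions made for illustration; real AArch64
frame lowering differs in detail.

```python
# Toy model only: instruction strings stand in for real machine code.

def combined_prolog(pairs, locals_size):
    """Assumed 'combined' form: the first store also allocates the whole
    frame, so its immediate depends on the function's local size."""
    frame = 16 * len(pairs) + locals_size
    insts = [f"stp {pairs[0][0]}, {pairs[0][1]}, [sp, #-{frame}]!"]
    for i, (a, b) in enumerate(pairs[1:], start=1):
        insts.append(f"stp {a}, {b}, [sp, #{16 * i}]")
    return tuple(insts)

def homogeneous_prolog(pairs, locals_size):
    """Assumed 'homogeneous' form: every pair is pushed with the same
    offset; locals_size is ignored here because locals are assumed to be
    allocated by a separate instruction outside the shared sequence."""
    return tuple(f"stp {a}, {b}, [sp, #-16]!" for a, b in pairs)

def distinct_sequences(prolog_fn, functions):
    """Number of different prolog sequences over all functions."""
    return len({prolog_fn(pairs, size) for pairs, size in functions})

# Hypothetical functions: same callee-saved pairs, different local sizes.
PAIRS = (("x29", "x30"), ("x20", "x19"))
FUNCTIONS = [(PAIRS, 0), (PAIRS, 16), (PAIRS, 32)]

print(distinct_sequences(combined_prolog, FUNCTIONS))
print(distinct_sequences(homogeneous_prolog, FUNCTIONS))
```

In this model the combined form yields a different sequence per function,
while the homogeneous form yields a single sequence that all three functions
could share through one outlined helper.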