[LLVMdev] Proposal: AArch64/ARM64 merge from EuroLLVM

Thu Apr 17 20:00:03 PDT 2014

Hi Quentin,

BTW, the command-line options to enable ADRP CSE should be:

-fno-common -mllvm -global-merge-on-external=true -mllvm -global-merge=true

If you want to measure the baseline, the command line should be:

-mllvm -global-merge=false
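
As a rough sketch of what these options act on, here is a hypothetical C
model of the merged object. The struct name, the function names, and the
helper split into page base and in-page offset are my own illustration and
are not taken from the patch. It only shows why one base address can serve
all merged globals: each one sits at a fixed constant offset from the same
base, and any address splits into a 4 KiB page part (what adrp
materializes) plus an in-page part.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the { i32, i32, i32 } object that global
   merge creates in the IR; the field names are illustrative only. */
struct MergedGlobals {
    int32_t x;
    int32_t y;
    int32_t z;
};

struct MergedGlobals merged;

/* Mirrors the two stores in the test case: fields 0 and 1 of the
   merged object, both reachable from the same base address. */
void store_pair(int32_t a1, int32_t a2) {
    merged.x = a1; /* base + 0 */
    merged.y = a2; /* base + 4 */
}

/* Page address of addr, assuming 4 KiB pages as used by adrp. */
uintptr_t page_base(uintptr_t addr) {
    return addr & ~(uintptr_t)0xFFF;
}

/* In-page offset of addr, i.e. the low 12 bits. */
uintptr_t page_offset(uintptr_t addr) {
    return addr & (uintptr_t)0xFFF;
}
```

For a real measurement one would use separate globals (for example
`int x, y, z;`) so that the pass itself does the merging, and then compare
the generated code with the two command lines above.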

Thanks,
-Jiangning

2014-04-17 1:16 GMT+08:00 Quentin Colombet <qcolombet at apple.com>:

> Hi Jiangning,
>
> On Apr 15, 2014, at 11:12 PM, Jiangning Liu <liujiangning1 at gmail.com>
> wrote:
>
> Hi Quentin,
>
> Thanks for your feedback!
>
>> ARM64 generates the pseudo instructions ARM64::MOVaddr and friends in the
>> ISEL stage, intending to guarantee address serialization (page address +
>> in-page address), and only exposes adrp later in the ExpandPseudoInsts
>> pass. The assumption behind the ARM64 solution is that we don't know at
>> compile time whether the in-page offset can be fused into the load/store,
>> and this assumption no longer holds for the global merge solution I
>> proposed with the patch.
>>
>> I think this is orthogonal. If you happen to merge globals they will have
>> the same base address (i.e., the same pseudo instruction) but different
>> offsets.
>> CSE and such will work like a charm for the pseudos.
>>
>
> This is probably not true. The global merge pass runs at the PreISel stage.
> For my test case at http://reviews.llvm.org/D3223, after applying the patch,
> we get the LLVM IR below,
>
>
>   store i32 %a1, i32* getelementptr inbounds ({ i32, i32, i32 }* @_MergedGlobals_x, i32 0, i32 0), align 4
>   store i32 %a2, i32* getelementptr inbounds ({ i32, i32, i32 }* @_MergedGlobals_x, i32 0, i32 1), align 4
>
> and after ISEL stage, we can see different Machine Instructions generated
> for AArch64 and ARM64.
>
> AArch64:
>
>         %vreg4<def> = ADRPxi <ga:@_MergedGlobals_x>; GPR64noxzr:%vreg4
>         LS32_STR %vreg3, %vreg4, <ga:@_MergedGlobals_x>[TF=11]
>         %vreg5<def> = ADDxxi_lsl0_s %vreg4, <ga:@_MergedGlobals_x>[TF=11]; GPR64noxzr:%vreg5,%vreg4
>         LS32_STR %vreg2, %vreg5<kill>, 1
>
> ARM64:
>
>         %vreg2<def> = ADRP <ga:@_MergedGlobals_x>[TF=1]; GPR64common:%vreg2
>         STRWui %vreg0, %vreg2<kill>, <ga:@_MergedGlobals_x>[TF=18]
>         %vreg3<def> = MOVaddr <ga:@_MergedGlobals_x>[TF=1], <ga:@_MergedGlobals_x>[TF=18]; GPR64common:%vreg3
>         STRWui %vreg1, %vreg3<kill>, 1
>
> The problem is that the MOVaddr generated for ARM64 implies introducing adrp
> again in the ExpandPseudoInsts pass, although at this moment we don't really
> see a redundant ADRP yet. AArch64 uses ADDxxi_lsl0_s instead, and it will
> finally be folded into LS32_STR.
>
> Interesting.
> Looks like we are too clever here.
> I would have expected ISel to generate one base address and one
> displacement.
>
> I believe that if we fix that, both the LOHs and the global merge become
> orthogonal. My guess is that we should be less aggressive about folding the
> offset if there are several uses.
>
>
>> Assuming you emit the right instructions at isel time, you will create
>> ADRP, LOADGot, or ADD with symbols. Since you do not know anything about
>> the symbols, CSE will match only the ones that are identical.
>>
>
> This is correct.
>
>
>> You will have a finer granularity to do CSE, but I am not sure it will
>> help that much.
>>
>
> The 'CSE' here is only a term rather than traditional CSE. Since the
> global variables are merged into a monolithic data structure, we will
> be able to generate only one base address (page address) for all of those
> global variables.
>
>
>> On the other hand, you lose the rematerialization capability, because
>> that feature can only handle one instruction at a time. So you will still
>> be able to rematerialize ADRP but not the LOADGot and ADD with symbols.
>>
>
> Yes, but this depends on register pressure, and it's hard to say that
> rematerialization is always good.
>
> Sure, but it can help :).
>
>> If we simply apply the global merge solution to ARM64, probably we should
>> avoid generating pseudo instruction MOVaddr and friends in ISEL stage, but
>> I'm not sure if the LOH solution would still work or not, because,
>> 1) ARM64 link-time optimization depends on LOH.
>> 2) We don't see linker plug-in in LLVM trunk and it would be hard for me
>> to verify any thoughts.
>>
>> The LOH solution is also orthogonal. You can see that as a last chance
>> way to optimize those accesses.
>> That said, if you CSE the ADRP and not the LOADGot, you will indeed
>> create far fewer candidates for the LOHs because you will have ADRPs with
>> several uses, which is not supported by LOHs.
>>
>
> Yes. This is just what I'm worried about. So essentially those two
> optimizations conflict.
>
> Let us try to fix the codegen problem while keeping the pseudos.
>
>
>
>> FYI, the LOH optimization is not a link-time optimization in the LLVM
>> sense; it is really a link-time optimization on the binary.
>>
>
> Yes. I see.
>
>> Any concrete suggestions for combining those different ADRP CSE solutions
>> and tests would be appreciated!
>>
>> The bottom line is whatever you are doing with merge globals, it is
>> orthogonal with LOHs.
>> That said I think it is best to keep the pseudo instructions.
>>
>
> Well, if we keep the pseudo instruction MOVaddr, we would have to keep
> adrp and expose it in the binary, so we would lose the opportunity to
> remove redundant adrp at compile time.
>
>
>> Of course I may be wrong and the best way to check would be to measure
>> what happens if you get rid of the pseudo instructions. Do not be too
>> concerned with the impact on the LOHs.
>>
>
> Since compile-time ADRP CSE is not as powerful as link-time ADRP removal,
> I don't want to hurt the link-time solution.
>
> Well, this is something that should be measured. Your patch does not kill
> the LOHs, it may just reduce the number of potential candidates. For each
> candidate that your patch removes, it means we at least spare one ADRP
> instruction. The trade-off does not seem bad.
>
> I suggest we:
> 1. Fix the ISel of pseudo (making the folding less aggressive).
> 2. Measure the performance with your patch.
>
> I can definitely help with the measurements with the LOHs enabled in
> parallel with your patch.
> If you want, I can help with #1 too.
>
> Side question, did you happen to measure any performance
> improvement/regression with your patch?
> I’d like to know which tests would be good candidates to measure the
> impact of your patch + LOHs enabled.
>
> Thanks,
> -Quentin
>
>
>> Thanks,
>> -Quentin
>>
>>
>> Thanks,
>> -Jiangning
>>