[LLVMdev] [RFC] AArch64: Should we disable GlobalMerge?

Fri Feb 27 15:45:40 PST 2015

On Fri, Feb 27, 2015 at 2:21 PM, Eric Christopher <echristo at gmail.com> wrote:
>> > Before making the disabling darwin only I'd like to see some analysis of
>> > the
>> > regressions/improvements. Has anyone looked at the code for those yet?
>>
>> Yep, I put a quick analysis in my other reply.
>
>
> The LOH/ADRP bit?
>
>>
>>
>> >
>> >>
>> >> As for other targets, as a first step, making the pass run under -O3
>> >> rather than -O1 is hopefully agreeable to everyone?  After all, it is
>> >> "aggressive", and isn't always profitable.  That's pretty much the
>> >> description of -O3.
>> >> We can still run into problematic cases under LTO, though.
>> >>
>> >
>> > Seems reasonable to me, but probably want to see what happens with the
>> > above
>> > questions first.
>>
>> Fair enough.  Bottom line is:
>> - disabling it without LTO is a slight win on the test-suite, a solid
>> win everywhere else I've looked.
>> - disabling it with LTO regresses quite a few SPEC benchmarks, and is
>> overall a slight regression on the test-suite.
>>
>
> Ah, I meant an analysis of the code, not just the numbers. I think the
> ADRP/LOH commentary really helps. It might only be a decent LTOish
> optimization, but I'm still curious how it's helping there over other
> optimizations.

Basically (and I think this is what Renato is asking as well),
GlobalMerge doesn't really interact with later optimizations.
Throughout most of the backend, we keep each global reference (e.g.,
adrp+add) together as a single pseudo instruction (MOVaddr, LOADgot,
...), and only expand it to adrp+add/... very late.  So the only thing
that helps afterwards is the LOH linker optimizations, which try to
simplify some of the adrp sequences.  Really, the backend is oblivious
to the fact that global references aren't trivial.  We don't try to CSE
the adrp's, for instance (I believe there was a patch for that; Quentin
and Jiangning might know more).  Does that clarify things a bit?

Looking at the generated code, there are two main problematic situations:
- the register pressure tradeoff:

Consider:

adrp x8, 133
ldr x8, [x8, #3568]
...
adrp x8, 133
ldr x0, [x8, #3576]

Turning into:

adrp x19, 133
add x19, x19, #3392
ldr x8, [x19, #192]
...
ldr x0, [x19, #200]

- an additional instruction when only one global from a merged set is
accessed (or when the LOH optimizations fired)

Consider the similar case:

adrp x20, 133
ldr x8, [x20, #3432]
...
str x0, [x20, #3432]

Turning into:

adrp x20, 133
add x20, x20, #3392
ldr x8, [x20, #56]
...
str x0, [x20, #56]
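
As a rough C-level sketch of the transformation behind both cases above:
the names (counter, limit, merged_globals, and the functions) are made up
for illustration, and the struct is only a model of the merged layout.
The real pass operates on LLVM IR globals, not on C source.

```c
#include <assert.h>

/* Before merging: two independent file-static globals.  Each one is
 * addressed on its own (on AArch64, adrp plus a page-offset ldr/str). */
static long counter;
static long limit;

long read_both_unmerged(void) {
    return counter + limit;
}

/* After merging (conceptual model only): the globals live in a single
 * aggregate, so every access is "merged base + constant offset". */
static struct {
    long counter;
    long limit;
} merged_globals;

long read_both_merged(void) {
    return merged_globals.counter + merged_globals.limit;
}

/* Touching only one member still goes through the merged base, which
 * corresponds to the second situation described above. */
void bump_one_merged(void) {
    merged_globals.counter += 1;
}

/* The two layouts must be observably equivalent; only addressing differs. */
void demo_equivalence(void) {
    counter = merged_globals.counter = 2;
    limit = merged_globals.limit = 40;
    assert(read_both_unmerged() == read_both_merged());
}
```

Whether the merged form wins depends on how many members are used near
each other and on whether the base has to stay live in a register.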

One positive case is explained in the GlobalMerge.cpp comments: it
reduces register pressure in a loop by using a single base register
for multiple globals.

Another positive is that merging globals effectively CSEs the base
address computation.
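
To make the loop case concrete, here is a small C sketch along the lines
of the example in the GlobalMerge.cpp comments.  The array size, the
struct, and the function names are assumptions for illustration; the
struct only models the merged layout.

```c
#include <assert.h>
#include <stddef.h>

#define N 8

/* Unmerged: the loop needs three separate base addresses. */
static int foo[N], bar[N], baz[N];

void multiply_unmerged(void) {
    for (size_t i = 0; i < N; ++i)
        foo[i] = bar[i] * baz[i];
}

/* Merged model: one base address, each array at a constant offset. */
static struct {
    int foo[N];
    int bar[N];
    int baz[N];
} merged;

void multiply_merged(void) {
    for (size_t i = 0; i < N; ++i)
        merged.foo[i] = merged.bar[i] * merged.baz[i];
}

/* Fill both layouts identically and check they compute the same products. */
void check_layouts_agree(void) {
    for (size_t i = 0; i < N; ++i) {
        bar[i] = merged.bar[i] = (int)i + 1;
        baz[i] = merged.baz[i] = 2;
    }
    multiply_unmerged();
    multiply_merged();
    for (size_t i = 0; i < N; ++i)
        assert(foo[i] == merged.foo[i]);
}
```

In the merged version a single base register can serve all three arrays,
and that one base computation is shared by every access.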

> Anyhow, FWIW I'm in favor of pulling it out of the non-LTO pipeline
> universally.

I tend to agree, but it's still sometimes useful in non-LTO builds.  One
case that came up in benchmarks was a bunch of file-static globals used
pervasively in a single file (I believe lex/yacc can generate this
kind of thing).  There it's very beneficial, even without LTO.  Hence
my suggestion: run it at -O3, with -mno-global-merge available if
necessary.
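
For illustration, the kind of file I have in mind looks roughly like the
following.  This is a hand-written toy scanner, not actual lex/yacc
output, and every name in it is hypothetical; the point is only that a
handful of file-static globals are touched by nearly every function.

```c
#include <assert.h>
#include <stddef.h>

/* File-static scanner state, used pervasively below. */
static const char *scan_pos;
static size_t scan_line;
static size_t token_count;

void scan_begin(const char *src) {
    assert(src != NULL);
    scan_pos = src;
    scan_line = 1;
    token_count = 0;
}

/* Consumes one whitespace-delimited token.
 * Returns 1 if a token was consumed, 0 at end of input. */
int scan_next(void) {
    while (*scan_pos == ' ' || *scan_pos == '\n') {
        if (*scan_pos == '\n')
            scan_line++;
        scan_pos++;
    }
    if (*scan_pos == '\0')
        return 0;
    while (*scan_pos != '\0' && *scan_pos != ' ' && *scan_pos != '\n')
        scan_pos++;
    token_count++;
    return 1;
}

size_t scan_tokens_seen(void) { return token_count; }
size_t scan_current_line(void) { return scan_line; }
```

With state like this, a merged base address is likely to be reused within
each function, which is the situation where merging pays off.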

-Ahmed

> -eric
>
>>
>> -Ahmed
>>
>> > -eric
>> >