Disable arm-gepopt by default

James Molloy james at jamesmolloy.co.uk
Wed Apr 22 02:15:52 PDT 2015


r235431.

On Wed, 22 Apr 2015 at 08:45 James Molloy <james at jamesmolloy.co.uk> wrote:

> Hi Michael and Gerolf,
>
> Gerolf, I agree in general terms. However, in gobmk the madds being
> removed are not being strength-reduced; they are redundant. Not only
> that, but they keep two registers live as arguments, so we're reducing
> both instruction count and register pressure without increasing any
> other metric. The size of the gain will, however, depend on how many
> multiply functional units your microarchitecture has.
>
> Michael, thanks for that explanation. It shows that there really is a
> problem where GEP matching is superior to arithmetic matching. It may
> just be that the pointer arithmetic emitted by GEP lowering differs
> slightly from the canonical form that our instruction selector is tuned
> for. But it's certainly nontrivial, so that, combined with the other
> issues, makes me inclined to turn the pass off by default.
>
> However, I robustly defend the position I took. I don't think pointing
> the finger at a pass because of one testcase over nine months later,
> and asking to disable it without looking for an underlying cause, is
> the best way of improving LLVM. If the pass were newly added, I would
> of course feel differently.
>
> Cheers,
>
> James
>
>
>
> On Wed, 22 Apr 2015 at 03:00 Michael Zolotukhin <mzolotukhin at apple.com>
> wrote:
>
>> Hi James,
>>
>> I’ve spent some time getting familiar with the source and looking into
>> the generated assembly, and here is what I found.
>>
>> First, let me start with some examples where the pass made some
>> non-optimal transformations. The code snippets are from random
>> functions (not necessarily hot ones), but they should still give the
>> idea.
>>
>> *Case 1:*
>> *gepopt=true+D9148:*
>> mul x8, x8, x9
>> stp x11, x8, [sp, #32]
>> add x8, x25, #8
>> str x8, [sp, #48]
>> add x8, x25, #56
>> str x8, [sp, #16]
>> movz w8, #72
>> mul x22, x26, x8
>> add x8, x25, x22
>> ldrb w10, [x8, #23]
>> sxtb w9, w10
>> ldr x11, [sp, #48]
>> ldr x11, [x11, x22]
>>
>> *gepopt=false:*
>> mul x28, x8, x9
>> movz w8, #72
>> madd x8, x26, x8, x25
>> ldrb w10, [x8, #23]
>> sxtb w9, w10
>> ldr x11, [x8, #8]
>>
>> In this case we have a load from the address [x25 + 8 + x26*72], and
>> the pass split the expression into two parts: (x25 + 8) and (x26*72).
>> That increased register pressure, and we had to generate a spill/fill
>> pair for the (x25 + 8) expression. Without the pass we generated a
>> madd and a load with a simple addressing mode.
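>>
>> To make this concrete, here is a rough IR-level sketch of the split
>> (hand-written for illustration, not IR taken from the benchmark;
>> assume the pointer has already been lowered to i64 arithmetic):
>>
>>   %mul  = mul i64 %i, 72        ; x26 * 72
>>   %off  = add i64 %base, 8      ; x25 + 8, kept live (and here spilled)
>>   %addr = add i64 %off, %mul    ; the final address
>>
>> Keeping the expression together as %base + %i * 72 + 8 instead lets
>> the selector emit a single madd and fold the +8 into the load’s
>> addressing mode.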
>>
>>
>> *Case 2:*
>> *gepopt=true+D9148:*
>> ldr x22, [x0, #136]
>> orr w9, wzr, #0x18
>> mul x8, x8, x9
>> add x23, x22, x8
>> str wzr, [x23]
>> orr x26, x8, #0x4
>> str wzr, [x26, x22]
>> add x24, x8, #8
>> str wzr, [x24, x22]
>> add x25, x8, #16
>> ldr x27, [x25, x22]
>> cbz x27, 0xABCDABCD
>>
>> *gepopt=false:*
>> ldr x9, [x0, #136]
>> orr w10, wzr, #0x18
>> madd x22, x8, x10, x9
>> str xzr, [x22]
>> str wzr, [x22, #8]
>> ldr x23, [x22, #16]
>> cbz x23, 0xABCDABCD
>>
>> In this case the hoisted common part seems to be just x22, and we
>> then generate stores to [x22 + x8*0x18], [x22 + x8*0x18 + 4], and
>> [x22 + x8*0x18 + 8]. The new addresses also seem to be more confusing
>> for other optimization passes, so we weren’t able to merge adjacent
>> stores as we did with gepopt=false. We also didn’t fuse the ‘add’ and
>> ‘mul’ into a ‘madd’ in this case.
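>>
>> For reference, a sketch of the shape the backend can pair (invented
>> names, not code from the benchmark): when the stores index one pointer
>> at small constant offsets,
>>
>>   %p1 = getelementptr i32* %rec, i64 1
>>   store i32 0, i32* %rec        ; [x22]
>>   store i32 0, i32* %p1         ; [x22, #4] - pairable with the above
>>
>> the load/store optimizer sees reg+imm addressing off a common base;
>> once the offsets are hidden behind separate ‘orr’/‘add’ results, it
>> can’t.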
>>
>>
>> Looking into these examples, I thought that the problem might be that
>> we perform this transformation always and unconditionally - we don’t
>> check whether it’s beneficial at all. That would be fine if the
>> transformation were free, i.e. if codegen could always roll everything
>> back. But it isn’t free, and in cases where we don’t actually hit the
>> target (i.e. hoist a common part of several GEP expressions), we often
>> lose.
>>
>> I see how this pass can be beneficial for targets like NVPTX, which
>> (according to comments in SeparateConstOffsetFromGEP.cpp) don’t
>> support complex addressing modes with several registers. However, it’s
>> not that clear whether it’s beneficial on targets like ARM64. I’m
>> inclined to believe that turning this pass on by default, in its
>> current state, was the ‘large hammer’. I still think it could be
>> beneficial, but it needs some more work first.
>>
>> Thanks,
>> Michael
>>
>> On Apr 21, 2015, at 11:38 AM, James Molloy <james at jamesmolloy.co.uk>
>> wrote:
>>
>> Hi Michael,
>>
>> Thanks. I'll take a look tomorrow. At least this has nothing to do
>> with the pass itself - if reassociation regresses instruction
>> selection, something is fragile!
>>
>> Will you have time to look at this one today? It's now 19:30 here and I'm
>> away for the evening.
>>
>> Cheers,
>>
>> James
>> On Tue, 21 Apr 2015 at 19:36, Michael Zolotukhin <mzolotukhin at apple.com>
>> wrote:
>>
>>> Hi James,
>>>
>>> It looks like the first patch gives nearly the same results as both
>>> patches together (at least in terms of gains and regressions on
>>> Geekbench). For example, the code generated for Mandelbrot is the
>>> same.
>>>
>>> Thanks,
>>> Michael
>>>
>>> On Apr 21, 2015, at 11:29 AM, James Molloy <james at jamesmolloy.co.uk>
>>> wrote:
>>>
>>> Hi Michael,
>>>
>>> Jingyue pointed out that my second patch did pessimize some code, so
>>> do you still see the regressions with my first patch (the reassoc
>>> one)?
>>>
>>> Cheers,
>>>
>>> James
>>> On Tue, 21 Apr 2015 at 19:26, Michael Zolotukhin <mzolotukhin at apple.com>
>>> wrote:
>>>
>>>> Hi James,
>>>>
>>>> Thank you for taking the time to work on this! I tried the patches
>>>> you posted and they did fix the problem on twofish. However, with
>>>> them there are other regressions - and while you are at it, could
>>>> you please take a look at the generated code? Perhaps you can come
>>>> up with another one-line fix.
>>>>
>>>> So, one regression is in Geekbench/Mandelbrot; here are the ‘fast’
>>>> and ‘slow’ disassembly files (fast is from the reference compiler,
>>>> slow from the reference compiler + D9148 + D9149).
>>>>
>>>>
>>>> I can see that we generate fewer ‘madd’ instructions, but in some
>>>> cases we end up with less optimal sequences, including ‘mul’, ‘add’,
>>>> and ‘orr’ instead. Also, in some cases we now generate two ‘str’
>>>> instructions instead of one ‘stp’.
>>>>
>>>> Any ideas about what might be going wrong this time?
>>>>
>>>> Thanks,
>>>> Michael
>>>>
>>>> On Apr 21, 2015, at 6:55 AM, James Molloy <james at jamesmolloy.co.uk>
>>>> wrote:
>>>>
>>>> See:
>>>>
>>>> http://reviews.llvm.org/D9148
>>>> http://reviews.llvm.org/D9149
>>>>
>>>> Cheers,
>>>>
>>>> James
>>>>
>>>> On Tue, 21 Apr 2015 at 12:06 James Molloy <james at jamesmolloy.co.uk>
>>>> wrote:
>>>>
>>>>> Hi Michael, Gerolf,
>>>>>
>>>>> The main goal of this optimization was to reduce the number of
>>>>> multiply-adds emitted when complex GEPs are used. With regard to
>>>>> SPEC, this is particularly prevalent in sjeng and gobmk (although
>>>>> we don't just care about SPEC, and there were plenty of other
>>>>> places we caught this idiom).
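>>>>>
>>>>> For anyone following along, here is a sketch of the idiom in
>>>>> question (invented types, not code from either benchmark): indexing
>>>>> a 2-D array,
>>>>>
>>>>>   %p = getelementptr [19 x [19 x i32]]* %board, i64 0, i64 %i, i64 %j
>>>>>
>>>>> whose address is base + %i * 76 + %j * 4, i.e. a multiply-add per
>>>>> access.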
>>>>>
>>>>> The pass is actively reducing the number of multiply-adds in gobmk:
>>>>>
>>>>> $ llc engine.ll -O3 -aarch64-gep-opt=false -o - |& grep madd | wc -l
>>>>>  297
>>>>>
>>>>> $ llc engine.ll -O3 -aarch64-gep-opt=true -o - |& grep madd | wc -l
>>>>>  158
>>>>>
>>>>> So it demonstrably is making at least some code better.
>>>>>
>>>>> I was able to reproduce your twofish example. Twofish, like most
>>>>> crypto algorithms, has a pattern repeated many times, so the
>>>>> problem could just come down to one pattern producing one extra
>>>>> instruction - which, it turns out, is exactly the case.
>>>>>
>>>>> The problem is that with gep-opt enabled, we're associating shifts and
>>>>> constants the other way around.
>>>>>
>>>>> With gep-opt disabled (synthetic code example):
>>>>>    add x0, x0, #72
>>>>>    ldr x0, [x1, x0, lsl #2]
>>>>>
>>>>> With gep-opt enabled:
>>>>>    add x0, x1, x0, lsl #2
>>>>>    ldr x0, [x0, #72]
>>>>>
>>>>> This is actually fine, until another expression uses the same
>>>>> offset with a different shift:
>>>>>
>>>>> With gep-opt disabled:
>>>>>    add x0, x0, #72
>>>>>    ldr x3, [x1, x0, lsl #2]
>>>>>    ldr x2, [x1, x0, lsl #3]
>>>>>
>>>>> With gep-opt enabled:
>>>>>    add x3, x1, x0, lsl #2
>>>>>    add x2, x1, x0, lsl #3
>>>>>    ldr x3, [x3, #72]
>>>>>    ldr x2, [x2, #72]
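>>>>>
>>>>> In arithmetic terms (a sketch with made-up constants, not the
>>>>> twofish IR), the first association shares the inner add between
>>>>> both loads:
>>>>>
>>>>>   %t  = add i64 %i, 72          ; shared
>>>>>   %a1 = shl i64 %t, 2           ; offset (i + 72) << 2
>>>>>   %a2 = shl i64 %t, 3           ; offset (i + 72) << 3
>>>>>
>>>>> while the lowered-GEP form shifts first, so each load needs its own
>>>>> distributed constant:
>>>>>
>>>>>   %s1 = shl i64 %i, 2
>>>>>   %b1 = add i64 %s1, 288        ; 72 << 2
>>>>>   %s2 = shl i64 %i, 3
>>>>>   %b2 = add i64 %s2, 576        ; 72 << 3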
>>>>>
>>>>> Amusingly, I saw the opposite behaviour on AArch32 on a different
>>>>> testcase and came to the conclusion that expressions with correlated
>>>>> immediate offsets and identical shift amounts were more likely than
>>>>> expressions with correlated shift amounts and identical immediate offsets,
>>>>> but that's a different story.
>>>>>
>>>>> Anyway, the reason this is happening is that we're not running
>>>>> reassociate after GEP expansion. If we run reassociate, it all goes
>>>>> away; it's a one-line fix.
>>>>>
>>>>> The two testcases that were disabled - well, they weren't really
>>>>> disabled, but the expected behaviour was changed for them. I put
>>>>> that down to a bug in the SeparateConstOffsetFromGEP pass, where it
>>>>> tries to separate a GEP with only one variable offset:
>>>>>
>>>>>   GEP i8* %base, i32 %offset
>>>>>
>>>>> Nothing beneficial can come of that, so I've got, again, a
>>>>> single-line patch to fix it. I'll push both patches upstream as
>>>>> soon as I've verified them.
>>>>>
>>>>> So, in summary, I don't see any need to disable this pass. It's
>>>>> doing good work; it just has one or two bugs.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> James
>>>>>
>>>>> On Tue, 21 Apr 2015 at 05:27 Michael Zolotukhin <mzolotukhin at apple.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Tim and others,
>>>>>>
>>>>>> So, the entire discussion revolves around the assumption that this
>>>>>> pass is beneficial in some cases, but we don’t have data for that.
>>>>>> Please note that I don’t mean to say the assumption is wrong, but
>>>>>> as far as I understand, only Hao has observed gains. Thus, until we
>>>>>> hear from him, we have real slowdowns versus presumed gains.
>>>>>>
>>>>>> As for looking into the code later - I’m not an expert in this
>>>>>> area, and I don’t even have a motivating example for this
>>>>>> optimization, so I’d hardly be the right person for it. I’d be glad
>>>>>> to assist with collecting additional data or running tests if that
>>>>>> would help, though.
>>>>>>
>>>>>> Thanks,
>>>>>> Michael
>>>>>>
>>>>>>
>>>>>> > On Apr 20, 2015, at 8:40 PM, Tim Northover <t.p.northover at gmail.com>
>>>>>> > wrote:
>>>>>> >
>>>>>> > On 20 April 2015 at 20:34, Gerolf Hoflehner <ghoflehner at apple.com>
>>>>>> > wrote:
>>>>>> >> It shouldn’t come across as ‘extreme’.
>>>>>> >
>>>>>> > The "extreme" part wasn't so much the tentative disabling as
>>>>>> > "Could you and/or Hao take this on after the code is
>>>>>> > disabled/backed out?". I realise it wasn't an ultimatum, but it
>>>>>> > could very easily turn into "we're washing our hands of this",
>>>>>> > which I'd object to.
>>>>>> >
>>>>>> > I assume the pass was committed with some kind of metric that it
>>>>>> > improved, and now we have a counter-example. It's down to all of
>>>>>> > us to decide on the best way to proceed.
>>>>>> >
>>>>>> > Cheers.
>>>>>> >
>>>>>> > Tim.
>>>>>> >
>>>>>>
>>>>>
>>>>
>>>
>>