[llvm-dev] Dead store elimination in the backend for -ftrivial-auto-var-init

Mon Aug 26 09:33:38 PDT 2019

On Thu, Aug 8, 2019 at 8:43 PM Vitaly Buka <vitalybuka at google.com> wrote:
>
> I've updated the patch to produce this code:
>
>   .text
>   .file  "auto-var-init-dse.c"
>   .globl  ipt_do_table            # -- Begin function ipt_do_table
>   .p2align  4, 0x90
>   .type  ipt_do_table, at function
> ipt_do_table:                           # @ipt_do_table
>   .cfi_startproc
> # %bb.0:                                # %entry
>   pushq  %rbx
>   .cfi_def_cfa_offset 16
>   subq  $40, %rsp
>   .cfi_def_cfa_offset 56
>   .cfi_offset %rbx, -16
>   movl  $0, 4(%rsp)
>   movl  $0, 33(%rsp)
>   movl  $0, 36(%rsp)
>   xorl  %eax, %eax
>   callq  m
> ....
Great!
I'll try to give these patches a shot this week.
> These movl can't be removed at it's paddings never initialized here, and we don't know if q() will read them.
> However these cases relatively rare, on Chromium I see less than 0.01% binary size improvement.
There's less than 50 functions consuming >1% performance in the
kernel, but saving a couple of instructions here and there can give a
notable performance boost.
>
> On Tue, Aug 6, 2019 at 6:42 PM Vitaly Buka <vitalybuka at google.com> wrote:
>>
>> There are two problems:
>> 1. padding after union and call to q(), without LTO we can't remove that store.
>> 2. shortcut which I have which ignores all instructions q() . this assume that memset to acpar.match, acpar.matchinfo also useful which is not true. I should be able to improve this case.
>>
>> On Thu, Aug 1, 2019 at 11:29 PM Vitaly Buka <vitalybuka at google.com> wrote:
>>>
>>> On a first look case like this should nor be a problem. The tail of the memset here is unused because it's replaced immediately without reads . So it can be trimmed. I will try the patch and let you know.
>>>
>>> On Thu, Aug 1, 2019 at 9:51 AM Alexander Potapenko <glider at google.com> wrote:
>>>>
>>>> On Thu, Aug 1, 2019 at 6:38 PM JF Bastien <jfbastien at apple.com> wrote:
>>>> >
>>>> >
>>>> >
>>>> > > On Aug 1, 2019, at 9:20 AM, Alexander Potapenko <glider at google.com> wrote:
>>>> > >
>>>> > > On Thu, Aug 1, 2019 at 6:09 PM JF Bastien <jfbastien at apple.com> wrote:
>>>> > >>
>>>> > >> Hi Alexander,
>>>> > >>
>>>> > >> The code doesn’t compile. Could you send a godbolt.org link that shows the issue?
>>>> > > Sorry about that, here's the link: https://godbolt.org/z/-PinQP
>>>> > >
>>>> > > Lines 4 to 8 are initializing |acpar|.
>>>> > > If I'm understanding correctly, the store to 8(%rsp) at line 7 can be
>>>> > > removed because of the store at line 35.
>>>>
>>>> >
>>>> > These are in different basic blocks. IIRC there are a bunch of cases where DCE just fails when there’s any control flow. I think this is what you’re seeing here.
>>>> I think Vitaly's patches (https://reviews.llvm.org/D61879) should
>>>> allow DSE to work across basic blocks. But, as you mentioned, the
>>>> problem is that we only have a memset at IR level.
>>>> >
>>>> > > Same for store to 24(%rsp), which is later overwritten at lines 11 and 16.
>>>> >
>>>> > I’m not sure what we should do here… If it were a constant store we’d fold it in, but here both stores are variables so it’s not clear that we really should be doing much in MemCpyOptimizer.
>>>> > The problem I see here is: there’s a small memset, followed by variable stores to the same location. DCE’ing requires realizing that the memset is small enough that the backend will expand it, at which point DCE will kick in. MemCpyOptimizer doesn’t know what “small” is, partly because it’s based on alignment and optimization level. Maybe it can be a target hook (“what would you do with this memset?”), that MemCpyOptimizer uses when it sees a memset followed by variable stores to the same location? We already fold in constant stores (replace the m() call by one)… but again this isn’t clearly a win: if you enable SSE you’ll get wide stores, followed by stores of smaller size. With SSE I’m not sure I’d want us to remove the MOVAPS instructions.
>>>> Yes, this depends very much on the backend flags, therefore it might
>>>> be too early to lower that memset before DSE.
>>>> I was thinking about some kind of a peephole optimization, that
>>>> deletes stores to the same memory location (at least for %rsp-based
>>>> addresses) in the backend.
>>>> I suspect AArch64 has something like this already, at least we don't
>>>> see these redundant stores there.
>>>>
>>>>
>>>> >
>>>> >
>>>> > >> Thanks,
>>>> > >>
>>>> > >> JF
>>>> > >>
>>>> > >>> On Aug 1, 2019, at 7:21 AM, Alexander Potapenko <glider at google.com> wrote:
>>>> > >>>
>>>> > >>> Hi folks,
>>>> > >>>
>>>> > >>> When compiling the attached example with -ftrivial-auto-var-init=zero:
>>>> > >>>
>>>> > >>> $ clang   -no-integrated-as  -mno-sse -m64 -mstack-alignment=8    -O2
>>>> > >>> -ftrivial-auto-var-init=zero
>>>> > >>> -enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang
>>>> > >>> -g  -o ipt.ll -c ipt.i -w -S -emit-llvm
>>>> > >>>
>>>> > >>> , Clang generates an initialization memset() call for |acpar| in the IR:
>>>> > >>>
>>>> > >>>   %0 = bitcast %struct.xt_action_param* %acpar to i8*, !dbg !27
>>>> > >>>   call void @llvm.memset.p0i8.i64(i8* nonnull align 8 %0, i8 0, i64
>>>> > >>> 40, i1 false), !dbg !28
>>>> > >>>
>>>> > >>> Clang only splits memsets into series of stores under certain
>>>> > >>> conditions, so it's hard to remove redundant stores on the IR level
>>>> > >>> (even with the recent DSE patches: https://reviews.llvm.org/D61879).
>>>> > >>> In the resulting assembly, however, the memset is lowered into a
>>>> > >>> series of MOVQ instructions (see the attached ipt.s file), of which at
>>>> > >>> least the stores to 24(%rsp) and 16(%rsp) are redundant.
>>>> > >>>
>>>> > >>> Given that at MCInst level we already know if a memset is split into a
>>>> > >>> sequence of stores, can we do a better job by making the backend
>>>> > >>> delete these redundant stores for us?
>>>> > >>>
>>>> > >>> Alex
>>>> > >>>
>>>> > >>> --
>>>> > >>> Alexander Potapenko
>>>> > >>> Software Engineer
>>>> > >>>
>>>> > >>> Google Germany GmbH
>>>> > >>> Erika-Mann-Straße, 33
>>>> > >>> 80636 München
>>>> > >>>
>>>> > >>> Geschäftsführer: Paul Manicle, Halimah DeLaine Prado
>>>> > >>> Registergericht und -nummer: Hamburg, HRB 86891
>>>> > >>> Sitz der Gesellschaft: Hamburg
>>>> > >>> <ipt.i><ipt.s>
>>>> > >>
>>>> > >
>>>> > >
>>>> > > --
>>>> > > Alexander Potapenko
>>>> > > Software Engineer
>>>> > >
>>>> > > Google Germany GmbH
>>>> > > Erika-Mann-Straße, 33
>>>> > > 80636 München
>>>> > >
>>>> > > Geschäftsführer: Paul Manicle, Halimah DeLaine Prado
>>>> > > Registergericht und -nummer: Hamburg, HRB 86891
>>>> > > Sitz der Gesellschaft: Hamburg
>>>> >
>>>>
>>>>
>>>> --
>>>> Alexander Potapenko
>>>> Software Engineer
>>>>
>>>> Google Germany GmbH
>>>> Erika-Mann-Straße, 33
>>>> 80636 München
>>>>
>>>> Geschäftsführer: Paul Manicle, Halimah DeLaine Prado
>>>> Registergericht und -nummer: Hamburg, HRB 86891
>>>> Sitz der Gesellschaft: Hamburg


-- 
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Halimah DeLaine Prado
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg