[llvm-dev] Byte-wide stores aren't coalesced if interspersed with other stores
Andres Freund via llvm-dev
llvm-dev at lists.llvm.org
Tue Sep 11 11:21:16 PDT 2018
Hi,
On 2018-09-11 11:16:25 -0400, Nirav Davé wrote:
> Andres:
>
> FWIW, codegen will do the merge if you turn on global alias analysis for it
> "-combiner-global-alias-analysis". That said, we should be able to do this
> merging earlier.
Interesting. That does *something* for my real case, but certainly not
as much as I'd expected, or as much as I can get dse-partial-store-merging
to do if I emit a "superfluous" earlier store (one that encompasses all
of the byte-wide stores) that allows it to do its job.
In the case at hand, with a manual 64-bit store (this is on a 64-bit
target), LLVM then combines the 8 byte-wide stores into one.
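To make the store pattern concrete, here is a minimal sketch in C. It is only an illustration: the real code emits IR directly, the names (clear_interleaved, clear_with_wide_store, flags, values, src) are made up, and a C compiler is not guaranteed to produce the same IR or assembly as shown in this mail. The restrict qualifiers stand in for the noalias attributes on the IR arguments.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names; this mirrors the shape of the problem, not the real code. */

/* Byte-wide stores to flags, interleaved with stores to a second,
 * non-aliasing array. */
void clear_interleaved(uint8_t *restrict flags, uint64_t *restrict values,
                       const uint32_t *restrict src)
{
    for (int i = 0; i < 8; i++) {
        flags[i] = 0;        /* 1-byte store */
        values[i] = src[i];  /* unrelated load plus 8-byte store in between */
    }
}

/* Same result, but a redundant 8-byte store covering flags[0..7] comes first. */
void clear_with_wide_store(uint8_t *restrict flags, uint64_t *restrict values,
                           const uint32_t *restrict src)
{
    const uint64_t zero = 0;

    assert((uintptr_t)flags % 8 == 0);  /* flags is assumed 8-byte aligned */
    memcpy(flags, &zero, sizeof zero);  /* the "superfluous" wide store */
    for (int i = 0; i < 8; i++) {
        flags[i] = 0;        /* fully covered by the wide store above */
        values[i] = src[i];
    }
}
```

Both functions leave memory in the same state; the second merely gives the optimizer one wide store that already covers every byte-wide store.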
Without -combiner-global-alias-analysis it generates:
movb $0, 1(%rdx)
movl 4(%rsi,%rdi), %ebx
movq %rbx, 8(%rcx)
movb $0, 2(%rdx)
movl 8(%rsi,%rdi), %ebx
movq %rbx, 16(%rcx)
movb $0, 3(%rdx)
movl 12(%rsi,%rdi), %ebx
movq %rbx, 24(%rcx)
movb $0, 4(%rdx)
movq 16(%rsi,%rdi), %rbx
movq %rbx, 32(%rcx)
movb $0, 5(%rdx)
movq 24(%rsi,%rdi), %rbx
movq %rbx, 40(%rcx)
movb $0, 6(%rdx)
movq 32(%rsi,%rdi), %rbx
movq %rbx, 48(%rcx)
movb $0, 7(%rdx)
movq 40(%rsi,%rdi), %rsi
where (%rdx) is the array of 1-byte values, for which I hope to get the
stores combined, and which is guaranteed to be 8-byte aligned.
With -combiner-global-alias-analysis it generates:
movw $0, (%rsi)
movl (%rcx,%rdi), %ebx
movq %rbx, (%rdx)
movl 4(%rcx,%rdi), %ebx
movl 8(%rcx,%rdi), %r8d
movq %rbx, 8(%rdx)
movl $0, 2(%rsi)
movq %r8, 16(%rdx)
movl 12(%rcx,%rdi), %ebx
movq %rbx, 24(%rdx)
movq 16(%rcx,%rdi), %rbx
movq %rbx, 32(%rdx)
movq 24(%rcx,%rdi), %rbx
movq %rbx, 40(%rdx)
movb $0, 6(%rsi)
movq 32(%rcx,%rdi), %rbx
movq %rbx, 48(%rdx)
movb $0, 7(%rsi)
where (%rsi) is the array of 1-byte values. So it's a 2-, 4-, 1- and
1-byte store. Huh?
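Spelled out, the four stores in the listing above are a movw at offset 0 (2 bytes), a movl at offset 2 (4 bytes), and a movb each at offsets 6 and 7. The following C sketch (struct and function names are made up for illustration) checks that these spans together cover all eight bytes, which is why a single 8-byte store would have been equivalent:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: one store, described by byte offset and width. */
struct store_span {
    unsigned offset;
    unsigned width;
};

/* Returns a bitmask (bit i set = byte i written) of the bytes touched by
 * n stores inside an 8-byte window. */
uint8_t covered_bytes(const struct store_span *s, size_t n)
{
    uint8_t mask = 0;

    for (size_t i = 0; i < n; i++) {
        assert(s[i].offset + s[i].width <= 8);  /* must stay in the window */
        for (unsigned b = 0; b < s[i].width; b++)
            mask |= (uint8_t)(1u << (s[i].offset + b));
    }
    return mask;
}
```

Feeding it the spans {0,2}, {2,4}, {6,1}, {7,1} yields a mask with all eight bits set.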
Whereas, if I emit a superfluous 8-byte store beforehand, it becomes:
movq $0, (%rsi)
movl (%rcx,%rdi), %ebx
movq %rbx, (%rdx)
movl 4(%rcx,%rdi), %ebx
movq %rbx, 8(%rdx)
movl 8(%rcx,%rdi), %ebx
movq %rbx, 16(%rdx)
movl 12(%rcx,%rdi), %ebx
movq %rbx, 24(%rdx)
movq 16(%rcx,%rdi), %rbx
movq %rbx, 32(%rdx)
movq 24(%rcx,%rdi), %rbx
movq %rbx, 40(%rdx)
movq 32(%rcx,%rdi), %rbx
movq %rbx, 48(%rdx)
movq 40(%rcx,%rdi), %rcx
so just a single 8-byte store.
I've attached the two test files (which unfortunately are somewhat
messy):
24703.1.bc - file without "superfluous" store
25256.0.bc - file with "superfluous" store
The workflow I have, emulating the current pipeline, is:
opt -O3 -disable-slp-vectorization -S < /srv/dev/pgdev-dev/25256.0.bc |llc -O3 [-combiner-global-alias-analysis]
Note that the problem can also occur without -disable-slp-vectorization;
it just requires a larger example.
Greetings,
Andres Freund
> -Nirav
>
>
> On Mon, Sep 10, 2018 at 8:33 PM, Andres Freund via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> > Hi,
> >
> > On 2018-09-10 13:42:21 -0700, Andres Freund wrote:
> > > I have, in postgres, a piece of IR that, after inlining and constant
> > > propagation boils (when cooked on really high heat) down to (also
> > > attached for your convenience):
> > >
> > > source_filename = "pg"
> > > target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
> > > target triple = "x86_64-pc-linux-gnu"
> > >
> > > define void @evalexpr_0_0(i8* align 8 noalias, i32* align 8 noalias) {
> > > entry:
> > > %a01 = getelementptr i8, i8* %0, i16 0
> > > store i8 0, i8* %a01
> > >
> > > ; in the real case this also loads data
> > > %b01 = getelementptr i32, i32* %1, i16 0
> > > store i32 0, i32* %b01
> > >
> > > %a02 = getelementptr i8, i8* %0, i16 1
> > > store i8 0, i8* %a02
> > >
> > > ; in the real case this also loads data
> > > %b02 = getelementptr i32, i32* %1, i16 1
> > > store i32 0, i32* %b02
> > >
> > > ; in the real case this also loads data
> > > %a03 = getelementptr i8, i8* %0, i16 2
> > > store i8 0, i8* %a03
> > >
> > > ; in the real case this also loads data
> > > %b03 = getelementptr i32, i32* %1, i16 2
> > > store i32 0, i32* %b03
> > >
> > > %a04 = getelementptr i8, i8* %0, i16 3
> > > store i8 0, i8* %a04
> > >
> > > ; in the real case this also loads data
> > > %b04 = getelementptr i32, i32* %1, i16 3
> > > store i32 0, i32* %b04
> > >
> > > ret void
> > > }
> >
> > > So, here we finally come to my question: Is it really expected that,
> > > unless largely independent optimizations (SLP in this case) happen to
> > > move instructions *within the same basic block* out of the way, these
> > > stores don't get coalesced? And then only if either the
> > > optimization pipeline is run again, or if instruction selection can do
> > > so?
> > >
> > >
> > > On IRC Roman Lebedev pointed out https://reviews.llvm.org/D48725 which
> > > might address this indirectly. But I'm somewhat doubtful that that's
> > > the most straightforward way to optimize this kind of code?
> >
> > That doesn't help, but it turns out that https://reviews.llvm.org/D30703 can
> > somewhat help by adding a redundant
> > %i32ptr = bitcast i8* %0 to i32*
> > store i32 0, i32* %i32ptr
> >
> > at the start. Then dse-partial-store-merging does its magic and
> > optimizes the sub-stores away. But it's fairly ugly to manually have to
> > add superfluous stores in the right granularity (a larger llvm.memset
> > doesn't work).
> >
> > gcc, since 7, detects such cases in its "new" -fstore-merging pass.
> >
> > - Andres
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 24703.1.bc
Type: application/octet-stream
Size: 12852 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180911/54fbc469/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 25256.0.bc
Type: application/octet-stream
Size: 12324 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180911/54fbc469/attachment-0003.obj>