[llvm-dev] Byte-wide stores aren't coalesced if interspersed with other stores

Tue Sep 11 11:21:16 PDT 2018

Hi,

On 2018-09-11 11:16:25 -0400, Nirav Davé wrote:
> Andres:
> 
> FWIW, codegen will do the merge if you turn on global alias analysis for it
> "-combiner-global-alias-analysis". That said, we should be able to do this
> merging earlier.

Interesting. That does *something* for my real case, but certainly not
as much as I'd expected, or as much as I can get dse-partial-store-merging
to do if I emit a "superfluous" earlier store (one that covers all the
byte-wide stores) that allows it to do its job.

In the case at hand, with a manual 64bit store (this is on a 64bit
target), llvm then combines 8 byte-wide stores into one.

Without -combiner-global-alias-analysis it generates:

        movb    $0, 1(%rdx)
        movl    4(%rsi,%rdi), %ebx
        movq    %rbx, 8(%rcx)
        movb    $0, 2(%rdx)
        movl    8(%rsi,%rdi), %ebx
        movq    %rbx, 16(%rcx)
        movb    $0, 3(%rdx)
        movl    12(%rsi,%rdi), %ebx
        movq    %rbx, 24(%rcx)
        movb    $0, 4(%rdx)
        movq    16(%rsi,%rdi), %rbx
        movq    %rbx, 32(%rcx)
        movb    $0, 5(%rdx)
        movq    24(%rsi,%rdi), %rbx
        movq    %rbx, 40(%rcx)
        movb    $0, 6(%rdx)
        movq    32(%rsi,%rdi), %rbx
        movq    %rbx, 48(%rcx)
        movb    $0, 7(%rdx)
        movq    40(%rsi,%rdi), %rsi

where (%rdx) is the array of 1-byte values, which is guaranteed to be
8-byte aligned and whose stores I hope to get combined.

With -combiner-global-alias-analysis it generates:

	movw	$0, (%rsi)
	movl	(%rcx,%rdi), %ebx
	movq	%rbx, (%rdx)
	movl	4(%rcx,%rdi), %ebx
	movl	8(%rcx,%rdi), %r8d
	movq	%rbx, 8(%rdx)
	movl	$0, 2(%rsi)
	movq	%r8, 16(%rdx)
	movl	12(%rcx,%rdi), %ebx
	movq	%rbx, 24(%rdx)
	movq	16(%rcx,%rdi), %rbx
	movq	%rbx, 32(%rdx)
	movq	24(%rcx,%rdi), %rbx
	movq	%rbx, 40(%rdx)
	movb	$0, 6(%rsi)
	movq	32(%rcx,%rdi), %rbx
	movq	%rbx, 48(%rdx)
	movb	$0, 7(%rsi)

where (%rsi) is the array of 1-byte values.  So it's a 2, 4, 1, 1
byte store. Huh?
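
Spelled out, those four zero stores do at least tile the 8-byte region
exactly once; they are just split at odd boundaries. A small C sketch of
that bookkeeping follows. The struct and function names are made up for
illustration, and the offsets and sizes are read off the movw/movl/movb
lines in the listing above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one store: byte offset and width. */
struct store {
    size_t offset;
    size_t size;
};

/* Returns 1 if the stores write every byte of [0, width) exactly once,
 * 0 otherwise (gap, overlap, or a store reaching past the region). */
int covers_exactly_once(const struct store *stores, size_t count, size_t width)
{
    unsigned char hits[64] = {0};

    assert(width <= sizeof hits);
    for (size_t i = 0; i < count; i++) {
        if (stores[i].offset + stores[i].size > width)
            return 0;
        for (size_t b = 0; b < stores[i].size; b++)
            hits[stores[i].offset + b]++;
    }
    for (size_t b = 0; b < width; b++)
        if (hits[b] != 1)
            return 0;
    return 1;
}

/* movw $0, (%rsi); movl $0, 2(%rsi); movb $0, 6(%rsi); movb $0, 7(%rsi) */
static const struct store merged_stores[] = {
    {0, 2}, {2, 4}, {6, 1}, {7, 1},
};
```

Calling covers_exactly_once(merged_stores, 4, 8) checks the 2, 4, 1, 1
split against the 8-byte array.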

Whereas, if I emit a superfluous 8-byte store beforehand, it becomes:
        movq    $0, (%rsi)
        movl    (%rcx,%rdi), %ebx
        movq    %rbx, (%rdx)
        movl    4(%rcx,%rdi), %ebx
        movq    %rbx, 8(%rdx)
        movl    8(%rcx,%rdi), %ebx
        movq    %rbx, 16(%rdx)
        movl    12(%rcx,%rdi), %ebx
        movq    %rbx, 24(%rdx)
        movq    16(%rcx,%rdi), %rbx
        movq    %rbx, 32(%rdx)
        movq    24(%rcx,%rdi), %rbx
        movq    %rbx, 40(%rdx)
        movq    32(%rcx,%rdi), %rbx
        movq    %rbx, 48(%rdx)
        movq    40(%rcx,%rdi), %rcx

so just a single 8-byte store.
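
For readers without the bitcode at hand, the shape of the two variants
can be sketched in C. This is only an analogue: the function and
parameter names are invented, the real code is generated straight-line
IR rather than a loop, and nothing here guarantees that a given compiler
version merges or removes the byte stores. The memcpy of a fixed 8 bytes
stands in for the manual 64-bit store.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NSLOTS 8

/* Hypothetical stand-in for the generated function: `flags` plays the
 * role of the array of 1-byte values, `dst` and `src` the unrelated
 * 8-byte slots that are written in between. */
void fill_interleaved(uint8_t *restrict flags, uint64_t *restrict dst,
                      const uint64_t *restrict src)
{
    assert(flags != NULL && dst != NULL && src != NULL);
    for (size_t i = 0; i < NSLOTS; i++) {
        flags[i] = 0;     /* byte-wide store ...              */
        dst[i] = src[i];  /* ... separated by a wider store   */
    }
}

/* Same result, but with a redundant wide store emitted up front that
 * covers all eight byte-wide stores. */
void fill_with_wide_store(uint8_t *restrict flags, uint64_t *restrict dst,
                          const uint64_t *restrict src)
{
    const uint64_t zero = 0;

    assert(flags != NULL && dst != NULL && src != NULL);
    memcpy(flags, &zero, sizeof zero);  /* the "superfluous" 8-byte store */
    for (size_t i = 0; i < NSLOTS; i++) {
        flags[i] = 0;     /* now fully covered by the store above */
        dst[i] = src[i];
    }
}
```

Both functions leave memory in the same state; the second only differs
in giving the optimizer an earlier store that the later byte stores
fall entirely within.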

I've attached the two test files (which unfortunately are somewhat
messy):
24703.1.bc - file without "superfluous" store
25256.0.bc - file with "superfluous" store

The workflow I have, emulating the current pipeline, is:

opt -O3 -disable-slp-vectorization -S < /srv/dev/pgdev-dev/25256.0.bc |llc -O3 [-combiner-global-alias-analysis]

Note that the problem can also occur without -disable-slp-vectorization;
it just requires a larger example.

Greetings,

Andres Freund

> -Nirav
> 
> 
> On Mon, Sep 10, 2018 at 8:33 PM, Andres Freund via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
> 
> > Hi,
> >
> > On 2018-09-10 13:42:21 -0700, Andres Freund wrote:
> > > I have, in postgres, a piece of IR that, after inlining and constant
> > > propagation boils (when cooked on really high heat) down to (also
> > > attached for your convenience):
> > >
> > > source_filename = "pg"
> > > target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
> > > target triple = "x86_64-pc-linux-gnu"
> > >
> > > define void @evalexpr_0_0(i8* align 8 noalias, i32* align 8 noalias) {
> > > entry:
> > >   %a01 = getelementptr i8, i8* %0, i16 0
> > >   store i8 0, i8* %a01
> > >
> > >   ; in the real case this also loads data
> > >   %b01 = getelementptr i32, i32* %1, i16 0
> > >   store i32 0, i32* %b01
> > >
> > >   %a02 = getelementptr i8, i8* %0, i16 1
> > >   store i8 0, i8* %a02
> > >
> > >   ; in the real case this also loads data
> > >   %b02 = getelementptr i32, i32* %1, i16 1
> > >   store i32 0, i32* %b02
> > >
> > >   ; in the real case this also loads data
> > >   %a03 = getelementptr i8, i8* %0, i16 2
> > >   store i8 0, i8* %a03
> > >
> > >   ; in the real case this also loads data
> > >   %b03 = getelementptr i32, i32* %1, i16 2
> > >   store i32 0, i32* %b03
> > >
> > >   %a04 = getelementptr i8, i8* %0, i16 3
> > >   store i8 0, i8* %a04
> > >
> > >   ; in the real case this also loads data
> > >   %b04 = getelementptr i32, i32* %1, i16 3
> > >   store i32 0, i32* %b04
> > >
> > >   ret void
> > > }
> >
> > > So, here we finally come to my question: Is it really expected that,
> > > unless largely independent optimizations (SLP in this case) happen to
> > > move instructions *within the same basic block* out of the way, these
> > > stores don't get coalesced?  And then only if either the
> > > optimization pipeline is run again, or if instruction selection can do
> > > so?
> > >
> > >
> > > On IRC Roman Lebedev pointed out https://reviews.llvm.org/D48725 which
> > > might address this indirectly.  But I'm somewhat doubtful that that's
> > > the most straightforward way to optimize this kind of code?
> >
> > That doesn't help, but it turns out that https://reviews.llvm.org/D30703 can
> > kinda somewhat help by adding a redundant
> >   %i32ptr = bitcast i8* %0 to i32*
> >   store i32 0, i32* %i32ptr
> >
> > at the start. Then dse-partial-store-merging does its magic and
> > optimizes the sub-stores away.  But it's fairly ugly to manually have to
> > add superfluous stores in the right granularity (a larger llvm.memset
> > doesn't work).
> >
> > gcc, since 7, detects such cases in its "new" -fstore-merging pass.
> >
> > - Andres
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 24703.1.bc
Type: application/octet-stream
Size: 12852 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180911/54fbc469/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 25256.0.bc
Type: application/octet-stream
Size: 12324 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180911/54fbc469/attachment-0003.obj>