[llvm-dev] Byte-wide stores aren't coalesced if interspersed with other stores

Tue Sep 11 08:16:25 PDT 2018

Andres:

FWIW, codegen will do the merge if you turn on global alias analysis for it
with "-combiner-global-alias-analysis". That said, we should be able to do
this merging earlier.

-Nirav

On Mon, Sep 10, 2018 at 8:33 PM, Andres Freund via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> Hi,
>
> On 2018-09-10 13:42:21 -0700, Andres Freund wrote:
> > I have, in postgres, a piece of IR that, after inlining and constant
> > propagation boils (when cooked on really high heat) down to (also
> > attached for your convenience):
> >
> > source_filename = "pg"
> > target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
> > target triple = "x86_64-pc-linux-gnu"
> >
> > define void @evalexpr_0_0(i8* align 8 noalias, i32* align 8 noalias) {
> > entry:
> >   %a01 = getelementptr i8, i8* %0, i16 0
> >   store i8 0, i8* %a01
> >
> >   ; in the real case this also loads data
> >   %b01 = getelementptr i32, i32* %1, i16 0
> >   store i32 0, i32* %b01
> >
> >   %a02 = getelementptr i8, i8* %0, i16 1
> >   store i8 0, i8* %a02
> >
> >   ; in the real case this also loads data
> >   %b02 = getelementptr i32, i32* %1, i16 1
> >   store i32 0, i32* %b02
> >
> >   ; in the real case this also loads data
> >   %a03 = getelementptr i8, i8* %0, i16 2
> >   store i8 0, i8* %a03
> >
> >   ; in the real case this also loads data
> >   %b03 = getelementptr i32, i32* %1, i16 2
> >   store i32 0, i32* %b03
> >
> >   %a04 = getelementptr i8, i8* %0, i16 3
> >   store i8 0, i8* %a04
> >
> >   ; in the real case this also loads data
> >   %b04 = getelementptr i32, i32* %1, i16 3
> >   store i32 0, i32* %b04
> >
> >   ret void
> > }
>
> > So, here we finally come to my question: Is it really expected that,
> > unless largely independent optimizations (SLP in this case) happen to
> > move instructions *within the same basic block* out of the way, these
> > stores don't get coalesced?  And then only if either the
> > optimization pipeline is run again, or if instruction selection can do
> > so?
> >
> >
> > On IRC Roman Lebedev pointed out https://reviews.llvm.org/D48725 which
> > might address this indirectly.  But I'm somewhat doubtful that that's
> > the most straightforward way to optimize this kind of code?
>
> That doesn't help, but it turns out that https://reviews.llvm.org/D30703 can
> kinda somewhat help by adding a redundant
>   %i32ptr = bitcast i8* %0 to i32*
>   store i32 0, i32* %i32ptr
>
> at the start. Then dse-partial-store-merging does its magic and
> optimizes the sub-stores away.  But it's fairly ugly to manually have to
> add superfluous stores in the right granularity (a larger llvm.memset
> doesn't work).
>
> gcc, since 7, detects such cases in its "new" -fstore-merging pass.
>
> - Andres
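
For readers who want to compare against gcc's store merging, the pattern in
the quoted IR can be written in C roughly as below. This is only a sketch
under assumptions: the function name evalexpr and its parameter names are
made up here, `restrict` stands in for the `noalias` attributes, and nothing
in the thread confirms that this exact source yields the IR shown above.

```c
#include <stdint.h>

/*
 * Hypothetical C analogue of the quoted IR: four byte-wide stores to `a`,
 * each followed by a 32-bit store to `b`. `restrict` stands in for the
 * `noalias` attribute on the two IR parameters.
 */
void evalexpr(uint8_t *restrict a, uint32_t *restrict b)
{
    a[0] = 0;   /* store i8 0 at %0 + 0 */
    b[0] = 0;   /* store i32 0 at %1 + 0 */
    a[1] = 0;
    b[1] = 0;
    a[2] = 0;
    b[2] = 0;
    a[3] = 0;
    b[3] = 0;
}
```

Compiling this with optimization and reading the generated assembly (for
example with -O2 -S) should show whether the four byte stores were combined
into a single 32-bit store; the result will depend on compiler and version.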