[llvm-dev] Byte-wide stores aren't coalesced if interspersed with other stores

Andres Freund via llvm-dev llvm-dev at lists.llvm.org
Mon Sep 10 17:33:32 PDT 2018


Hi,

On 2018-09-10 13:42:21 -0700, Andres Freund wrote:
> I have, in postgres, a piece of IR that, after inlining and constant
> propagation, boils (when cooked on really high heat) down to (also
> attached for your convenience):
> 
> source_filename = "pg"
> target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
> target triple = "x86_64-pc-linux-gnu"
> 
> define void @evalexpr_0_0(i8* align 8 noalias, i32* align 8 noalias) {
> entry:
>   %a01 = getelementptr i8, i8* %0, i16 0
>   store i8 0, i8* %a01
> 
>   ; in the real case this also loads data
>   %b01 = getelementptr i32, i32* %1, i16 0
>   store i32 0, i32* %b01
> 
>   %a02 = getelementptr i8, i8* %0, i16 1
>   store i8 0, i8* %a02
> 
>   ; in the real case this also loads data
>   %b02 = getelementptr i32, i32* %1, i16 1
>   store i32 0, i32* %b02
> 
>   ; in the real case this also loads data
>   %a03 = getelementptr i8, i8* %0, i16 2
>   store i8 0, i8* %a03
> 
>   ; in the real case this also loads data
>   %b03 = getelementptr i32, i32* %1, i16 2
>   store i32 0, i32* %b03
> 
>   %a04 = getelementptr i8, i8* %0, i16 3
>   store i8 0, i8* %a04
> 
>   ; in the real case this also loads data
>   %b04 = getelementptr i32, i32* %1, i16 3
>   store i32 0, i32* %b04
> 
>   ret void
> }

> So, here we finally come to my question: Is it really expected that,
> unless largely independent optimizations (SLP in this case) happen to
> move instructions *within the same basic block* out of the way, these
> stores don't get coalesced?  And then only if either the
> optimization pipeline is run again, or if instruction selection can do
> so?
> 
> 
> On IRC Roman Lebedev pointed out https://reviews.llvm.org/D48725 which
> might address this indirectly.  But I'm somewhat doubtful that that's
> the most straightforward way to optimize this kind of code?

That doesn't help, but it turns out that https://reviews.llvm.org/D30703 can
somewhat help by adding a redundant
  %i32ptr = bitcast i8* %0 to i32*
  store i32 0, i32* %i32ptr

at the start. Then dse-partial-store-merging does its magic and
optimizes the sub-stores away.  But it's fairly ugly to manually have to
add superfluous stores at the right granularity (a larger llvm.memset
doesn't work).
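
For concreteness, here is a sketch of the reduced function with that
redundant store prepended. Everything except the two %i32ptr lines is
copied from the test case above (module header omitted), and the
comments are mine. I'm assuming that the wide store has to cover
exactly the four bytes that the later i8 stores write, so take it as an
illustration of the workaround rather than a recommended pattern.

```llvm
define void @evalexpr_0_0(i8* align 8 noalias, i32* align 8 noalias) {
entry:
  ; redundant wide store covering bytes 0..3 behind %0
  %i32ptr = bitcast i8* %0 to i32*
  store i32 0, i32* %i32ptr

  ; original byte-wide stores, still interleaved with the i32 stores
  %a01 = getelementptr i8, i8* %0, i16 0
  store i8 0, i8* %a01
  %b01 = getelementptr i32, i32* %1, i16 0
  store i32 0, i32* %b01

  %a02 = getelementptr i8, i8* %0, i16 1
  store i8 0, i8* %a02
  %b02 = getelementptr i32, i32* %1, i16 1
  store i32 0, i32* %b02

  %a03 = getelementptr i8, i8* %0, i16 2
  store i8 0, i8* %a03
  %b03 = getelementptr i32, i32* %1, i16 2
  store i32 0, i32* %b03

  %a04 = getelementptr i8, i8* %0, i16 3
  store i8 0, i8* %a04
  %b04 = getelementptr i32, i32* %1, i16 3
  store i32 0, i32* %b04

  ret void
}
```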

gcc, since 7, detects such cases in its "new" -fstore-merging pass.
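
To spell out what I'd hope a comparable merging step would produce for
the reduced test case, here is a hand-written sketch. It is not output
from any pass. The function name @evalexpr_0_0_merged and the value
name %a are made up for illustration, and the align 8 on the merged
store is taken from the argument attribute.

```llvm
define void @evalexpr_0_0_merged(i8* align 8 noalias, i32* align 8 noalias) {
entry:
  ; the four i8 stores to offsets 0..3 of %0, merged into one i32 store
  %a = bitcast i8* %0 to i32*
  store i32 0, i32* %a, align 8

  ; the i32 stores through %1 are unchanged
  %b01 = getelementptr i32, i32* %1, i16 0
  store i32 0, i32* %b01
  %b02 = getelementptr i32, i32* %1, i16 1
  store i32 0, i32* %b02
  %b03 = getelementptr i32, i32* %1, i16 2
  store i32 0, i32* %b03
  %b04 = getelementptr i32, i32* %1, i16 3
  store i32 0, i32* %b04

  ret void
}
```

In the real case, where other values are loaded and stored in between,
it is the noalias on both arguments that should make moving the byte
stores past the i32 stores legal.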

- Andres

