[llvm-dev] Understanding and controlling some of the AVX shuffle emission paths

Sun Nov 14 07:52:59 PST 2021

Nicolas - have you investigated just using inline asm instead?

On 11/11/2021 08:34, Wang, Pengfei via llvm-dev wrote:
>
> >As I am very new to this part of LLVM, I am not sure what is feasible
> or not. Would it be conceivable to either:
>
> >1. have a way to inject some numeric cost to influence the value of 
> some resulting combinations?
>
> >2. revive some form of intrinsic and guarantee that the instruction 
> would be generated?
>
> I think a feasible way is to add a new tuningXXX feature for given 
> targets and do something different with the flag in the combine.
>
> 1) seems like overengineering and 2) seems like overkill given the
> potential opportunities in the combine.
>
> Thanks
>
> Phoebe
>
> *From:* llvm-dev <llvm-dev-bounces at lists.llvm.org> *On Behalf Of*
> Nicolas Vasilache via llvm-dev
> *Sent:* Wednesday, November 10, 2021 5:46 PM
> *To:* Diego Caballero <diegocaballero at google.com>
> *Cc:* llvm-dev at lists.llvm.org
> *Subject:* Re: [llvm-dev] Understanding and controlling some of the 
> AVX shuffle emission paths
>
> On Wed, Nov 10, 2021 at 10:30 AM Diego Caballero 
> <diegocaballero at google.com> wrote:
>
>     +Nicolas Vasilache <mailto:ntv at google.com> :)
>
> Thanks Diego, email is hard, I could not find ways to inject myself 
> into my own discussion...
>
>     On Tue, Nov 9, 2021 at 10:32 PM Simon Pilgrim via llvm-dev
>     <llvm-dev at lists.llvm.org> wrote:
>
>         On 09/11/2021 20:44, Simon Pilgrim wrote:
>
>         > On 09/11/2021 08:57, Nicolas Vasilache via llvm-dev wrote:
>         >> Hi everyone,
>         >>
>         >> I am experimenting with LLVM lowering, intrinsics and
>         shufflevector
>         >> in general.
>         >>
>         >> Here is an IR that I produce with the objective of emitting
>         some
>         >> vblendps instructions:
>         >>
>         https://gist.github.com/nicolasvasilache/0fe30c83cbfe5b4776ec9f0ee465611a.
>
>         >>
>         > From what I can see, the original IR code was (effectively):
>         >
>         > 8 x UNPCKLPS/UNPCKHPS
>         > 4 x SHUFPS
>         > 8 x BLENDPS
>         > 4 x INSERTF128
>         > 4 x PERM2F128
>         >
>         >> I compile this further with
>         >>
>         >> clang -x ir -emit-llvm -S -mcpu=haswell -O3 -o - | llc -O3
>         >> -mcpu=haswell - -o -
>         >>
>         >> to obtain:
>         >>
>         >>
>         https://gist.github.com/nicolasvasilache/2c773b86fcda01cc28711828a0a9ce0a
>
>         >>
>         >
>         > and after the x86 shuffle combines:
>         >
>         > 8 x UNPCKLPS/UNPCKHPS
>         > 8 x UNPCKLPD/UNPCKHPD
>         > 4 x INSERTF128
>         > 4 x PERM2F128
>         >
>         > Starting from each BLENDPS, they've combined with the SHUFPS
>         > to create the UNPCK*PD nodes. We nearly always benefit from
>         > folding shuffle chains to reduce total instruction counts,
>         > even if some inner nodes have multiple uses (like the
>         > SHUFPS), and I'd hate to lose that.
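
The fold described above can be checked on paper with a small model of one 128-bit lane. This is only a sketch: the immediates 0x4E, 0xCC and 0x33 are an assumption taken from the commonly used blend-based transpose pattern, not read from the gists, and the helper names are made up for illustration.

```python
# Model one 128-bit lane (4 x f32) as a list of symbolic element names.

def shufps(a, b, imm):
    # Low two results come from a, high two from b; 2 bits of imm select each.
    return [a[imm & 3], a[(imm >> 2) & 3], b[(imm >> 4) & 3], b[(imm >> 6) & 3]]

def blendps(a, b, imm):
    # Bit i of imm set: take element i from b, otherwise from a.
    return [b[i] if (imm >> i) & 1 else a[i] for i in range(4)]

def unpcklpd(a, b):
    # Low 64 bits (two floats) of a, then low 64 bits of b.
    return [a[0], a[1], b[0], b[1]]

def unpckhpd(a, b):
    # High 64 bits (two floats) of a, then high 64 bits of b.
    return [a[2], a[3], b[2], b[3]]

t0 = ["t0.0", "t0.1", "t0.2", "t0.3"]
t2 = ["t2.0", "t2.1", "t2.2", "t2.3"]

# Assumed pattern: one SHUFPS feeding two BLENDPS.
v = shufps(t0, t2, 0x4E)    # [t0.2, t0.3, t2.0, t2.1]
lo = blendps(t0, v, 0xCC)   # elements 2,3 taken from v
hi = blendps(t2, v, 0x33)   # elements 0,1 taken from v

# The three-instruction chain computes exactly what two UNPCK*PD compute.
assert lo == unpcklpd(t0, t2)
assert hi == unpckhpd(t0, t2)
```

Under these assumed immediates, 1 SHUFPS + 2 BLENDPS collapses into 2 UNPCK*PD, which is consistent with the 4 SHUFPS + 8 BLENDPS becoming 8 UNPCKLPD/UNPCKHPD in the lists above.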
>         >
>         >> At this point, I would expect to see some vblendps
>         >> instructions generated for the pieces of IR that produce
>         >> %48/%49 %51/%52 %54/%55 and %57/%58 to reduce pressure on
>         >> port 5 (vblendps can also go on ports 0 and 1). However the
>         >> expected instruction does not get generated and llvm-mca
>         >> continues to show me high port 5 contention.
>         >>
>         >> Could people suggest some steps / commands to help better
>         >> understand why my expectation is not met and whether I can
>         >> do something to make the compiler generate what I want?
>         >> Thanks in advance!
>         > So on Haswell, we've gained 4 extra Port5-only shuffles but
>         > removed the 8 Port015 blends.
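
The arithmetic behind that statement can be tallied from the two instruction lists in this thread. This is a rough sketch under a simplified port model: following the wording above, every shuffle except BLENDPS is treated as port-5-only and BLENDPS as able to issue on ports 0, 1 or 5. The dictionary keys and the `tally` helper are illustrative names only.

```python
# Shuffle counts quoted in the thread, before and after the x86 shuffle combines.
before = {"UNPCKPS": 8, "SHUFPS": 4, "BLENDPS": 8, "INSERTF128": 4, "PERM2F128": 4}
after = {"UNPCKPS": 8, "UNPCKPD": 8, "INSERTF128": 4, "PERM2F128": 4}

# Assumption (simplified Haswell model): only BLENDPS can use ports 0/1/5;
# every other shuffle listed here is restricted to port 5.
PORT015 = {"BLENDPS"}

def tally(counts):
    total = sum(counts.values())
    flexible = sum(n for op, n in counts.items() if op in PORT015)
    return {"total": total, "port5_only": total - flexible, "port015": flexible}

before_tally = tally(before)
after_tally = tally(after)
print(before_tally)  # {'total': 28, 'port5_only': 20, 'port015': 8}
print(after_tally)   # {'total': 24, 'port5_only': 24, 'port015': 0}
```

So the combine lowers the total shuffle count from 28 to 24, but in this model the port-5-only count rises from 20 to 24, which is the extra contention that llvm-mca reports.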
>         >
>         > We have very little in the way of arch-specific shuffle
>         > combines, just the fast-variable-shuffle tuning flags to
>         > avoid unnecessary shuffle mask loads; the shuffle combiner
>         > just aims to reduce the number of simple target shuffle
>         > nodes. And tbh I'm reluctant to add to this, as shuffle
>         > combining is complex already.
>         >
>         > We should be preferring to lower/combine to BLENDPS in more
>         > circumstances (it's commutable and never slower than any
>         > other target shuffle, although demanded elts can do less
>         > with 'undef' elements), but that won't help us here.
>         >
>         > So far I've failed to find a BLEND-based 8x8 transpose
>         pattern that
>         > the shuffle combiner doesn't manage to combine back to the
>         > 8xUNPCK/SHUFPS ops :(
>
> If you are referring to this specific code, yes same for me.
>
> If you are thinking about the general 8x8 transpose problem, I have a 
> version with vector<4xf32> loads that ends up using blends; as 
> expected, the port 5 pressure reduction helps and both llvm-mca and 
> runtime agree that this is 20-30% faster.
>
>
>         The only thing I can think of is you might want to see if you
>         can reorder the INSERTF128/PERM2F128 shuffles in between the
>         UNPCK*PS and the SHUFPS/BLENDPS:
>
>         8 x UNPCKLPS/UNPCKHPS
>         4 x INSERTF128
>         4 x PERM2F128
>         4 x SHUFPS
>         8 x BLENDPS
>
>         Splitting the per-lane shuffles with the subvector-shuffles
>         could help
>         stop the shuffle combiner.
>
> Right, I tried different variations here but invariably got the
> same result.
>
> The vector<4xf32> based version is something that I also want to 
> target for a bunch of orthogonal reasons.
>
> I'll note that my use case is MLIR codegen with explicit vectors and 
> intrinsics -> LLVM so I have quite some flexibility.
>
> But it feels unnatural in the compiler flow to have to branch off at a
> significantly higher level of abstraction to sidestep concerns related
> to X86 microarchitecture details.
>
> As I am very new to this part of LLVM, I am not sure what is feasible
> or not. Would it be conceivable to either:
>
> 1. have a way to inject some numeric cost to influence the value of 
> some resulting combinations?
>
> 2. revive some form of intrinsic and guarantee that the instruction 
> would be generated?
>
> I realize point 2. is contrary to the evolution of LLVM as these 
> intrinsics were removed ca. 2015 in favor of the combiner-based approach.
>
> Still it seems that `we have very little arch-specific shuffle 
> combines` could be the signal that such intrinsics are needed?
>
>
>         >> I have verified independently that in isolation, a single
>         >> such shuffle creates a vblendps. I see them being recombined
>         >> in the produced assembly, and I am looking for ways to keep
>         >> vshufps + vblendps + vblendps from being recombined into
>         >> vunpckxxx + vunpckxxx instructions.
>         >>
>         >> --
>         _______________________________________________
>         LLVM Developers mailing list
>         llvm-dev at lists.llvm.org
>         https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
> -- 
>
> N