[llvm-dev] Understanding and controlling some of the AVX shuffle emission paths

Wed Nov 10 01:30:09 PST 2021

+Nicolas Vasilache <ntv at google.com> :)

On Tue, Nov 9, 2021 at 10:32 PM Simon Pilgrim via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> On 09/11/2021 20:44, Simon Pilgrim wrote:
>
> > On 09/11/2021 08:57, Nicolas Vasilache via llvm-dev wrote:
> >> Hi everyone,
> >>
> >> I am experimenting with LLVM lowering, intrinsics and shufflevector
> >> in general.
> >>
> >> Here is an IR that I produce with the objective of emitting some
> >> vblendps instructions:
> >>
> >> https://gist.github.com/nicolasvasilache/0fe30c83cbfe5b4776ec9f0ee465611a
> >>
> > From what I can see, the original IR code was (effectively):
> >
> > 8 x UNPCKLPS/UNPCKHPS
> > 4 x SHUFPS
> > 8 x BLENDPS
> > 4 x INSERTF128
> > 4 x PERM2F128
> >
> >> I compile this further with
> >>
> >> clang -x ir -emit-llvm -S -mcpu=haswell -O3 -o - | llc -O3
> >> -mcpu=haswell - -o -
> >>
> >> to obtain:
> >>
> >>
> >> https://gist.github.com/nicolasvasilache/2c773b86fcda01cc28711828a0a9ce0a
> >>
> >
> > and after the x86 shuffle combines:
> >
> > 8 x UNPCKLPS/UNPCKHPS
> > 8 x UNPCKLPD/UNPCKHPD
> > 4 x INSERTF128
> > 4 x PERM2F128
> >
> > Starting from each BLENDPS, they've combined with the SHUFPS to create
> > the UNPCK*PD nodes. We nearly always benefit from folding shuffle
> > chains to reduce total instruction counts, even if some inner nodes
> > have multiple uses (like the SHUFPS), and I'd hate to lose that.
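
[Editor's sketch] The fold described above can be checked with a small
Python model of the instruction semantics. The immediates 0x4E, 0xCC and
0x33 are assumptions taken from the usual blend-based transpose idiom, not
read from the gist; the helper names are made up for illustration.

```python
# Model a 256-bit AVX register as a list of 8 floats; lanes are [0:4] and [4:8].

def per_lane(f, a, b):
    """Apply a 4-element function f to each 128-bit lane of a and b."""
    return f(a[:4], b[:4]) + f(a[4:], b[4:])

def shufps(a, b, imm):
    # Per lane: two elements picked from a, then two from b, 2 imm bits each.
    return per_lane(lambda x, y: [x[imm & 3], x[(imm >> 2) & 3],
                                  y[(imm >> 4) & 3], y[(imm >> 6) & 3]], a, b)

def blendps(a, b, imm):
    # Bit i of the immediate selects b[i], otherwise a[i].
    return [b[i] if (imm >> i) & 1 else a[i] for i in range(8)]

def unpcklpd(a, b):
    # Low 64 bits (two floats) of each lane of a, then of b.
    return per_lane(lambda x, y: x[:2] + y[:2], a, b)

def unpckhpd(a, b):
    # High 64 bits (two floats) of each lane of a, then of b.
    return per_lane(lambda x, y: x[2:] + y[2:], a, b)

t0 = [float(i) for i in range(8)]
t2 = [float(i) for i in range(100, 108)]

v = shufps(t0, t2, 0x4E)    # per lane: [t0[2], t0[3], t2[0], t2[1]]
lo = blendps(t0, v, 0xCC)   # keep t0 in elements 0-1, take v in elements 2-3
hi = blendps(t2, v, 0x33)   # take v in elements 0-1, keep t2 in elements 2-3

# One SHUFPS + two BLENDPS compute exactly what two UNPCK*PD compute.
assert lo == unpcklpd(t0, t2)
assert hi == unpckhpd(t0, t2)
```

Since each SHUFPS feeds two blends, 4 SHUFPS + 8 BLENDPS become
8 UNPCK*PD, which matches the counts listed above.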
> >
> >> At this point, I would expect to see some vblendps instructions
> >> generated for the pieces of IR that produce %48/%49 %51/%52 %54/%55
> >> and %57/%58 to reduce pressure on port 5 (vblendps can also go on
> >> ports 0 and 1). However the expected instruction does not get
> >> generated and llvm-mca continues to show me high port 5 contention.
> >>
> >> Could people suggest some steps / commands to help better understand
> >> why my expectation is not met and whether I can do something to make
> >> the compiler generate what I want? Thanks in advance!
> > So on Haswell, we've gained 4 extra Port5-only shuffles but removed
> > the 8 Port015 blends.
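
[Editor's sketch] The port accounting can be tallied as below. That BLENDPS
can issue on ports 0/1/5 is stated in the thread; treating every other op
in the two sequences as port-5-only (register forms on Haswell) is an
assumption of this sketch.

```python
# Ports each op can issue on. Only the BLENDPS entry comes from the thread;
# the port-5-only entries are an assumption of this sketch.
P5, P015 = frozenset({5}), frozenset({0, 1, 5})
PORTS = {
    "UNPCK*PS": P5, "UNPCK*PD": P5, "SHUFPS": P5,
    "INSERTF128": P5, "PERM2F128": P5, "BLENDPS": P015,
}

# Instruction counts as listed in the thread.
before = {"UNPCK*PS": 8, "SHUFPS": 4, "BLENDPS": 8,
          "INSERTF128": 4, "PERM2F128": 4}
after = {"UNPCK*PS": 8, "UNPCK*PD": 8, "INSERTF128": 4, "PERM2F128": 4}

def count(seq, ports):
    """Number of instructions in seq restricted to exactly this port set."""
    return sum(n for op, n in seq.items() if PORTS[op] == ports)

for name, seq in (("before", before), ("after", after)):
    print(name, "total:", sum(seq.values()),
          "port5-only:", count(seq, P5), "port015:", count(seq, P015))
```

Under those assumptions the combine trades 28 instructions (20 port-5-only
plus 8 blends) for 24 instructions that all need port 5.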
> >
> > We have very few arch-specific shuffle combines, just the
> > fast-variable-shuffle tuning flags to avoid unnecessary shuffle mask
> > loads; the shuffle combiner just aims to reduce the number of simple
> > target shuffle nodes. And tbh I'm reluctant to add to this as shuffle
> > combining is complex already.
> >
> > We should be preferring to lower/combine to BLENDPS in more
> > circumstances (it's commutable and never slower than any other target
> > shuffle, although demanded elts can do less with 'undef' elements),
> > but that won't help us here.
> >
> > So far I've failed to find a BLEND-based 8x8 transpose pattern that
> > the shuffle combiner doesn't manage to combine back to the
> > 8xUNPCK/SHUFPS ops :(
>
> The only thing I can think of is you might want to see if you can
> reorder the INSERTF128/PERM2F128 shuffles in between the UNPCK*PS and
> the SHUFPS/BLENDPS:
>
> 8 x UNPCKLPS/UNPCKHPS
> 4 x INSERTF128
> 4 x PERM2F128
> 4 x SHUFPS
> 8 x BLENDPS
>
> Separating the per-lane shuffles with the subvector shuffles could help
> stop the shuffle combiner from folding them back together.
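
[Editor's sketch] One possible arrangement of the reordered sequence,
modelled in Python to check that the data movement is still a correct 8x8
transpose. The pairing of registers and the immediates (0x4E, 0xCC, 0x33,
and the 0x20/0x31 lane selections) are assumptions, not the IR from the
gist, and this says nothing about whether the shuffle combiner leaves the
sequence alone.

```python
def per_lane(f, a, b):
    """Apply a 4-element function f to each 128-bit lane of a and b."""
    return f(a[:4], b[:4]) + f(a[4:], b[4:])

def unpcklps(a, b):
    return per_lane(lambda x, y: [x[0], y[0], x[1], y[1]], a, b)

def unpckhps(a, b):
    return per_lane(lambda x, y: [x[2], y[2], x[3], y[3]], a, b)

def shufps(a, b, imm):
    return per_lane(lambda x, y: [x[imm & 3], x[(imm >> 2) & 3],
                                  y[(imm >> 4) & 3], y[(imm >> 6) & 3]], a, b)

def blendps(a, b, imm):
    return [b[i] if (imm >> i) & 1 else a[i] for i in range(8)]

def lanes_lo(a, b):
    # INSERTF128 (or PERM2F128 0x20): low lane of a, then low lane of b.
    return a[:4] + b[:4]

def lanes_hi(a, b):
    # PERM2F128 0x31: high lane of a, then high lane of b.
    return a[4:] + b[4:]

def transpose8x8(r):
    # 1) 8 x UNPCKLPS/UNPCKHPS on adjacent row pairs.
    t = []
    for i in range(0, 8, 2):
        t += [unpcklps(r[i], r[i + 1]), unpckhps(r[i], r[i + 1])]
    # 2) 4 x INSERTF128 + 4 x PERM2F128: pair rows 0-3 data with rows 4-7 data.
    lo = [lanes_lo(t[k], t[k + 4]) for k in range(4)]
    hi = [lanes_hi(t[k], t[k + 4]) for k in range(4)]
    # 3) 4 x SHUFPS + 8 x BLENDPS, all in-lane.
    out = []
    for half in (lo, hi):
        for a, b in ((half[0], half[2]), (half[1], half[3])):
            v = shufps(a, b, 0x4E)
            out.append(blendps(a, v, 0xCC))
            out.append(blendps(b, v, 0x33))
    return out

rows = [[float(8 * i + j) for j in range(8)] for i in range(8)]
assert transpose8x8(rows) == [list(col) for col in zip(*rows)]
```

The instruction counts in this model match the list above: 8 unpacks,
4 + 4 subvector shuffles, 4 SHUFPS and 8 BLENDPS.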
>
> >> I have verified independently that in isolation, a single such
> >> shuffle creates a vblendps. I see them being recombined in the
> >> produced assembly, and I am looking for ways to prevent
> >> vshufps + vblendps + vblendps from being recombined into vunpckxxx +
> >> vunpckxxx instructions.
> >>