[llvm-dev] Understanding and controlling some of the AVX shuffle emission paths

Wed Nov 10 01:46:03 PST 2021

On Wed, Nov 10, 2021 at 10:30 AM Diego Caballero <diegocaballero at google.com>
wrote:

> +Nicolas Vasilache <ntv at google.com> :)
>

Thanks Diego. Email is hard: I could not find a way to inject myself into my
own discussion...

>
> On Tue, Nov 9, 2021 at 10:32 PM Simon Pilgrim via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> On 09/11/2021 20:44, Simon Pilgrim wrote:
>>
>> > On 09/11/2021 08:57, Nicolas Vasilache via llvm-dev wrote:
>> >> Hi everyone,
>> >>
>> >> I am experimenting with LLVM lowering, intrinsics and shufflevector
>> >> in general.
>> >>
>> >> Here is an IR that I produce with the objective of emitting some
>> >> vblendps instructions:
>> >>
>> https://gist.github.com/nicolasvasilache/0fe30c83cbfe5b4776ec9f0ee465611a.
>>
>> >>
>> > From what I can see, the original IR code was (effectively):
>> >
>> > 8 x UNPCKLPS/UNPCKHPS
>> > 4 x SHUFPS
>> > 8 x BLENDPS
>> > 4 x INSERTF128
>> > 4 x PERM2F128
>> >
>> >> I compile this further with
>> >>
>> >> clang -x ir -emit-llvm -S -mcpu=haswell -O3 -o - | llc -O3
>> >> -mcpu=haswell - -o -
>> >>
>> >> to obtain:
>> >>
>> >>
>> https://gist.github.com/nicolasvasilache/2c773b86fcda01cc28711828a0a9ce0a
>> >>
>> >
>> > and after the x86 shuffle combines:
>> >
>> > 8 x UNPCKLPS/UNPCKHPS
>> > 8 x UNPCKLPD/UNPCKHPD
>> > 4 x INSERTF128
>> > 4 x PERM2F128
>> >
>> > Starting from each BLENDPS, they've combined with the SHUFPS to create
>> > the UNPCK*PD nodes. We nearly always benefit from folding shuffle
>> > chains to reduce total instruction counts, even if some inner nodes
>> > have multiple uses (like the SHUFPS), and I'd hate to lose that.
>> >
>> >> At this point, I would expect to see some vblendps instructions
>> >> generated for the pieces of IR that produce %48/%49 %51/%52 %54/%55
>> >> and %57/%58 to reduce pressure on port 5 (vblendps can also go on
>> >> ports 0 and 1). However the expected instruction does not get
>> >> generated and llvm-mca continues to show me high port 5 contention.
>> >>
>> >> Could people suggest some steps / commands to help better understand
>> >> why my expectation is not met and whether I can do something to make
>> >> the compiler generate what I want? Thanks in advance!
>> >
>> > So on Haswell, we've gained 4 extra Port5-only shuffles but removed
>> > the 8 Port015 blends.
>> >
>> > We have very little arch-specific shuffle combines, just the
>> > fast-variable-shuffle tuning flags to avoid unnecessary shuffle mask
>> > loads; the shuffle combiner just aims to reduce the number of simple
>> > target shuffle nodes. And tbh I'm reluctant to add to this as shuffle
>> > combining is complex already.
>> >
>> > We should be preferring to lower/combine to BLENDPS in more
>> > circumstances (it's commutable and never slower than any other target
>> > shuffle, although demanded elts can do less with 'undef' elements),
>> > but that won't help us here.
>> >
>> > So far I've failed to find a BLEND-based 8x8 transpose pattern that
>> > the shuffle combiner doesn't manage to combine back to the
>> > 8xUNPCK/SHUFPS ops :(
>>
>
If you are referring to this specific code, yes, same for me.
If you are thinking about the general 8x8 transpose problem, I have a
version with vector<4xf32> loads that ends up using blends; as expected,
the port 5 pressure reduction helps, and both llvm-mca and runtime
measurements agree that this is 20-30% faster.
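
To make the port accounting in Simon's summary concrete, here is a small
Python tally of the two instruction mixes listed above. It is only a sketch:
the counts are copied from the thread, and it assumes (as the thread does)
that on Haswell every op listed except BLENDPS issues on port 5 only, while
BLENDPS can issue on ports 0, 1 and 5. The names `tally` and `P015_OPS` are
made up for this illustration.

```python
# Instruction mixes per 8x8 transpose, as summarised in the thread.
before = {"UNPCKLPS/UNPCKHPS": 8, "SHUFPS": 4, "BLENDPS": 8,
          "INSERTF128": 4, "PERM2F128": 4}
after = {"UNPCKLPS/UNPCKHPS": 8, "UNPCKLPD/UNPCKHPD": 8,
         "INSERTF128": 4, "PERM2F128": 4}

# Assumption: only BLENDPS can issue on ports 0/1/5; the rest are port 5 only.
P015_OPS = {"BLENDPS"}

def tally(mix):
    """Split an instruction mix into port-5-only ops and port-0/1/5 ops."""
    total = sum(mix.values())
    p015 = sum(n for op, n in mix.items() if op in P015_OPS)
    return {"total": total, "port5_only": total - p015, "port015": p015}

before_tally = tally(before)
after_tally = tally(after)

print("before combines:", before_tally)
print("after combines: ", after_tally)
print("extra port-5-only ops:",
      after_tally["port5_only"] - before_tally["port5_only"])
```

Under these assumptions the combiner lowers the total instruction count while
raising the number of port-5-only shuffles, which is the trade-off described
above.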

>
>> The only thing I can think of is you might want to see if you can
>> reorder the INSERTF128/PERM2F128 shuffles in between the UNPCK*PS and
>> the SHUFPS/BLENDPS:
>>
>> 8 x UNPCKLPS/UNPCKHPS
>> 4 x INSERTF128
>> 4 x PERM2F128
>> 4 x SHUFPS
>> 8 x BLENDPS
>>
>> Splitting the per-lane shuffles with the subvector-shuffles could help
>> stop the shuffle combiner.
>>
>
Right, I tried different variations here but invariably get the same
result.
The vector<4xf32>-based version is something that I also want to target for
a bunch of orthogonal reasons.
I'll note that my use case is MLIR codegen with explicit vectors and
intrinsics -> LLVM, so I have quite some flexibility.
But it feels unnatural in the compiler flow to have to branch off at a
significantly higher level of abstraction to sidestep concerns related to
x86 microarchitecture details.
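
For concreteness, the recombination discussed in this thread (one vshufps
plus two vblendps turning into two vunpck*pd) can be modelled on a single
128-bit lane in Python. This is a sketch under assumptions: the immediates
0x4E, 0xC and 0x3 are the ones from the usual 8x8 transpose recipe, not
values read out of the gists, and the helper functions only model the
element selection of the corresponding instructions.

```python
def shufps(a, b, imm):
    """One lane of SHUFPS: low two results come from a, high two from b."""
    return [a[imm & 3], a[(imm >> 2) & 3],
            b[(imm >> 4) & 3], b[(imm >> 6) & 3]]

def blendps(a, b, imm):
    """One lane of BLENDPS: element i comes from b if bit i of imm is set."""
    return [b[i] if (imm >> i) & 1 else a[i] for i in range(4)]

def unpcklpd(a, b):
    """One lane of UNPCKLPD: low 64 bits of a, then low 64 bits of b."""
    return [a[0], a[1], b[0], b[1]]

def unpckhpd(a, b):
    """One lane of UNPCKHPD: high 64 bits of a, then high 64 bits of b."""
    return [a[2], a[3], b[2], b[3]]

# Symbolic inputs, e.g. the results of UNPCKLPS on rows (a, b) and (c, d).
t0 = ["a0", "b0", "a1", "b1"]
t2 = ["c0", "d0", "c1", "d1"]

v = shufps(t0, t2, 0x4E)   # selects t0[2], t0[3], t2[0], t2[1]
lo = blendps(t0, v, 0xC)   # keep t0[0:2], take v[2:4]
hi = blendps(t2, v, 0x3)   # take v[0:2], keep t2[2:4]

# The three-instruction sequence computes exactly what two UNPCK*PD compute.
folds_to_unpck = lo == unpcklpd(t0, t2) and hi == unpckhpd(t0, t2)
print(lo, hi, folds_to_unpck)
```

Since the blend-based form and the unpck-based form are the same shuffle on
the same inputs, reordering the surrounding ops alone would not be expected
to stop a combiner that reasons on the composed masks.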

As I am very new to this part of LLVM, I am not sure what is feasible and
what is not. Would it be conceivable to either:
1. have a way to inject some numeric cost to influence which of the
resulting combinations is chosen?
2. revive some form of intrinsic that guarantees the instruction is
generated?

I realize point 2 runs contrary to the evolution of LLVM, as these intrinsics
were removed ca. 2015 in favor of the combiner-based approach.
Still, it seems that `we have very little arch-specific shuffle combines`
could be a signal that such intrinsics are needed?

>
>> >> I have verified independently that, in isolation, a single such
>> >> shuffle creates a vblendps. I see them being recombined in the
>> >> produced assembly, and I would like to experiment with preventing
>> >> vshufps + vblendps + vblendps from being recombined into vunpckxxx +
>> >> vunpckxxx instructions.
>> >>
>> >> --
>>
>

-- 
N