[llvm-dev] Understanding and controlling some of the AVX shuffle emission paths

Sun Nov 21 11:21:44 PST 2021

For (my) future self-reference, here is the code to auto-generate
the IR patterns for transpose: https://godbolt.org/z/PfcWrnss4

Roman

On Sun, Nov 21, 2021 at 5:59 PM Nicolas Vasilache via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>
> FYI, a commit is up for the inline asm change: https://reviews.llvm.org/D114335.
>
> On Sun, Nov 14, 2021 at 11:16 PM Nicolas Vasilache <ntv at google.com> wrote:
>>
>> Not yet, the InlineAsmOp in MLIR is still generally unused.
>> It has been used a bit in the IREE project though (https://github.com/google/iree/blob/49a81c60329437e64791ee1abd09d47fe1cde205/iree/compiler/Codegen/LLVMCPU/VectorContractToAArch64InlineAsmOp.cpp#L103).
>>
>> I should indeed be able to intersperse my lowering (https://github.com/llvm/llvm-project/blob/main/mlir/lib/Dialect/X86Vector/Transforms/AVXTranspose.cpp#L124) with some InlineAsmOp uses.
>> I'll report back when I have something.
>>
>> On Sun, Nov 14, 2021 at 4:53 PM Simon Pilgrim <llvm-dev at redking.me.uk> wrote:
>>>
>>> Nicolas - have you investigated just using inline asm instead?
>>>
>>> On 11/11/2021 08:34, Wang, Pengfei via llvm-dev wrote:
>>>
>>> >As I am very new to this part of LLVM, I am not sure what is feasible. Would it be envisionable to either:
>>>
>>> >1. have a way to inject some numeric cost to influence the value of some resulting combinations?
>>>
>>> >2. revive some form of intrinsic and guarantee that the instruction would be generated?
>>>
>>>
>>>
>>> I think a feasible way is to add a new tuningXXX feature for given targets and do something different with the flag in the combine.
>>>
>>> 1) seems like overengineering, and 2) seems like overkill that would forgo potential opportunities in the combine.
>>>
>>>
>>>
>>> Thanks
>>>
>>> Phoebe
>>>
>>>
>>>
>>> From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of Nicolas Vasilache via llvm-dev
>>> Sent: Wednesday, November 10, 2021 5:46 PM
>>> To: Diego Caballero <diegocaballero at google.com>
>>> Cc: llvm-dev at lists.llvm.org
>>> Subject: Re: [llvm-dev] Understanding and controlling some of the AVX shuffle emission paths
>>>
>>>
>>> On Wed, Nov 10, 2021 at 10:30 AM Diego Caballero <diegocaballero at google.com> wrote:
>>>
>>> +Nicolas Vasilache :)
>>>
>>>
>>>
>>> Thanks Diego, email is hard, I could not find ways to inject myself into my own discussion...
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Nov 9, 2021 at 10:32 PM Simon Pilgrim via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>
>>> On 09/11/2021 20:44, Simon Pilgrim wrote:
>>>
>>> > On 09/11/2021 08:57, Nicolas Vasilache via llvm-dev wrote:
>>> >> Hi everyone,
>>> >>
>>> >> I am experimenting with LLVM lowering, intrinsics and shufflevector
>>> >> in general.
>>> >>
>>> >> Here is an IR that I produce with the objective of emitting some
>>> >> vblendps instructions:
>>> >> https://gist.github.com/nicolasvasilache/0fe30c83cbfe5b4776ec9f0ee465611a.
>>> >>
>>> > From what I can see, the original IR code was (effectively):
>>> >
>>> > 8 x UNPCKLPS/UNPCKHPS
>>> > 4 x SHUFPS
>>> > 8 x BLENDPS
>>> > 4 x INSERTF128
>>> > 4 x PERM2F128
>>> >
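For readers who want to check the instruction mix above, here is a minimal sketch that emulates the shuffles on plain Python lists and confirms that 8 unpcks, 4 shufps, 8 blends and 8 128-bit lane permutes do produce an 8x8 transpose. The immediates (0x4E, 0xCC, 0x33, 0x20, 0x31) come from the commonly cited AVX transpose sequence and are an assumption here; they are not copied from the gist. The helper names are hypothetical, and the four 0x20 permutes stand in for the INSERTF128 nodes.

```python
# Each "register" is a list of 8 values: two 128-bit lanes of 4 floats.

def unpcklps(a, b):
    # Per lane: interleave the two low elements of a and b.
    return [x for l in (0, 4) for x in (a[l], b[l], a[l + 1], b[l + 1])]

def unpckhps(a, b):
    # Per lane: interleave the two high elements of a and b.
    return [x for l in (0, 4) for x in (a[l + 2], b[l + 2], a[l + 3], b[l + 3])]

def shufps(a, b, imm):
    # Per lane: two elements picked from a, then two from b, 2 bits each.
    s = [(imm >> (2 * k)) & 3 for k in range(4)]
    return [x for l in (0, 4)
            for x in (a[l + s[0]], a[l + s[1]], b[l + s[2]], b[l + s[3]])]

def blendps(a, b, imm):
    # Bit i of imm set: take element i from b, otherwise from a.
    return [b[i] if (imm >> i) & 1 else a[i] for i in range(8)]

def perm2f128(a, b, imm):
    # Select one 128-bit lane for each half (zeroing bits are ignored here).
    lanes = [a[:4], a[4:], b[:4], b[4:]]
    return lanes[imm & 3] + lanes[(imm >> 4) & 3]

def transpose8x8(rows):
    # Stage 1: 8 x UNPCKLPS/UNPCKHPS on row pairs (0,1), (2,3), (4,5), (6,7).
    t = []
    for i in range(0, 8, 2):
        t.append(unpcklps(rows[i], rows[i + 1]))
        t.append(unpckhps(rows[i], rows[i + 1]))
    # Stage 2: 4 x SHUFPS, each feeding 2 x BLENDPS (8 blends in total).
    tt = []
    for base in (0, 4):
        for k in (0, 1):
            lo, hi = t[base + k], t[base + k + 2]
            v = shufps(lo, hi, 0x4E)
            tt.append(blendps(lo, v, 0xCC))
            tt.append(blendps(hi, v, 0x33))
    # Stage 3: 4 low-lane merges (INSERTF128-like) and 4 x PERM2F128.
    low = [perm2f128(tt[k], tt[k + 4], 0x20) for k in range(4)]
    high = [perm2f128(tt[k], tt[k + 4], 0x31) for k in range(4)]
    return low + high

rows = [[8 * i + j for j in range(8)] for i in range(8)]
assert transpose8x8(rows) == [list(col) for col in zip(*rows)]
```

The final assertion compares the emulated result against a transpose computed with `zip`, so the sketch checks only the data movement, not what the backend would actually select.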
>>> >> I compile this further with
>>> >>
>>> >> clang -x ir -emit-llvm -S -mcpu=haswell -O3 -o - | llc -O3
>>> >> -mcpu=haswell - -o -
>>> >>
>>> >> to obtain:
>>> >>
>>> >> https://gist.github.com/nicolasvasilache/2c773b86fcda01cc28711828a0a9ce0a
>>> >>
>>> >
>>> > and after the x86 shuffle combines:
>>> >
>>> > 8 x UNPCKLPS/UNPCKHPS
>>> > 8 x UNPCKLPD/UNPCKHPD
>>> > 4 x INSERTF128
>>> > 4 x PERM2F128
>>> >
>>> > Starting from each BLENDPS, they've combined with the SHUFPS to create
>>> > the UNPCK*PD nodes. We nearly always benefit from folding shuffle
>>> > chains to reduce total instruction counts, even if some inner nodes
>>> > have multiple uses (like the SHUFPS), and I'd hate to lose that.
>>> >
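The fold described above can be checked by hand. The sketch below assumes the blends use immediates 0xCC and 0x33 on top of a SHUFPS with immediate 0x4E (the commonly cited transpose sequence, not necessarily the exact masks in the gist) and shows that each SHUFPS + BLENDPS pair selects exactly the elements that UNPCKLPD or UNPCKHPD would. The function names are hypothetical emulations on Python lists, not LLVM APIs.

```python
# Each "register" is a list of 8 values: two 128-bit lanes of 4 floats.

def shufps(a, b, imm):
    # Per lane: two elements picked from a, then two from b, 2 bits each.
    s = [(imm >> (2 * k)) & 3 for k in range(4)]
    return [x for l in (0, 4)
            for x in (a[l + s[0]], a[l + s[1]], b[l + s[2]], b[l + s[3]])]

def blendps(a, b, imm):
    # Bit i of imm set: take element i from b, otherwise from a.
    return [b[i] if (imm >> i) & 1 else a[i] for i in range(8)]

def unpcklpd(a, b):
    # Per lane: low 64 bits (two floats) of a, then low 64 bits of b.
    return [x for l in (0, 4) for x in (a[l], a[l + 1], b[l], b[l + 1])]

def unpckhpd(a, b):
    # Per lane: high 64 bits of a, then high 64 bits of b.
    return [x for l in (0, 4) for x in (a[l + 2], a[l + 3], b[l + 2], b[l + 3])]

def fold_is_valid(a, b):
    # One shared SHUFPS feeds two BLENDPS, as in the original sequence.
    v = shufps(a, b, 0x4E)
    return (blendps(a, v, 0xCC) == unpcklpd(a, b)
            and blendps(b, v, 0x33) == unpckhpd(a, b))

assert fold_is_valid(list(range(8)), list(range(8, 16)))
```

Under these assumptions, the two blends that share one SHUFPS fold into one UNPCKLPD and one UNPCKHPD, and the SHUFPS itself becomes dead, which would be consistent with 4 SHUFPS + 8 BLENDPS turning into 8 UNPCK*PD.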
>>> >> At this point, I would expect to see some vblendps instructions
>>> >> generated for the pieces of IR that produce %48/%49 %51/%52 %54/%55
>>> >> and %57/%58 to reduce pressure on port 5 (vblendps can also go on
>>> >> ports 0 and 1). However the expected instruction does not get
>>> >> generated and llvm-mca continues to show me high port 5 contention.
>>> >>
>>> >> Could people suggest some steps / commands to help better understand
>>> >> why my expectation is not met and whether I can do something to make
>>> >> the compiler generate what I want? Thanks in advance!
>>> > So on Haswell, we've gained 4 extra Port5-only shuffles but removed
>>> > the 8 Port015 blends.
>>> >
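The port arithmetic behind that statement can be written out as a small tally. The port assignments below are an assumption for the register forms on Haswell (every shuffle on port 5 only, BLENDPS on ports 0, 1 or 5); the instruction counts are the ones listed earlier in this thread.

```python
# Assumed Haswell port bindings (register operands); not taken from the thread.
PORTS = {
    "UNPCK*PS": "5",
    "UNPCK*PD": "5",
    "SHUFPS": "5",
    "BLENDPS": "015",
    "INSERTF128": "5",
    "PERM2F128": "5",
}

# Instruction mix of the original IR, as listed in the thread.
BEFORE = {"UNPCK*PS": 8, "SHUFPS": 4, "BLENDPS": 8, "INSERTF128": 4, "PERM2F128": 4}
# Instruction mix after the x86 shuffle combines.
AFTER = {"UNPCK*PS": 8, "UNPCK*PD": 8, "INSERTF128": 4, "PERM2F128": 4}

def port5_only(counts):
    # Number of instructions that can only issue on port 5.
    return sum(n for op, n in counts.items() if PORTS[op] == "5")

def flexible(counts):
    # Number of instructions that may issue on port 0, 1 or 5.
    return sum(n for op, n in counts.items() if PORTS[op] == "015")

print("before:", port5_only(BEFORE), "port-5-only,", flexible(BEFORE), "port-015")
print("after: ", port5_only(AFTER), "port-5-only,", flexible(AFTER), "port-015")
```

With these assumptions the combined form has fewer instructions overall (24 instead of 28), but all of them compete for port 5, whereas the original form leaves 8 blends free to issue on ports 0 and 1.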
>>> > We have very little arch-specific shuffle combines, just the
>>> > fast-variable-shuffle tuning flags to avoid unnecessary shuffle mask
>>> > loads; the shuffle combines just aim to reduce the number of simple
>>> > target shuffle nodes. And tbh I'm reluctant to add to this, as shuffle
>>> > combining is complex already.
>>> >
>>> > We should be preferring to lower/combine to BLENDPS in more
>>> > circumstances (it's commutable and never slower than any other target
>>> > shuffle, although demanded elts can do less with 'undef' elements),
>>> > but that won't help us here.
>>> >
>>> > So far I've failed to find a BLEND-based 8x8 transpose pattern that
>>> > the shuffle combiner doesn't manage to combine back to the
>>> > 8xUNPCK/SHUFPS ops :(
>>>
>>>
>>>
>>> If you are referring to this specific code, yes same for me.
>>>
>>> If you are thinking about the general 8x8 transpose problem, I have a version with vector<4xf32> loads that ends up using blends; as expected, the port 5 pressure reduction helps and both llvm-mca and runtime agree that this is 20-30% faster.
>>>
>>>
>>>
>>>
>>> The only thing I can think of is you might want to see if you can
>>> reorder the INSERTF128/PERM2F128 shuffles in between the UNPCK*PS and
>>> the SHUFPS/BLENDPS:
>>>
>>> 8 x UNPCKLPS/UNPCKHPS
>>> 4 x INSERTF128
>>> 4 x PERM2F128
>>> 4 x SHUFPS
>>> 8 x BLENDPS
>>>
>>> Splitting the per-lane shuffles with the subvector-shuffles could help
>>> stop the shuffle combiner.
>>>
>>>
>>>
>>> Right, I tried different variations here but invariably got the same result.
>>>
>>> The vector<4xf32> based version is something that I also want to target for a bunch of orthogonal reasons.
>>>
>>> I'll note that my use case is MLIR codegen with explicit vectors and intrinsics -> LLVM so I have quite some flexibility.
>>>
>>> But it feels unnatural in the compiler flow to have to branch off at a significantly higher level of abstraction to sidestep concerns related to X86 microarchitecture details.
>>>
>>>
>>>
>>> As I am very new to this part of LLVM, I am not sure what is feasible. Would it be envisionable to either:
>>>
>>> 1. have a way to inject some numeric cost to influence the value of some resulting combinations?
>>>
>>> 2. revive some form of intrinsic and guarantee that the instruction would be generated?
>>>
>>>
>>>
>>> I realize point 2. is contrary to the evolution of LLVM as these intrinsics were removed ca. 2015 in favor of the combiner-based approach.
>>>
>>> Still it seems that `we have very little arch-specific shuffle combines` could be the signal that such intrinsics are needed?
>>>
>>>
>>>
>>>
>>> >> I have verified independently that in isolation, a single such
>>> >> shuffle creates a vblendps. I see them being recombined in the
>>> >> produced assembly, and I would like to experiment with preventing
>>> >> vshufps + vblendps + vblendps from being recombined into vunpckxxx +
>>> >> vunpckxxx instructions.
>>> >>
>>> >> --
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>>
>>>
>>>
>>> --
>>>
>>> N
>>>
>>>
>>
>>
>>
>> --
>> N
>
>
>
> --
> N