[llvm-dev] Understanding and controlling some of the AVX shuffle emission paths

Tue Nov 9 13:32:11 PST 2021

On 09/11/2021 20:44, Simon Pilgrim wrote:

> On 09/11/2021 08:57, Nicolas Vasilache via llvm-dev wrote:
>> Hi everyone,
>>
>> I am experimenting with LLVM lowering, intrinsics and shufflevector 
>> in general.
>>
>> Here is an IR that I produce with the objective of emitting some 
>> vblendps instructions: 
>> https://gist.github.com/nicolasvasilache/0fe30c83cbfe5b4776ec9f0ee465611a. 
>>
> From what I can see, the original IR code was (effectively):
>
> 8 x UNPCKLPS/UNPCKHPS
> 4 x SHUFPS
> 8 x BLENDPS
> 4 x INSERTF128
> 4 x PERM2F128
>
>> I compile this further with
>>
>> clang -x ir -emit-llvm -S -mcpu=haswell -O3 -o - | llc -O3 
>> -mcpu=haswell - -o -
>>
>> to obtain:
>>
>> https://gist.github.com/nicolasvasilache/2c773b86fcda01cc28711828a0a9ce0a 
>>
>
> and after the x86 shuffle combines:
>
> 8 x UNPCKLPS/UNPCKHPS
> 8 x UNPCKLPD/UNPCKHPD
> 4 x INSERTF128
> 4 x PERM2F128
>
> Starting from each BLENDPS, they've combined with the SHUFPS to create 
> the UNPCK*PD nodes. We nearly always benefit from folding shuffle 
> chains to reduce total instruction counts, even if some inner nodes 
> have multiple uses (like the SHUFPS), and I'd hate to lose that.
>
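To make that fold concrete, here is a minimal Python sketch of a single 128-bit lane. The immediates (0x4E for the SHUFPS, 0xC and 0x3 for the two BLENDPS) are my assumption of the usual transpose pattern, not values read from the gist, and the helper names are purely illustrative. It only checks that one SHUFPS feeding two BLENDPS yields the same elements as an UNPCKLPD/UNPCKHPD pair, which is the equivalence the combiner exploits.

```python
# Model one 128-bit lane of four floats as a list of labels. The 256-bit AVX
# forms apply the same pattern independently to each lane.

def shufps(a, b, imm):
    # SHUFPS: results 0-1 are selected from a, results 2-3 from b,
    # using two selector bits per result element.
    sel = [(imm >> (2 * i)) & 3 for i in range(4)]
    return [a[sel[0]], a[sel[1]], b[sel[2]], b[sel[3]]]

def blendps(a, b, imm):
    # BLENDPS: bit i of imm set means take element i from b, otherwise from a.
    return [b[i] if (imm >> i) & 1 else a[i] for i in range(4)]

def unpcklpd(a, b):
    # UNPCKLPD: low 64 bits of a, then low 64 bits of b.
    return [a[0], a[1], b[0], b[1]]

def unpckhpd(a, b):
    # UNPCKHPD: high 64 bits of a, then high 64 bits of b.
    return [a[2], a[3], b[2], b[3]]

def blend_form(a, b):
    # Assumed immediates (hypothetical, not taken from the gist).
    v = shufps(a, b, 0x4E)      # [a2, a3, b0, b1]
    lo = blendps(a, v, 0xC)     # keep a0, a1; take b0, b1 from v
    hi = blendps(b, v, 0x3)     # take a2, a3 from v; keep b2, b3
    return lo, hi

a = ["a0", "a1", "a2", "a3"]
b = ["b0", "b1", "b2", "b3"]
lo, hi = blend_form(a, b)
assert lo == unpcklpd(a, b)
assert hi == unpckhpd(a, b)
```

Three nodes (one shared SHUFPS plus two blends) collapse to two, which is why the node-count heuristic prefers the UNPCK*PD form.
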
>> At this point, I would expect to see some vblendps instructions 
>> generated for the pieces of IR that produce %48/%49 %51/%52 %54/%55 
>> and %57/%58 to reduce pressure on port 5 (vblendps can also go on 
>> ports 0 and 1). However the expected instruction does not get 
>> generated and llvm-mca continues to show me high port 5 contention.
>>
>> Could people suggest some steps / commands to help better understand 
>> why my expectation is not met and whether I can do something to make 
>> the compiler generate what I want? Thanks in advance!
> So on Haswell, we've gained 4 extra Port5-only shuffles but removed 
> the 8 Port015 blends.
>
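The accounting behind that statement can be tallied with a short Python sketch. The port bindings are an assumption on my part (register-register forms on Haswell, with every shuffle here bound to port 5 and BLENDPS free to use ports 0, 1 or 5); the instruction counts are the ones listed earlier in this thread, and the names are illustrative only.

```python
# Assumed Haswell bindings for the register-register forms: these shuffles
# can only issue on port 5, whereas BLENDPS can issue on ports 0, 1 or 5.
P5_ONLY = {"UNPCKPS", "UNPCKPD", "SHUFPS", "INSERTF128", "PERM2F128"}

def port5_only(counts):
    # Number of instructions that have no choice but port 5.
    return sum(n for op, n in counts.items() if op in P5_ONLY)

# Instruction mix of the original IR, as listed in this thread.
before = {"UNPCKPS": 8, "SHUFPS": 4, "BLENDPS": 8,
          "INSERTF128": 4, "PERM2F128": 4}

# Instruction mix after the x86 shuffle combines.
after = {"UNPCKPS": 8, "UNPCKPD": 8, "INSERTF128": 4, "PERM2F128": 4}

print("total instructions:", sum(before.values()), "->", sum(after.values()))
print("port-5-only shuffles:", port5_only(before), "->", port5_only(after))
print("blends:", before.get("BLENDPS", 0), "->", after.get("BLENDPS", 0))
```

So the combine saves four instructions overall, but every instruction that remains competes for port 5.
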
> We have very few arch-specific shuffle combines, just the 
> fast-variable-shuffle tuning flags to avoid unnecessary shuffle mask 
> loads; the shuffle combiner just aims to reduce the number of simple 
> target shuffle nodes. And tbh I'm reluctant to add to this as shuffle 
> combining is complex already.
>
> We should be preferring to lower/combine to BLENDPS in more 
> circumstances (it's commutable and never slower than any other target 
> shuffle, although demanded elts can do less with 'undef' elements), 
> but that won't help us here.
>
> So far I've failed to find a BLEND-based 8x8 transpose pattern that 
> the shuffle combiner doesn't manage to combine back to the 
> 8xUNPCK/SHUFPS ops :(

The only thing I can think of is that you might want to see if you can 
reorder the INSERTF128/PERM2F128 shuffles so they sit in between the 
UNPCK*PS and the SHUFPS/BLENDPS:

8 x UNPCKLPS/UNPCKHPS
4 x INSERTF128
4 x PERM2F128
4 x SHUFPS
8 x BLENDPS

Separating the two groups of per-lane shuffles with the subvector 
shuffles could help stop the shuffle combiner from folding the 
SHUFPS/BLENDPS back into UNPCK*PD.

>> I have verified independently that, in isolation, a single such 
>> shuffle creates a vblendps. I see them being recombined in the 
>> produced assembly, and I am looking for a way to prevent 
>> vshufps + vblendps + vblendps from being recombined into vunpckxxx + 
>> vunpckxxx instructions.