<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Nov 10, 2021 at 10:30 AM Diego Caballero <<a href="mailto:diegocaballero@google.com">diegocaballero@google.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><a class="gmail_plusreply" id="gmail-m_-11397725050618626plusReplyChip-0" href="mailto:ntv@google.com" target="_blank">+Nicolas Vasilache</a> :)<br></div></blockquote><div><br></div><div>Thanks Diego, email is hard, I could not find ways to inject myself into my own discussion...</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Nov 9, 2021 at 10:32 PM Simon Pilgrim via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 09/11/2021 20:44, Simon Pilgrim wrote:<br>

<br>

> On 09/11/2021 08:57, Nicolas Vasilache via llvm-dev wrote:<br>

>> Hi everyone,<br>

>><br>

>> I am experimenting with LLVM lowering, intrinsics and shufflevector <br>

>> in general.<br>

>><br>

>> Here is an IR that I produce with the objective of emitting some <br>

>> vblendps instructions: <br>

>> <a href="https://gist.github.com/nicolasvasilache/0fe30c83cbfe5b4776ec9f0ee465611a" rel="noreferrer" target="_blank">https://gist.github.com/nicolasvasilache/0fe30c83cbfe5b4776ec9f0ee465611a</a>. <br>

>><br>

> From what I can see, the original IR code was (effectively):<br>

><br>

> 8 x UNPCKLPS/UNPCKHPS<br>

> 4 x SHUFPS<br>

> 8 x BLENDPS<br>

> 4 x INSERTF128<br>

> 4 x PERM2F128<br>

><br>

>> I compile this further with<br>

>><br>

>> clang -x ir -emit-llvm -S -mcpu=haswell -O3 -o - | llc -O3 <br>

>> -mcpu=haswell - -o -<br>

>><br>

>> to obtain:<br>

>><br>

>> <a href="https://gist.github.com/nicolasvasilache/2c773b86fcda01cc28711828a0a9ce0a" rel="noreferrer" target="_blank">https://gist.github.com/nicolasvasilache/2c773b86fcda01cc28711828a0a9ce0a</a> <br>

>><br>

><br>

> and after the x86 shuffle combines:<br>

><br>

> 8 x UNPCKLPS/UNPCKHPS<br>

> 8 x UNPCKLPD/UNPCKHPD<br>

> 4 x INSERTF128<br>

> 4 x PERM2F128<br>

><br>

> Starting from each BLENDPS, they've combined with the SHUFPS to create <br>

> the UNPCK*PD nodes. We nearly always benefit from folding shuffle <br>

> chains to reduce total instruction counts, even if some inner nodes <br>

> have multiple uses (like the SHUFPS), and I'd hate to lose that.<br>
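<div>A minimal sketch of why this fold is legal, modelling one 128-bit lane in plain Python (the AVX forms repeat the pattern per lane). The lane values, the SHUFPS indices and the blend masks are assumptions taken from the commonly used blend-based transpose, not read from the gist, and the helper names are hypothetical.</div>
<pre>
```python
# One 128-bit lane is modelled as a tuple of four floats.

def shufps(a, b, sel):
    """SHUFPS: two elements picked from a, then two from b, by index."""
    return (a[sel[0]], a[sel[1]], b[sel[2]], b[sel[3]])

def blendps(a, b, mask):
    """BLENDPS: element i comes from b when mask[i] is 1, else from a."""
    return tuple(b[i] if mask[i] else a[i] for i in range(4))

def unpcklpd(a, b):
    """UNPCKLPD: low 64 bits of a followed by low 64 bits of b."""
    return (a[0], a[1], b[0], b[1])

def unpckhpd(a, b):
    """UNPCKHPD: high 64 bits of a followed by high 64 bits of b."""
    return (a[2], a[3], b[2], b[3])

# Hypothetical inputs standing in for two UNPCK*PS results.
t0 = (0.0, 1.0, 2.0, 3.0)
t2 = (10.0, 11.0, 12.0, 13.0)

# One shared SHUFPS feeds two blends (assumed index and mask choice).
v = shufps(t0, t2, (2, 3, 0, 1))    # (t0[2], t0[3], t2[0], t2[1])
lo = blendps(t0, v, (0, 0, 1, 1))   # keep t0[0..1], take v[2..3]
hi = blendps(t2, v, (1, 1, 0, 0))   # take v[0..1], keep t2[2..3]

# Each blend result is expressible as a single UNPCK*PD of the inputs,
# which is the shape the shuffle combiner prefers.
folds_to_unpck = (lo == unpcklpd(t0, t2)) and (hi == unpckhpd(t0, t2))
```
</pre>
<div>Under these assumptions, one SHUFPS plus two BLENDPS compute the same values as two UNPCK*PD, so the combiner sees three nodes it can replace with two.</div>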

><br>

>> At this point, I would expect to see some vblendps instructions <br>

>> generated for the pieces of IR that produce %48/%49 %51/%52 %54/%55 <br>

>> and %57/%58 to reduce pressure on port 5 (vblendps can also go on <br>

>> ports 0 and 1). However the expected instruction does not get <br>

>> generated and llvm-mca continues to show me high port 5 contention.<br>

>><br>

>> Could people suggest some steps / commands to help better understand <br>

>> why my expectation is not met and whether I can do something to make <br>

>> the compiler generate what I want? Thanks in advance!<br>

> So on Haswell, we've gained 4 extra Port5-only shuffles but removed <br>

> the 8 Port015 blends.<br>
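<div>The arithmetic behind that statement can be checked with a short tally. The instruction counts are the ones listed in this thread; the split into "port 5 only" versus "ports 0/1/5" follows the wording used here for Haswell and is otherwise an assumption of this sketch.</div>
<pre>
```python
# Shuffle classes treated as port-5-only on Haswell (assumption based on
# the thread's wording); BLENDPS is the only op here that can also issue
# on ports 0 and 1.
PORT5_ONLY = {"UNPCK*PS", "UNPCK*PD", "SHUFPS", "INSERTF128", "PERM2F128"}

# Instruction mix of the original IR, as listed in the thread.
before = {"UNPCK*PS": 8, "SHUFPS": 4, "BLENDPS": 8,
          "INSERTF128": 4, "PERM2F128": 4}

# Instruction mix after the x86 shuffle combines, as listed in the thread.
after = {"UNPCK*PS": 8, "UNPCK*PD": 8, "INSERTF128": 4, "PERM2F128": 4}

def tally(counts):
    """Return (port-5-only ops, ops that may use ports 0, 1 or 5)."""
    p5 = sum(n for op, n in counts.items() if op in PORT5_ONLY)
    return p5, sum(counts.values()) - p5

before_p5, before_flex = tally(before)
after_p5, after_flex = tally(after)

print("before:", before_p5, "port-5-only,", before_flex, "flexible")
print("after: ", after_p5, "port-5-only,", after_flex, "flexible")
```
</pre>
<div>The total drops from 28 to 24 shuffles, but the port-5-only count rises from 20 to 24, which is the trade-off being discussed.</div>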

><br>

> We have very little arch-specific shuffle combines, just the <br>

> fast-variable-shuffle tuning flags to avoid unnecessary shuffle mask <br>

> loads; the shuffle combines just aim for a reduction in simple <br>

> target shuffle nodes. And tbh I'm reluctant to add to this as shuffle <br>

> combining is complex already.<br>

><br>

> We should be preferring to lower/combine to BLENDPS in more <br>

> circumstances (it's commutable and never slower than any other target <br>

> shuffle, although demanded elts can do less with 'undef' elements), <br>

> but that won't help us here.<br>

><br>

> So far I've failed to find a BLEND-based 8x8 transpose pattern that <br>

> the shuffle combiner doesn't manage to combine back to the <br>

> 8xUNPCK/SHUFPS ops :(<br></blockquote></div></blockquote><div><br></div><div>If you are referring to this specific code, yes, same for me.</div><div>If you are thinking about the general 8x8 transpose problem, I have a version with vector&lt;4xf32&gt; loads that ends up using blends; as expected, the port 5 pressure reduction helps, and both llvm-mca and runtime measurements agree that it is 20-30% faster.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

The only thing I can think of is you might want to see if you can <br>

reorder the INSERTF128/PERM2F128 shuffles in between the UNPCK*PS and <br>

the SHUFPS/BLENDPS:<br>

<br>

8 x UNPCKLPS/UNPCKHPS<br>

4 x INSERTF128<br>

4 x PERM2F128<br>

4 x SHUFPS<br>

8 x BLENDPS<br>

<br>

Splitting the per-lane shuffles with the subvector-shuffles could help <br>

stop the shuffle combiner.<br></blockquote></div></blockquote><div><br></div><div>Right, I tried different variations here but invariably got the same result.</div><div>The vector&lt;4xf32&gt;-based version is something that I also want to target for a bunch of orthogonal reasons.</div><div>I'll note that my use case is MLIR codegen with explicit vectors and intrinsics -&gt; LLVM, so I have quite a bit of flexibility.</div>But it feels unnatural in the compiler flow to have to branch off at a significantly higher level of abstraction to sidestep concerns related to X86 microarchitecture details.<div><br></div><div>As I am very new to this part of LLVM, I am not sure what is feasible. Would it be conceivable to either:</div><div>1. have a way to inject some numeric cost to influence the value of some resulting combinations?</div><div>2. revive some form of intrinsic and guarantee that the instruction would be generated?</div><div><br></div><div>I realize point 2 runs contrary to the evolution of LLVM, as these intrinsics were removed ca. 2015 in favor of the combiner-based approach.</div><div>Still, it seems that `we have very little arch-specific shuffle combines` could be a signal that such intrinsics are needed?</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

>> I have verified independently that in isolation, a single such <br>

>> shuffle creates a vblendps. I see them being recombined in the <br>

>> produced assembly, and I would like to experiment with preventing <br>

>> vshufps + vblendps + vblendps from being recombined into vunpckxxx + <br>

>> vunpckxxx instructions.<br>

>><br>

>> -- <br>

_______________________________________________<br>

LLVM Developers mailing list<br>

<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a><br>

<a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>

</blockquote></div>

</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr">N</div></div></div>