<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Nicolas - have you investigated just using inline asm instead?<br>
</p>
<div class="moz-cite-prefix">On 11/11/2021 08:34, Wang, Pengfei via
llvm-dev wrote:<br>
</div>
<blockquote type="cite"
cite="mid:PH0PR11MB5627C7B528EF0BFA2AED73E188949@PH0PR11MB5627.namprd11.prod.outlook.com">
<style>@font-face
{font-family:SimSun;
panose-1:2 1 6 0 3 1 1 1 1 1;}@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}@font-face
{font-family:"\@SimSun";
panose-1:2 1 6 0 3 1 1 1 1 1;}p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}span.EmailStyle20
{mso-style-type:personal-compose;
font-family:"Calibri",sans-serif;
color:windowtext;
font-weight:normal;
font-style:normal;
text-decoration:none none;}.MsoChpDefault
{mso-style-type:export-only;
font-family:"Calibri",sans-serif;}div.WordSection1
{page:WordSection1;}</style>
<div class="WordSection1">
<p class="MsoNormal">>As I am very new to this part of LLVM,
I am not sure what is feasible. Would it be
conceivable to either:<o:p></o:p></p>
<p class="MsoNormal">>1. have a way to inject some numeric
cost to influence the value of some resulting combinations?<o:p></o:p></p>
<p class="MsoNormal">>2. revive some form of intrinsic and
guarantee that the instruction would be generated?<o:p></o:p></p>
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D">I think a
feasible approach is to add a new tuningXXX feature for the given
targets and have the combine behave differently when that flag is
set.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D">1) seems
over-engineered, and 2) seems like overkill, since it would give up
potential optimization opportunities from the combine.<o:p></o:p></span></p>
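<p class="MsoNormal"><span style="color:#1F497D">As a rough
illustration of the tuning-flag idea, the toy Python model below gates
the fold on a per-target flag. Everything in it is an assumption made
for illustration: <code>ShuffleSeq</code>, <code>should_fold</code> and
<code>tuning_prefer_blend</code> are hypothetical names, not LLVM APIs,
and the counts are the ones quoted in this thread, treating every
non-blend shuffle as port-5-only.</span></p>
<pre>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShuffleSeq:
    """Instruction counts for one candidate shuffle sequence (toy model)."""
    total_ops: int    # all target shuffle nodes in the sequence
    port5_only: int   # nodes assumed to issue only on port 5


def should_fold(before, after, tuning_prefer_blend=False):
    """Decide whether the combiner replaces `before` with `after`.

    `tuning_prefer_blend` is a hypothetical stand-in for the proposed
    per-target tuning flag; it is not an existing LLVM feature name.
    """
    fewer_ops = before.total_ops > after.total_ops
    if not tuning_prefer_blend:
        # Default policy described in the thread: minimise node count.
        return fewer_ops
    # Flagged policy: also refuse folds that add port-5-only work.
    more_port5 = after.port5_only > before.port5_only
    return fewer_ops and not more_port5


# Counts quoted in this thread for the 8x8 transpose on Haswell.
BLEND_FORM = ShuffleSeq(total_ops=28, port5_only=20)
UNPCK_FORM = ShuffleSeq(total_ops=24, port5_only=24)
```
</pre>
<p class="MsoNormal"><span style="color:#1F497D">With the flag off,
this model folds to the UNPCK form (24 nodes instead of 28); with it
on, the fold is refused because port-5-only work would rise from 20 to
24.</span></p>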
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D">Thanks<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D">Phoebe<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b>From:</b> llvm-dev
<a class="moz-txt-link-rfc2396E" href="mailto:llvm-dev-bounces@lists.llvm.org">&lt;llvm-dev-bounces@lists.llvm.org&gt;</a> <b>On Behalf Of
</b>Nicolas Vasilache via llvm-dev<br>
<b>Sent:</b> Wednesday, November 10, 2021 5:46 PM<br>
<b>To:</b> Diego Caballero <a class="moz-txt-link-rfc2396E" href="mailto:diegocaballero@google.com">&lt;diegocaballero@google.com&gt;</a><br>
<b>Cc:</b> <a class="moz-txt-link-abbreviated" href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>
<b>Subject:</b> Re: [llvm-dev] Understanding and controlling
some of the AVX shuffle emission paths<o:p></o:p></p>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<p class="MsoNormal">On Wed, Nov 10, 2021 at 10:30 AM
Diego Caballero <<a
href="mailto:diegocaballero@google.com"
moz-do-not-send="true" class="moz-txt-link-freetext">diegocaballero@google.com</a>>
wrote:<o:p></o:p></p>
</div>
<blockquote style="border:none;border-left:solid #CCCCCC
1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-right:0in">
<div>
<p class="MsoNormal"><a href="mailto:ntv@google.com"
target="_blank" moz-do-not-send="true">+Nicolas
Vasilache</a> :)<o:p></o:p></p>
</div>
</blockquote>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Thanks Diego. Email is hard; I could
not find a way to inject myself into my own discussion...<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
<blockquote style="border:none;border-left:solid #CCCCCC
1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-right:0in">
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<p class="MsoNormal">On Tue, Nov 9, 2021 at 10:32 PM
Simon Pilgrim via llvm-dev <<a
href="mailto:llvm-dev@lists.llvm.org"
target="_blank" moz-do-not-send="true"
class="moz-txt-link-freetext">llvm-dev@lists.llvm.org</a>>
wrote:<o:p></o:p></p>
</div>
<blockquote style="border:none;border-left:solid #CCCCCC
1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-right:0in">
<p class="MsoNormal">On 09/11/2021 20:44, Simon
Pilgrim wrote:<br>
<br>
> On 09/11/2021 08:57, Nicolas Vasilache via
llvm-dev wrote:<br>
>> Hi everyone,<br>
>><br>
>> I am experimenting with LLVM lowering,
intrinsics and shufflevector <br>
>> in general.<br>
>><br>
>> Here is an IR that I produce with the
objective of emitting some <br>
>> vblendps instructions: <br>
>> <a
href="https://gist.github.com/nicolasvasilache/0fe30c83cbfe5b4776ec9f0ee465611a"
target="_blank" moz-do-not-send="true"
class="moz-txt-link-freetext">
https://gist.github.com/nicolasvasilache/0fe30c83cbfe5b4776ec9f0ee465611a</a>.
<br>
>><br>
> From what I can see, the original IR code was
(effectively):<br>
><br>
> 8 x UNPCKLPS/UNPCKHPS<br>
> 4 x SHUFPS<br>
> 8 x BLENDPS<br>
> 4 x INSERTF128<br>
> 4 x PERM2F128<br>
><br>
>> I compile this further with<br>
>><br>
>> clang -x ir -emit-llvm -S -mcpu=haswell -O3
-o - | llc -O3 <br>
>> -mcpu=haswell - -o -<br>
>><br>
>> to obtain:<br>
>><br>
>> <a
href="https://gist.github.com/nicolasvasilache/2c773b86fcda01cc28711828a0a9ce0a"
target="_blank" moz-do-not-send="true"
class="moz-txt-link-freetext">
https://gist.github.com/nicolasvasilache/2c773b86fcda01cc28711828a0a9ce0a</a>
<br>
>><br>
><br>
> and after the x86 shuffle combines:<br>
><br>
> 8 x UNPCKLPS/UNPCKHPS<br>
> 8 x UNPCKLPD/UNPCKHPD<br>
> 4 x INSERTF128<br>
> 4 x PERM2F128<br>
><br>
> Starting from each BLENDPS, they've combined
with the SHUFPS to create <br>
> the UNPCK*PD nodes. We nearly always benefit
from folding shuffle <br>
> chains to reduce total instruction counts, even
if some inner nodes <br>
> have multiple uses (like the SHUFPS), and I'd
hate to lose that.<br>
><br>
>> At this point, I would expect to see some
vblendps instructions <br>
>> generated for the pieces of IR that produce
%48/%49 %51/%52 %54/%55 <br>
>> and %57/%58 to reduce pressure on port 5
(vblendps can also go on <br>
>> ports 0 and 1). However the expected
instruction does not get <br>
>> generated and llvm-mca continues to show me
high port 5 contention.<br>
>><br>
>> Could people suggest some steps / commands
to help better understand <br>
>> why my expectation is not met and whether I
can do something to make <br>
>> the compiler generate what I want? Thanks
in advance!<br>
> So on Haswell, we've gained 4 extra Port5-only
shuffles but removed <br>
> the 8 Port015 blends.<br>
><br>
> We have very little arch-specific shuffle
combines, just the <br>
> fast-variable-shuffle tuning flags to avoid
unnecessary shuffle mask <br>
> loads; the shuffle combines just aim for a
reduction in simple <br>
> target shuffle nodes. And tbh I'm reluctant to
add to this as shuffle <br>
> combining is complex already.<br>
><br>
> We should be preferring to lower/combine to
BLENDPS in more <br>
> circumstances (it's commutable and never slower
than any other target <br>
> shuffle, although demanded elts can do less
with 'undef' elements), <br>
> but that won't help us here.<br>
><br>
> So far I've failed to find a BLEND-based 8x8
transpose pattern that <br>
> the shuffle combiner doesn't manage to combine
back to the <br>
> 8xUNPCK/SHUFPS ops :(<o:p></o:p></p>
</blockquote>
</div>
</blockquote>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">If you are referring to this specific
code, yes same for me.<o:p></o:p></p>
</div>
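<p class="MsoNormal">To make the fold concrete, here is a small
Python model of one 128-bit lane. It assumes the blend-based pattern is
the usual one SHUFPS feeding two BLENDPS; the exact masks in the gist
are not reproduced, and shuffle selectors are written as index tuples
rather than imm8 encodings. The helper names are illustrative
only.</p>
<pre>
```python
def unpcklps(a, b):
    """Interleave the low two elements of a and b."""
    return [a[0], b[0], a[1], b[1]]


def shufps(a, b, sel):
    """Two elements picked from a, then two from b, by index."""
    return [a[sel[0]], a[sel[1]], b[sel[2]], b[sel[3]]]


def blendps(a, b, take_b):
    """Per element: b[i] where take_b[i] is true, else a[i]."""
    return [y if t else x for x, y, t in zip(a, b, take_b)]


def unpcklpd(a, b):
    """Low 64-bit half (two floats) of a, then of b."""
    return [a[0], a[1], b[0], b[1]]


def unpckhpd(a, b):
    """High 64-bit half (two floats) of a, then of b."""
    return [a[2], a[3], b[2], b[3]]


def blend_pair(t0, t2):
    """One SHUFPS feeding two BLENDPS."""
    v = shufps(t0, t2, (2, 3, 0, 1))  # [t0[2], t0[3], t2[0], t2[1]]
    lo = blendps(t0, v, (False, False, True, True))
    hi = blendps(v, t2, (False, False, True, True))
    return lo, hi


def unpck_pair(t0, t2):
    """The two-instruction form the combiner folds to."""
    return unpcklpd(t0, t2), unpckhpd(t0, t2)


# Symbolic rows a..d, each with elements named like "a0".."a3".
a, b, c, d = (["%s%d" % (name, i) for i in range(4)] for name in "abcd")
t0, t2 = unpcklps(a, b), unpcklps(c, d)
```
</pre>
<p class="MsoNormal">In this model <code>blend_pair(t0, t2)</code> and
<code>unpck_pair(t0, t2)</code> return the same two rows, which is why
a combiner that only minimises node count sees three shuffles it can
replace with two.</p>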
<div>
<p class="MsoNormal">If you are thinking about the general
8x8 transpose problem, I have a version with
vector&lt;4xf32&gt; loads that ends up using blends; as
expected, the port 5 pressure reduction helps and both
llvm-mca and runtime agree that this is 20-30% faster.<o:p></o:p></p>
</div>
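<p class="MsoNormal">The port accounting quoted above can be sketched
in a few lines of Python. This is a back-of-the-envelope model, not an
llvm-mca replacement: it assumes register-operand forms, treats every
non-blend shuffle as port-5-only as the thread does, and ignores loads,
stores, latency and dependencies. The dictionary and function names are
made up for the example.</p>
<pre>
```python
import math

# Issue ports per op on Haswell, as described in this thread.
# Assumption: register-operand forms, so every non-blend shuffle
# is port-5-only.
PORTS = {
    "UNPCKPS": {5}, "UNPCKPD": {5}, "SHUFPS": {5},
    "INSERTF128": {5}, "PERM2F128": {5},
    "BLENDPS": {0, 1, 5},
}

# Op counts quoted earlier in the thread for the 8x8 transpose.
ORIGINAL_IR = {"UNPCKPS": 8, "SHUFPS": 4, "BLENDPS": 8,
               "INSERTF128": 4, "PERM2F128": 4}
AFTER_COMBINE = {"UNPCKPS": 8, "UNPCKPD": 8,
                 "INSERTF128": 4, "PERM2F128": 4}


def port5_only(seq):
    """Number of ops that can issue on port 5 and nowhere else."""
    return sum(n for op, n in seq.items() if PORTS[op] == {5})


def cycles_lower_bound(seq):
    """Lower bound on issue cycles for the shuffles alone.

    Port-5-only ops serialize on port 5; all ops together share
    ports 0, 1 and 5.
    """
    total = sum(seq.values())
    return max(port5_only(seq), math.ceil(total / 3))
```
</pre>
<p class="MsoNormal">Under these assumptions the original sequence is
bounded by 20 cycles and the combined one by 24, even though the
combined sequence has four fewer instructions.</p>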
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
<blockquote style="border:none;border-left:solid #CCCCCC
1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-right:0in">
<div>
<blockquote style="border:none;border-left:solid #CCCCCC
1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-right:0in">
<p class="MsoNormal"><br>
The only thing I can think of is you might want to
see if you can <br>
reorder the INSERTF128/PERM2F128 shuffles in between
the UNPCK*PS and <br>
the SHUFPS/BLENDPS:<br>
<br>
8 x UNPCKLPS/UNPCKHPS<br>
4 x INSERTF128<br>
4 x PERM2F128<br>
4 x SHUFPS<br>
8 x BLENDPS<br>
<br>
Splitting the per-lane shuffles with the
subvector-shuffles could help <br>
stop the shuffle combiner.<o:p></o:p></p>
</blockquote>
</div>
</blockquote>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Right, I tried different variations
here but invariably got the same result.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">The vector&lt;4xf32&gt;-based version
is something that I also want to target for a bunch of
orthogonal reasons.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">I'll note that my use case is MLIR
codegen with explicit vectors and intrinsics -> LLVM
so I have quite some flexibility.<o:p></o:p></p>
</div>
<p class="MsoNormal">But it feels unnatural in the compiler
flow to have to branch off at a significantly higher level
of abstraction to sidestep concerns related to X86
microarchitecture details.<o:p></o:p></p>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">As I am very new to this part of
LLVM, I am not sure what is feasible. Would it be
conceivable to either:<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">1. have a way to inject some numeric
cost to influence the value of some resulting
combinations?<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">2. revive some form of intrinsic and
guarantee that the instruction would be generated?<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">I realize point 2 is contrary to the
evolution of LLVM, as these intrinsics were removed ca.
2015 in favor of the combiner-based approach.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Still it seems that `we have very
little arch-specific shuffle combines` could be the
signal that such intrinsics are needed?<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
<blockquote style="border:none;border-left:solid #CCCCCC
1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-right:0in">
<div>
<blockquote style="border:none;border-left:solid #CCCCCC
1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-right:0in">
<p class="MsoNormal"><br>
>> I have verified independently that in
isolation, a single such <br>
>> shuffle creates a vblendps. I see them
being recombined in the <br>
>> produced assembly and I would like to
experiment with preventing <br>
>> vshufps + vblendps + vblendps from being
recombined into vunpckxxx + <br>
>> vunpckxxx instructions.<br>
>><br>
>> -- <br>
_______________________________________________<br>
LLVM Developers mailing list<br>
<a href="mailto:llvm-dev@lists.llvm.org"
target="_blank" moz-do-not-send="true"
class="moz-txt-link-freetext">llvm-dev@lists.llvm.org</a><br>
<a
href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev"
target="_blank" moz-do-not-send="true"
class="moz-txt-link-freetext">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><o:p></o:p></p>
</blockquote>
</div>
</blockquote>
</div>
<p class="MsoNormal"><br clear="all">
<o:p></o:p></p>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<p class="MsoNormal">-- <o:p></o:p></p>
<div>
<div>
<p class="MsoNormal">N<o:p></o:p></p>
</div>
</div>
</div>
</div>
</blockquote>
</body>
</html>