<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">


<head>


<meta http-equiv="Content-Type" content="text/html; charset=utf-8">


<meta name="Generator" content="Microsoft Word 15 (filtered medium)">


<style><!--


/* Font Definitions */


@font-face


        {font-family:SimSun;


        panose-1:2 1 6 0 3 1 1 1 1 1;}


@font-face


        {font-family:"Cambria Math";


        panose-1:2 4 5 3 5 4 6 3 2 4;}


@font-face


        {font-family:Calibri;


        panose-1:2 15 5 2 2 2 4 3 2 4;}


@font-face


        {font-family:"\@SimSun";


        panose-1:2 1 6 0 3 1 1 1 1 1;}


/* Style Definitions */


p.MsoNormal, li.MsoNormal, div.MsoNormal


        {margin:0in;


        font-size:11.0pt;


        font-family:"Calibri",sans-serif;}


a:link, span.MsoHyperlink


        {mso-style-priority:99;


        color:blue;


        text-decoration:underline;}


span.EmailStyle20


        {mso-style-type:personal-compose;


        font-family:"Calibri",sans-serif;


        color:windowtext;


        font-weight:normal;


        font-style:normal;


        text-decoration:none none;}


.MsoChpDefault


        {mso-style-type:export-only;


        font-family:"Calibri",sans-serif;}


@page WordSection1


        {size:8.5in 11.0in;


        margin:1.0in 1.0in 1.0in 1.0in;}


div.WordSection1


        {page:WordSection1;}


--></style><!--[if gte mso 9]><xml>


<o:shapedefaults v:ext="edit" spidmax="1026" />


</xml><![endif]--><!--[if gte mso 9]><xml>


<o:shapelayout v:ext="edit">


<o:idmap v:ext="edit" data="1" />


</o:shapelayout></xml><![endif]-->


</head>


<body lang="EN-US" link="blue" vlink="purple" style="word-wrap:break-word">


<div class="WordSection1">


<p class="MsoNormal">>As I am very new to this part of LLVM, I am not sure what is feasible or not. Would it be conceivable to either:<o:p></o:p></p>


<p class="MsoNormal">>1. have a way to inject some numeric cost to influence the value of some resulting combinations?<o:p></o:p></p>


<p class="MsoNormal">>2. revive some form of intrinsic and guarantee that the instruction would be generated?<o:p></o:p></p>


<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>


<p class="MsoNormal"><span style="color:#1F497D">I think a feasible approach is to add a new tuningXXX feature for the relevant targets and have the shuffle combine behave differently when that flag is set.<o:p></o:p></span></p>


<p class="MsoNormal"><span style="color:#1F497D">1) seems over-engineered, and 2) seems like overkill given the potential opportunities the combine offers.<o:p></o:p></span></p>
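<p class="MsoNormal"><span style="color:#1F497D">To make the tuning-flag idea concrete, here is a rough sketch as a toy Python model. It is not LLVM code: the class <code>ToySubtarget</code>, the flag <code>prefer_blends_over_p5_shuffles</code> and the function <code>lower_shuffle_group</code> are all hypothetical names, and the real combine would of course live in the X86 backend in C++.<o:p></o:p></span></p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySubtarget:
    # Hypothetical tuning flag; this is NOT an existing LLVM feature name.
    prefer_blends_over_p5_shuffles: bool = False


def lower_shuffle_group(subtarget):
    """Pick the sequence for one SHUFPS + 2 x BLENDPS group (toy model)."""
    if subtarget.prefer_blends_over_p5_shuffles:
        # Keep the blends: one more instruction, but only one port-5-only op.
        return ["SHUFPS", "BLENDPS", "BLENDPS"]
    # Default behaviour: fold to the smallest number of shuffle nodes.
    return ["UNPCKLPD", "UNPCKHPD"]


default_target = ToySubtarget()
tuned_target = ToySubtarget(prefer_blends_over_p5_shuffles=True)

print(lower_shuffle_group(default_target))
print(lower_shuffle_group(tuned_target))
```

<p class="MsoNormal"><span style="color:#1F497D">The point of the sketch is only that the default folding stays untouched and the alternative is opt-in per target.<o:p></o:p></span></p>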


<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>


<p class="MsoNormal"><span style="color:#1F497D">Thanks<o:p></o:p></span></p>


<p class="MsoNormal"><span style="color:#1F497D">Phoebe<o:p></o:p></span></p>


<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>


<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0in 0in 0in">


<p class="MsoNormal"><b>From:</b> llvm-dev &lt;llvm-dev-bounces@lists.llvm.org&gt; <b>On Behalf Of


</b>Nicolas Vasilache via llvm-dev<br>


<b>Sent:</b> Wednesday, November 10, 2021 5:46 PM<br>


<b>To:</b> Diego Caballero &lt;diegocaballero@google.com&gt;<br>


<b>Cc:</b> llvm-dev@lists.llvm.org<br>


<b>Subject:</b> Re: [llvm-dev] Understanding and controlling some of the AVX shuffle emission paths<o:p></o:p></p>


</div>


<p class="MsoNormal"><o:p> </o:p></p>


<div>


<div>


<p class="MsoNormal"><o:p> </o:p></p>


</div>


<p class="MsoNormal"><o:p> </o:p></p>


<div>


<div>


<p class="MsoNormal">On Wed, Nov 10, 2021 at 10:30 AM Diego Caballero <<a href="mailto:diegocaballero@google.com">diegocaballero@google.com</a>> wrote:<o:p></o:p></p>


</div>


<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in">


<div>


<p class="MsoNormal"><a href="mailto:ntv@google.com" target="_blank">+Nicolas Vasilache</a> :)<o:p></o:p></p>


</div>


</blockquote>


<div>


<p class="MsoNormal"><o:p> </o:p></p>


</div>


<div>


<p class="MsoNormal">Thanks Diego, email is hard, I could not find ways to inject myself into my own discussion...<o:p></o:p></p>


</div>


<div>


<p class="MsoNormal"> <o:p></o:p></p>


</div>


<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in">


<p class="MsoNormal"><o:p> </o:p></p>


<div>


<div>


<p class="MsoNormal">On Tue, Nov 9, 2021 at 10:32 PM Simon Pilgrim via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>> wrote:<o:p></o:p></p>


</div>


<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in">


<p class="MsoNormal">On 09/11/2021 20:44, Simon Pilgrim wrote:<br>


<br>


> On 09/11/2021 08:57, Nicolas Vasilache via llvm-dev wrote:<br>


>> Hi everyone,<br>


>><br>


>> I am experimenting with LLVM lowering, intrinsics and shufflevector <br>


>> in general.<br>


>><br>


>> Here is an IR that I produce with the objective of emitting some <br>


>> vblendps instructions: <br>


>> <a href="https://gist.github.com/nicolasvasilache/0fe30c83cbfe5b4776ec9f0ee465611a" target="_blank">


https://gist.github.com/nicolasvasilache/0fe30c83cbfe5b4776ec9f0ee465611a</a>. <br>


>><br>


> From what I can see, the original IR code was (effectively):<br>


><br>


> 8 x UNPCKLPS/UNPCKHPS<br>


> 4 x SHUFPS<br>


> 8 x BLENDPS<br>


> 4 x INSERTF128<br>


> 4 x PERM2F128<br>


><br>


>> I compile this further with<br>


>><br>


>> clang -x ir -emit-llvm -S -mcpu=haswell -O3 -o - | llc -O3 <br>


>> -mcpu=haswell - -o -<br>


>><br>


>> to obtain:<br>


>><br>


>> <a href="https://gist.github.com/nicolasvasilache/2c773b86fcda01cc28711828a0a9ce0a" target="_blank">


https://gist.github.com/nicolasvasilache/2c773b86fcda01cc28711828a0a9ce0a</a> <br>


>><br>


><br>


> and after the x86 shuffle combines:<br>


><br>


> 8 x UNPCKLPS/UNPCKHPS<br>


> 8 x UNPCKLPD/UNPCKHPD<br>


> 4 x INSERTF128<br>


> 4 x PERM2F128<br>


><br>


> Starting from each BLENDPS, they've combined with the SHUFPS to create <br>


> the UNPCK*PD nodes. We nearly always benefit from folding shuffle <br>


> chains to reduce total instruction counts, even if some inner nodes <br>


> have multiple uses (like the SHUFPS), and I'd hate to lose that.<br>


><br>


>> At this point, I would expect to see some vblendps instructions <br>


>> generated for the pieces of IR that produce %48/%49 %51/%52 %54/%55 <br>


>> and %57/%58 to reduce pressure on port 5 (vblendps can also go on <br>


>> ports 0 and 1). However the expected instruction does not get <br>


>> generated and llvm-mca continues to show me high port 5 contention.<br>


>><br>


>> Could people suggest some steps / commands to help better understand <br>


>> why my expectation is not met and whether I can do something to make <br>


>> the compiler generate what I want? Thanks in advance!<br>


> So on Haswell, we've gained 4 extra Port5-only shuffles but removed <br>


> the 8 Port015 blends.<br>


><br>


> We have very few arch-specific shuffle combines, just the <br>


> fast-variable-shuffle tuning flags to avoid unnecessary shuffle mask <br>


> loads; the shuffle combines just aim for a reduction in simple <br>


> target shuffle nodes. And tbh I'm reluctant to add to this as shuffle <br>


> combining is complex already.<br>


><br>


> We should be preferring to lower/combine to BLENDPS in more <br>


> circumstances (it's commutable and never slower than any other target <br>


> shuffle, although demanded elts can do less with 'undef' elements), <br>


> but that won't help us here.<br>


><br>


> So far I've failed to find a BLEND-based 8x8 transpose pattern that <br>


> the shuffle combiner doesn't manage to combine back to the <br>


> 8xUNPCK/SHUFPS ops :(<o:p></o:p></p>


</blockquote>


</div>


</blockquote>


<div>


<p class="MsoNormal"><o:p> </o:p></p>


</div>


<div>


<p class="MsoNormal">If you are referring to this specific code, yes same for me.<o:p></o:p></p>


</div>


<div>


<p class="MsoNormal">If you are thinking about the general 8x8 transpose problem, I have a version with vector<4xf32> loads that ends up using blends; as expected, the port 5 pressure reduction helps and both llvm-mca and runtime agree that this is 20-30% faster.<o:p></o:p></p>


</div>
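<p class="MsoNormal">For reference, the port-5 arithmetic from the instruction counts quoted above can be tallied with a small Python sketch. The port assignments are assumptions taken from this thread (shuffles treated as port-5-only on Haswell, BLENDPS able to issue on ports 0, 1 and 5); the names <code>P5_ONLY</code> and <code>tally</code> are made up for illustration.<o:p></o:p></p>

```python
# Assumption from the thread: on Haswell these shuffles issue on port 5 only,
# while BLENDPS can issue on ports 0, 1 and 5.
P5_ONLY = {"UNPCKxPS", "UNPCKxPD", "SHUFPS", "INSERTF128", "PERM2F128"}

# Instruction counts as listed for the original IR and after the combines.
before = {"UNPCKxPS": 8, "SHUFPS": 4, "BLENDPS": 8, "INSERTF128": 4, "PERM2F128": 4}
after = {"UNPCKxPS": 8, "UNPCKxPD": 8, "INSERTF128": 4, "PERM2F128": 4}


def tally(counts):
    """Return (total instructions, port-5-only instructions, blends)."""
    total = sum(counts.values())
    p5_only = sum(n for op, n in counts.items() if op in P5_ONLY)
    blends = counts.get("BLENDPS", 0)
    return total, p5_only, blends


print(tally(before))  # (28, 20, 8)
print(tally(after))   # (24, 24, 0)
```

<p class="MsoNormal">So the combined form has 4 fewer instructions overall but 4 more that can only go to port 5, which is consistent with the contention llvm-mca reports.<o:p></o:p></p>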


<div>


<p class="MsoNormal"> <o:p></o:p></p>


</div>


<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in">


<div>


<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in">


<p class="MsoNormal"><br>


The only thing I can think of is you might want to see if you can <br>


reorder the INSERTF128/PERM2F128 shuffles in between the UNPCK*PS and <br>


the SHUFPS/BLENDPS:<br>


<br>


8 x UNPCKLPS/UNPCKHPS<br>


4 x INSERTF128<br>


4 x PERM2F128<br>


4 x SHUFPS<br>


8 x BLENDPS<br>


<br>


Splitting the per-lane shuffles with the subvector-shuffles could help <br>


stop the shuffle combiner.<o:p></o:p></p>


</blockquote>


</div>


</blockquote>


<div>


<p class="MsoNormal"><o:p> </o:p></p>


</div>


<div>


<p class="MsoNormal">Right, I tried different variations here but invariably got the same result.<o:p></o:p></p>


</div>


<div>


<p class="MsoNormal">The vector<4xf32> based version is something that I also want to target for a bunch of orthogonal reasons.<o:p></o:p></p>


</div>


<div>


<p class="MsoNormal">I'll note that my use case is MLIR codegen with explicit vectors and intrinsics -> LLVM so I have quite some flexibility.<o:p></o:p></p>


</div>


<p class="MsoNormal">But it feels unnatural in the compiler flow to have to branch off at a significantly higher level of abstraction to sidestep concerns related to X86 microarchitecture details.<o:p></o:p></p>


<div>


<p class="MsoNormal"><o:p> </o:p></p>


</div>


<div>


<p class="MsoNormal">As I am very new to this part of LLVM, I am not sure what is feasible or not. Would it be conceivable to either:<o:p></o:p></p>


</div>


<div>


<p class="MsoNormal">1. have a way to inject some numeric cost to influence the value of some resulting combinations?<o:p></o:p></p>


</div>


<div>


<p class="MsoNormal">2. revive some form of intrinsic and guarantee that the instruction would be generated?<o:p></o:p></p>


</div>


<div>


<p class="MsoNormal"><o:p> </o:p></p>


</div>


<div>


<p class="MsoNormal">I realize point 2. is contrary to the evolution of LLVM as these intrinsics were removed ca. 2015 in favor of the combiner-based approach.<o:p></o:p></p>


</div>


<div>


<p class="MsoNormal">Still, it seems that `we have very few arch-specific shuffle combines` could be a signal that such intrinsics are needed?<o:p></o:p></p>


</div>


<div>


<p class="MsoNormal"> <o:p></o:p></p>


</div>


<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in">


<div>


<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in">


<p class="MsoNormal"><br>


>> I have verified independently that in isolation, a single such <br>


>> shuffle creates a vblendps. I see them being recombined in the <br>


>> produced assembly and I am looking to experiment with preventing <br>


>> vshufps + vblendps + vblendps from being recombined into vunpckxxx + <br>


>> vunpckxxx instructions.<br>


>><br>


>> -- <br>


_______________________________________________<br>


LLVM Developers mailing list<br>


<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a><br>


<a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><o:p></o:p></p>


</blockquote>


</div>


</blockquote>
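<p class="MsoNormal">To illustrate why a vshufps + vblendps + vblendps group is equivalent to a pair of vunpck*pd instructions, here is a small Python model of a single 128-bit lane. The immediates (0x4E for the shuffle, 0xC and 0x3 per lane for the blends) are an assumption based on the commonly used blend-based transpose step, not a copy of the masks in the gists; the helper names are made up for this sketch.<o:p></o:p></p>

```python
def shufps(a, b, imm):
    """One 128-bit lane of SHUFPS: two elements from a, then two from b."""
    return [a[imm & 3], a[(imm >> 2) & 3], b[(imm >> 4) & 3], b[(imm >> 6) & 3]]


def blendps(a, b, mask):
    """One 128-bit lane of BLENDPS: bit i set selects b[i], otherwise a[i]."""
    return [b[i] if (mask >> i) & 1 else a[i] for i in range(4)]


def unpcklpd(a, b):
    """Low 64-bit halves of a and b, viewed as pairs of f32 elements."""
    return [a[0], a[1], b[0], b[1]]


def unpckhpd(a, b):
    """High 64-bit halves of a and b, viewed as pairs of f32 elements."""
    return [a[2], a[3], b[2], b[3]]


a = ["a0", "a1", "a2", "a3"]
b = ["b0", "b1", "b2", "b3"]

v = shufps(a, b, 0x4E)     # [a2, a3, b0, b1]
low = blendps(a, v, 0xC)   # elements 2 and 3 taken from v
high = blendps(b, v, 0x3)  # elements 0 and 1 taken from v

print(low == unpcklpd(a, b))   # True
print(high == unpckhpd(a, b))  # True
```

<p class="MsoNormal">Under these assumed masks, three instructions (one port-5 shuffle plus two blends) compute exactly what two port-5 unpacks compute, which is the equivalence the shuffle combiner exploits.<o:p></o:p></p>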


</div>


<p class="MsoNormal"><br clear="all">


<o:p></o:p></p>


<div>


<p class="MsoNormal"><o:p> </o:p></p>


</div>


<p class="MsoNormal">-- <o:p></o:p></p>


<div>


<div>


<p class="MsoNormal">N<o:p></o:p></p>


</div>


</div>


</div>


</div>


</body>


</html>