<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
</head>
<body>
<p><br>
</p>
<div class="moz-cite-prefix">On 8/19/20 3:09 PM, Topper, Craig
wrote:<br>
</div>
<blockquote type="cite" cite="mid:MWHPR11MB00460EDD0515496504C1D4A8935D0@MWHPR11MB0046.namprd11.prod.outlook.com">
<meta name="Generator" content="Microsoft Word 15 (filtered
medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin-top:0in;
margin-right:0in;
margin-bottom:8.0pt;
margin-left:0in;
line-height:105%;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:#0563C1;
text-decoration:underline;}
pre
{mso-style-priority:99;
mso-style-link:"HTML Preformatted Char";
margin:0in;
margin-bottom:.0001pt;
font-size:10.0pt;
font-family:"Courier New";}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
{mso-style-priority:34;
margin-top:0in;
margin-right:0in;
margin-bottom:8.0pt;
margin-left:0in;
text-indent:21.0pt;
line-height:105%;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
span.HTMLPreformattedChar
{mso-style-name:"HTML Preformatted Char";
mso-style-priority:99;
mso-style-link:"HTML Preformatted";
font-family:"Courier New";}
span.EmailStyle23
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}mso-level-tab-stop:4.5in;
mso-level-number-position:left;
text-indent:-.25in;}
ol
{margin-bottom:0in;}
ul
{margin-bottom:0in;}</style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
<div class="WordSection1">
<p class="MsoNormal">The width and height can be runtime values
that we would just copy into the 64-byte configuration block we
pass to ldtilecfg, so the code doesn’t need to be
multiversioned. The user code would also use those values to
update pointers in the loops it writes using the tiles. If we
can’t determine that two tiles were defined with the same
width and height, we need to assume the shapes are different and
avoid ever assigning them the same tile register.<o:p></o:p></p>
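<p class="MsoNormal">For illustration, here is a minimal sketch of what such a 64-byte configuration block could look like in C. The field layout follows the palette 1 description in Intel's instruction set reference; the names tile_config, tile_config_init and tile_config_set are hypothetical and do not come from this thread or from any LLVM header.</p>
<pre>
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout of the 64-byte ldtilecfg memory operand (palette 1). */
typedef struct {
    uint8_t  palette_id;    /* byte 0: 1 selects the 8-tile palette */
    uint8_t  start_row;     /* byte 1: restart row for interrupted loads/stores */
    uint8_t  reserved0[14]; /* bytes 2-15: must be zero */
    uint16_t colsb[16];     /* bytes 16-47: bytes per row; entries 0-7 are used */
    uint8_t  rows[16];      /* bytes 48-63: row counts; entries 0-7 are used */
} tile_config;

static_assert(sizeof(tile_config) == 64, "ldtilecfg operand is 64 bytes");
static_assert(offsetof(tile_config, colsb) == 16, "colsb starts at byte 16");
static_assert(offsetof(tile_config, rows) == 48, "rows start at byte 48");

/* Hypothetical helper: zero the block and select palette 1. */
static void tile_config_init(tile_config *cfg) {
    memset(cfg, 0, sizeof *cfg);
    cfg->palette_id = 1;
}

/* Hypothetical helper: copy a runtime shape into the slot of one tile.
   Returns 0 on success, -1 if the tile index or the shape is out of range
   (at most 8 tiles, 16 rows, 64 bytes per row). */
static int tile_config_set(tile_config *cfg, unsigned tile,
                           unsigned rows, unsigned colsb) {
    if (tile >= 8 || rows == 0 || rows > 16 || colsb == 0 || colsb > 64)
        return -1;
    cfg->rows[tile] = (uint8_t)rows;
    cfg->colsb[tile] = (uint16_t)colsb;
    return 0;
}
```
</pre>
<p class="MsoNormal">Because the shape fields are plain stores, runtime values can be written into the block without generating a separate code version per shape.</p>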
<p class="MsoNormal">Hal, under your suggestion, would the
mapping of physical registers to register classes be defined
dynamically before register allocation?</p>
</div>
</blockquote>
<p><br>
</p>
<p>Here's my thought:<br>
</p>
<p>First, you have a set of intrinsics that take tile values along
with tile configuration parameters (which, presently, seem just to
be the sizes). These get lowered into pseudo-instructions that do
the same. Thus, you have some register class that represents these
arbitrarily-sized tile registers that you'll assign to these
pseudo-instruction operands (i.e., they take virtual tile
registers right after instruction selection). You might use the
16x16 tile register class for this purpose, but it shouldn't
really matter.<br>
</p>
<p>Second, you run this configuration-placement pass. This pass
looks at all of the AMX pseudo-instructions and identifies regions
in which the pseudo-instructions use the same configuration
parameters (i.e., the same SSA values and/or constants). This pass
might reorder the pseudo-instructions when legal in order to form
larger regions. Then it places the ldtilecfg at the start of each
region (in some common dominating position). ldtilecfg implicitly
defines all of the tile registers in every concrete class of tile
registers (all 256 of them, or whatever). The pseudo-instructions
are replaced by real MI instructions taking a tile register class
appropriate for the configuration (which will default to the 16x16
class for cases where the configuration is not a
compile-time-known constant). When the configuration is a known
constant, the instructions take operands with a register class
appropriate for that configuration (e.g., 1x1, 4x4).</p>
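<p>A rough sketch of the region-forming step, under strong simplifying assumptions: the pseudo-instructions are modeled as a straight-line sequence (no control flow, no reordering), and each one carries only the identity of its configuration parameters. The names pseudo_inst and place_configs are hypothetical and are not LLVM APIs.</p>
<pre>
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of an AMX pseudo-instruction: only the identity of
   its shape operands matters here (SSA value ids or constants). */
typedef struct {
    int row_id;
    int col_id;
} pseudo_inst;

/* Scans a straight-line sequence and starts a new region whenever the
   configuration parameters differ from those of the previous instruction.
   The index of the first instruction of each region is stored in starts[],
   which must have room for n entries. Returns the number of regions, which
   is the number of ldtilecfg instructions this sketch would place. */
static size_t place_configs(const pseudo_inst *insts, size_t n,
                            size_t *starts) {
    size_t regions = 0;
    assert(n == 0 || (insts != NULL && starts != NULL));
    for (size_t i = 0; i < n; ++i) {
        int same = i > 0 &&
                   insts[i].row_id == insts[i - 1].row_id &&
                   insts[i].col_id == insts[i - 1].col_id;
        if (!same)
            starts[regions++] = i;
    }
    return regions;
}
```
</pre>
<p>A real pass would work on the control-flow graph and pick a common dominating position for each region; the linear scan only illustrates the grouping by identical configuration parameters.</p>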
<p>Third, the rest of the framework runs as usual. Tile registers
from the appropriate class are allocated by the register
allocator. No live range of any virtual tile register can pass
through the ldtilecfg (because it defines them all), but that's
okay: by construction, none of the live ranges will (the
configuration-placement pass ensures this).<br>
</p>
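<p>The invariant in the third step can be sketched as a simple check, assuming a linearized instruction stream in which each virtual tile register has a single definition and a last use. The names live_range and crosses_config are hypothetical, not LLVM APIs.</p>
<pre>
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical live range of a virtual tile register, as positions in a
   linearized instruction stream: defined at def, last read at last_use. */
typedef struct {
    size_t def;
    size_t last_use;
} live_range;

/* Returns 1 if some ldtilecfg position lies strictly between the definition
   and the last use, which would redefine the register while its value is
   still needed; returns 0 otherwise. */
static int crosses_config(live_range r, const size_t *cfg_pos, size_t n_cfg) {
    assert(r.def <= r.last_use);
    for (size_t i = 0; i < n_cfg; ++i) {
        if (r.def < cfg_pos[i] && cfg_pos[i] < r.last_use)
            return 1;
    }
    return 0;
}
```
</pre>
<p>If the configuration-placement pass does its job, this check returns 0 for every virtual tile register.</p>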
<p>Thanks again,</p>
<p>Hal<br>
</p>
<p><br>
</p>
<blockquote type="cite"
cite="mid:MWHPR11MB00460EDD0515496504C1D4A8935D0@MWHPR11MB0046.namprd11.prod.outlook.com">
<div class="WordSection1">
<p class="MsoNormal"><o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<b>From:</b> Hal Finkel <a class="moz-txt-link-rfc2396E" href="mailto:hfinkel@anl.gov"><hfinkel@anl.gov></a> <br>
<b>Sent:</b> Wednesday, August 19, 2020 12:52 PM<br>
<b>To:</b> Kaylor, Andrew <a class="moz-txt-link-rfc2396E" href="mailto:andrew.kaylor@intel.com"><andrew.kaylor@intel.com></a>;
Luo, Yuanke <a class="moz-txt-link-rfc2396E" href="mailto:yuanke.luo@intel.com"><yuanke.luo@intel.com></a>; Philip Reames
<a class="moz-txt-link-rfc2396E" href="mailto:listmail@philipreames.com"><listmail@philipreames.com></a>;
<a class="moz-txt-link-abbreviated" href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>; <a class="moz-txt-link-abbreviated" href="mailto:florian_hahn@apple.com">florian_hahn@apple.com</a>; Topper,
Craig <a class="moz-txt-link-rfc2396E" href="mailto:craig.topper@intel.com"><craig.topper@intel.com></a>; Lu, Hongjiu
<a class="moz-txt-link-rfc2396E" href="mailto:hongjiu.lu@intel.com"><hongjiu.lu@intel.com></a><br>
<b>Subject:</b> Re: [llvm-dev] Intel AMX programming model
discussion.<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<p><o:p> </o:p></p>
<div>
<p class="MsoNormal">On 8/19/20 10:24 AM, Kaylor, Andrew
wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p>> When the tile shape is unknown at compile time, how do
you plan to do the register allocation of the tiles? My
question is: do you do the allocation for this case in the
same way as you would if you knew the size was 16x16 (i.e.,
conservatively assume the largest size)?<o:p></o:p></p>
<p class="MsoNormal">I think what will happen is that the
registers are allocated based on a number of runtime values
that are assumed to be different from one another but less
than or equal to 16. So, for example, we’ll allocate
registers for MxN tiles, NxM tiles and MxM tiles without
knowing what M and N are. Then at runtime the values of
these variables will be used to create the actual tile
configuration. The instructions that need to know the shape
take these runtime values as operands.<o:p></o:p></p>
</blockquote>
<p><o:p> </o:p></p>
<p>So you're going to multiversion the code?<o:p></o:p></p>
<p>In any case, my point is that you probably don't need a
custom register allocator. If you just define the tile
registers and make sure that the ldtilecfgs implicitly defines
them all, then the regular infrastructure likely works. You'll
have a bunch of register classes, but that's not necessarily a
problem. I recommend trying this, and let us know what you
discover, before we go down the road of a new, dedicated
allocator just for these registers.<o:p></o:p></p>
<p> -Hal<o:p></o:p></p>
<p><o:p> </o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">There may be some artifacts coming from
the front end that conservatively assume a 16x16 tile, but I
think those generally go away in SROA or later specialized
passes. Yuanke can confirm or correct my understanding of
this.<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<b>From:</b> Hal Finkel <a
href="mailto:hfinkel@anl.gov" moz-do-not-send="true"><hfinkel@anl.gov></a>
<br>
<b>Sent:</b> Wednesday, August 19, 2020 5:14 AM<br>
<b>To:</b> Luo, Yuanke <a
href="mailto:yuanke.luo@intel.com"
moz-do-not-send="true"><yuanke.luo@intel.com></a>;
Kaylor, Andrew
<a href="mailto:andrew.kaylor@intel.com"
moz-do-not-send="true"><andrew.kaylor@intel.com></a>;
Philip Reames
<a href="mailto:listmail@philipreames.com"
moz-do-not-send="true"><listmail@philipreames.com></a>;
<a href="mailto:llvm-dev@lists.llvm.org"
moz-do-not-send="true">
llvm-dev@lists.llvm.org</a>; <a
href="mailto:florian_hahn@apple.com"
moz-do-not-send="true">florian_hahn@apple.com</a>;
Topper, Craig
<a href="mailto:craig.topper@intel.com"
moz-do-not-send="true"><craig.topper@intel.com></a>;
Lu, Hongjiu
<a href="mailto:hongjiu.lu@intel.com"
moz-do-not-send="true"><hongjiu.lu@intel.com></a><br>
<b>Subject:</b> Re: [llvm-dev] Intel AMX programming
model discussion.<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<p> <o:p></o:p></p>
<div>
<p class="MsoNormal">On 8/19/20 5:34 AM, Luo, Yuanke wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">Having 256 register classes is not a
problem in itself; it is just a lot of register classes.<o:p></o:p></p>
<p class="MsoNormal">We don’t assume that the shape of each
physical register is 16x16; it is defined by the user. By
variable shape, I mean that the shape is known at runtime but
unknown at compile time. Take the code below as
an example: %row and %col are variables rather than
constants. The compiler recognizes llvm.x86.tileloadd64 and
deduces that the shape of %0 is %row x %col.<o:p></o:p></p>
<p class="MsoNormal">%0 = tail call <256 x i32>
@llvm.x86.tileloadd64(i16 %row, i16 %col, i8*
getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf,
i64 0, i64 0), i64 32)<o:p></o:p></p>
</blockquote>
<p> <o:p></o:p></p>
<p>When the tile shape is unknown at compile time, how do you
plan to do the register allocation of the tiles? My question
is: do you do the allocation for this case in the same way
as you would if you knew the size was 16x16 (i.e.,
conservatively assume the largest size)?<o:p></o:p></p>
<p>Thanks again,<o:p></o:p></p>
<p>Hal<o:p></o:p></p>
<p> <o:p></o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<b>From:</b> Hal Finkel <a
href="mailto:hfinkel@anl.gov" moz-do-not-send="true"><hfinkel@anl.gov></a>
<br>
<b>Sent:</b> Wednesday, August 19, 2020 4:58 PM<br>
<b>To:</b> Luo, Yuanke <a
href="mailto:yuanke.luo@intel.com"
moz-do-not-send="true"><yuanke.luo@intel.com></a>;
Kaylor, Andrew
<a href="mailto:andrew.kaylor@intel.com"
moz-do-not-send="true"><andrew.kaylor@intel.com></a>;
Philip Reames
<a href="mailto:listmail@philipreames.com"
moz-do-not-send="true"><listmail@philipreames.com></a>;
<a href="mailto:llvm-dev@lists.llvm.org"
moz-do-not-send="true">
llvm-dev@lists.llvm.org</a>; <a
href="mailto:florian_hahn@apple.com"
moz-do-not-send="true">florian_hahn@apple.com</a>;
Topper, Craig
<a href="mailto:craig.topper@intel.com"
moz-do-not-send="true"><craig.topper@intel.com></a>;
Lu, Hongjiu
<a href="mailto:hongjiu.lu@intel.com"
moz-do-not-send="true"><hongjiu.lu@intel.com></a><br>
<b>Subject:</b> Re: [llvm-dev] Intel AMX programming
model discussion.<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<p> <o:p></o:p></p>
<div>
<p class="MsoNormal">On 8/19/20 2:21 AM, Luo, Yuanke
wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">Hi Hal,<o:p></o:p></p>
<p class="MsoNormal">There are 3 aspects to be solved. <o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l1
level1 lfo2">
<!--[if !supportLists]--><span style="mso-list:Ignore">1.<span
style="font:7.0pt "Times New Roman"">
</span></span><!--[endif]-->The HW supports a maximum shape of
16x16, so there are many register classes, from 1x1 to
16x16. We need 256 register classes.
<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l1
level1 lfo2">
<!--[if !supportLists]--><span style="mso-list:Ignore">2.<span
style="font:7.0pt "Times New Roman"">
</span></span><!--[endif]-->We want to support
variable shapes, so the compiler doesn’t know which register
class fits a tile shape, as the shape is only known at runtime.<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l1
level1 lfo2">
<!--[if !supportLists]--><span style="mso-list:Ignore">3.<span
style="font:7.0pt "Times New Roman"">
</span></span><!--[endif]-->The tile configuration
configures the physical tile registers, so we need to allocate
registers first; only then do we know the shape of each physical
tile register and can configure it.<o:p></o:p></p>
<p class="MsoNormal">I think your suggestion helps to
reduce the complexity if we only support fixed
(constant) tile shapes.<o:p></o:p></p>
<p class="MsoNormal">-Yuanke<o:p></o:p></p>
</blockquote>
<p> <o:p></o:p></p>
<p>Thanks, Yuanke.<o:p></o:p></p>
<p>It's not clear to me that having 256 register classes is,
in itself, a problem. Is it?<o:p></o:p></p>
<p>What does it mean to support variable-shape tiles in this
context? Do you do something other than conservatively
assume that they are 16x16 for register-allocation
purposes?<o:p></o:p></p>
<p> -Hal<o:p></o:p></p>
<p> <o:p></o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<b>From:</b> Hal Finkel <a
href="mailto:hfinkel@anl.gov"
moz-do-not-send="true"><hfinkel@anl.gov></a>
<br>
<b>Sent:</b> Wednesday, August 19, 2020 8:20 AM<br>
<b>To:</b> Kaylor, Andrew <a
href="mailto:andrew.kaylor@intel.com"
moz-do-not-send="true"><andrew.kaylor@intel.com></a>;
Philip Reames
<a href="mailto:listmail@philipreames.com"
moz-do-not-send="true"><listmail@philipreames.com></a>;
Luo, Yuanke
<a href="mailto:yuanke.luo@intel.com"
moz-do-not-send="true"><yuanke.luo@intel.com></a>;
<a href="mailto:llvm-dev@lists.llvm.org"
moz-do-not-send="true">
llvm-dev@lists.llvm.org</a>; <a
href="mailto:florian_hahn@apple.com"
moz-do-not-send="true">florian_hahn@apple.com</a>;
Topper, Craig
<a href="mailto:craig.topper@intel.com"
moz-do-not-send="true"><craig.topper@intel.com></a>;
Lu, Hongjiu
<a href="mailto:hongjiu.lu@intel.com"
moz-do-not-send="true"><hongjiu.lu@intel.com></a><br>
<b>Subject:</b> Re: [llvm-dev] Intel AMX programming
model discussion.<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<p>Hi, Andy,<o:p></o:p></p>
<p>I don't quite understand everything that's going on
here. Could we model this as:<o:p></o:p></p>
<p> 1. Define a collection of register classes, one for
2x4 tiles, one for 4x2 tiles, etc. each populated with a
set of tile registers. Registers can have aliasing
relationships (instead of worrying of any kind of
subregister/superregister relationships -- these won't
be useful anyway).<o:p></o:p></p>
<p> 2. Define the tile-configuration instructions so that
they implicitly define all of the registers in all of
the classes.<o:p></o:p></p>
<p>Then you would still need to pre-schedule the tile
operations as you've described, and collect the
configuration information in order to add the
ldtilecfgs, but the regular register allocator can
handle the allocation itself in the usual way. What do
you think?<o:p></o:p></p>
<p> -Hal<o:p></o:p></p>
<div>
<p class="MsoNormal">On 8/18/20 6:58 PM, Kaylor, Andrew
via llvm-dev wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
The AMX registers are complicated. The single
configuration register (which is mostly used
implicitly, similar to MXCSR for floating point)
controls the shape of all the tile registers, and if
you change the tile configuration every single tile
register is cleared. In practice, if we have to change
the configuration while any of the tile registers
are live, performance is going to be terrible. We need
to handle this case for correctness, but users of this
programming interface will need to have enough
awareness of the performance issues and the hardware
details to prevent this. We’ll also want a diagnostic
that lets the user know when this has happened.<o:p></o:p></p>
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<o:p></o:p></p>
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
When the tile configuration is set, the shape of each
tile is locked in, so the individual tile registers
aren’t interchangeable at that point. If a function
needs 2x4 tiles, 4x2 tiles, and 4x4 tiles, the
configuration needs to be set with this in mind. The
shape isn’t explicit in every instruction and
intrinsic. It must be deduced. And again, we’ll need a
way to tell the user when efficient allocation can’t
be done. In practice, I don’t expect any function to
be using more than three tile shapes.<o:p></o:p></p>
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<o:p></o:p></p>
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
The implication of all this is that I don’t think the
greedy register allocator is well suited to figure all
of this out. We need a special pass to pre-allocate
these registers. If the function is written in a way
that makes good performance possible, it should be a
relatively simple task to allocate everything with
minimal spilling. If it isn’t possible to get good
performance, we don’t need to do anything especially
clever. We can just do something straightforward that
is correct and let the user know that they aren’t
going to be happy with the results.<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">-Andy<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<b>From:</b> Philip Reames <a
href="mailto:listmail@philipreames.com"
moz-do-not-send="true"><listmail@philipreames.com></a>
<br>
<b>Sent:</b> Friday, August 14, 2020 8:29 PM<br>
<b>To:</b> Luo, Yuanke <a
href="mailto:yuanke.luo@intel.com"
moz-do-not-send="true"><yuanke.luo@intel.com></a>;
<a href="mailto:llvm-dev@lists.llvm.org"
moz-do-not-send="true">llvm-dev@lists.llvm.org</a>;
<a href="mailto:florian_hahn@apple.com"
moz-do-not-send="true">
florian_hahn@apple.com</a>; Kaylor, Andrew <a
href="mailto:andrew.kaylor@intel.com"
moz-do-not-send="true">
<andrew.kaylor@intel.com></a>; Topper,
Craig <a href="mailto:craig.topper@intel.com"
moz-do-not-send="true">
<craig.topper@intel.com></a>; Lu, Hongjiu
<a href="mailto:hongjiu.lu@intel.com"
moz-do-not-send="true"><hongjiu.lu@intel.com></a><br>
<b>Subject:</b> Re: [llvm-dev] Intel AMX
programming model discussion.<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<p>I find your answer unconvincing. I'm not going to
debate it as I don't wish to take the time to build
the appropriate context, but my initial response is
skepticism.<o:p></o:p></p>
<p>Philip<o:p></o:p></p>
<div>
<p class="MsoNormal">On 8/14/20 4:49 PM, Luo, Yuanke
wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">[Yuanke] The AMX registers are special.
They need to be configured before use, and the config
instruction is expensive. To avoid unnecessary tile
configuration, we collect as much tile shape information as
possible and combine it into one ldtilecfg
instruction. The ldtilecfg instruction should
dominate any AMX instruction that accesses a tile
register. On the other hand, the ldtilecfg should
post-dominate the instructions that define the tile
shape. To avoid reconfiguration caused by a different
tile shape when spilling a tile register, the
spilled register should be reloaded into a register
that has the same tile shape. Since tile register
allocation is special and may allocate general
virtual registers to configure the tile registers, we can
add a separate pass that does it before the general register
allocation pass. After register allocation, the tile
shape information is no longer needed, so we can
transform the pseudo AMX instructions into real AMX
instructions by removing the row and column operands.<o:p></o:p></p>
<p>[Philip]<o:p></o:p></p>
<p>This seems complicated.<o:p></o:p></p>
<p>Reading through the documentation, there appears to
be a single global tile config for all tile
registers at any time.<o:p></o:p></p>
<p>Why not simply model this tile config as a
designated special register and the tile
instructions as having an implicit use of this
register? That would seem to ensure that the
register allocator has all the constraints needed.
You'd need to teach it how to spill the special
registers with the appropriate instructions, but
that seems a lot more straightforward?<o:p></o:p></p>
<p class="MsoNormal"><span
style="font-size:10.5pt;line-height:105%">[Yuanke]
In that case the user needs to configure the tile
registers themselves. Spilling the configuration
register is very expensive, because it clears all
the tile data registers to zero. In our proposal,
the compiler is responsible for deducing the shape of
each virtual tile data register, allocating physical
registers for them, and then configuring those
physical registers. We may build the dependency as
you proposed, and it can be used by a machine IR
check to ensure that a tile data register is configured
before use. </span><o:p></o:p></p>
<p class="MsoNormal"><span
style="font-size:10.5pt;line-height:105%"> </span><o:p></o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<b>From:</b> Philip Reames <a
href="mailto:listmail@philipreames.com"
moz-do-not-send="true"><listmail@philipreames.com></a>
<br>
<b>Sent:</b> Saturday, August 15, 2020 1:17 AM<br>
<b>To:</b> Luo, Yuanke <a
href="mailto:yuanke.luo@intel.com"
moz-do-not-send="true"><yuanke.luo@intel.com></a>;
<a href="mailto:llvm-dev@lists.llvm.org"
moz-do-not-send="true">llvm-dev@lists.llvm.org</a>;
<a href="mailto:florian_hahn@apple.com"
moz-do-not-send="true">
florian_hahn@apple.com</a>; Kaylor, Andrew <a
href="mailto:andrew.kaylor@intel.com"
moz-do-not-send="true">
<andrew.kaylor@intel.com></a>; Topper,
Craig <a href="mailto:craig.topper@intel.com"
moz-do-not-send="true">
<craig.topper@intel.com></a>; Lu,
Hongjiu <a href="mailto:hongjiu.lu@intel.com"
moz-do-not-send="true"><hongjiu.lu@intel.com></a><br>
<b>Subject:</b> Re: [llvm-dev] Intel AMX
programming model discussion.<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<p> <o:p></o:p></p>
<div>
<p class="MsoNormal">On 8/14/20 6:27 AM, Luo, Yuanke
via llvm-dev wrote:<o:p></o:p></p>
</div>
<blockquote
style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">Hi,<o:p></o:p></p>
<p class="MsoNormal">Intel Advanced Matrix
Extensions (Intel AMX) is a new programming
paradigm consisting of two components: a set of
2-dimensional registers (tiles) representing
sub-arrays from a larger 2-dimensional memory
image, and accelerators able to operate on tiles.
The capabilities of an Intel AMX implementation are
enumerated by palettes. Two palettes are
supported: palette 0 represents the initialized
state, and palette 1 consists of 8 tile registers
of up to 1 KB each, which are controlled by a tile
control register.<o:p></o:p></p>
<p class="MsoNormal">The instruction manual is
posted at <a
href="https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html"
moz-do-not-send="true">
https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html</a>.<o:p></o:p></p>
<p class="MsoNormal">The AMX ABI proposal is posted
at <a
href="https://groups.google.com/g/x86-64-abi/c/NRejFm7pwb4"
moz-do-not-send="true">
https://groups.google.com/g/x86-64-abi/c/NRejFm7pwb4</a>.<o:p></o:p></p>
<p class="MsoNormal">This email is to discuss the
programming model for AMX. Florian has introduced
the matrix type and intrinsics to the LLVM community.
We’d like to adopt some ideas from it.<o:p></o:p></p>
<p class="MsoNormal">Here is what we propose for the
AMX programming model.<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo4">
<!--[if !supportLists]--><span
style="mso-list:Ignore">1.<span
style="font:7.0pt "Times New Roman"">
</span></span><!--[endif]--> Data type. <o:p></o:p></p>
<p class="MsoNormal">We’d like to have a fixed vector
type for AMX. Since the shape of an AMX register is
configurable, the vector size is the maximum
size of an AMX register. That means the vector size
is 1024 bytes.<o:p></o:p></p>
<p class="MsoNormal">The C code may look like this.<o:p></o:p></p>
<p class="MsoNormal">typedef int _tile_data
__attribute__((__vector_size__(1024),
__aligned__(64)));<o:p></o:p></p>
<p class="MsoNormal">_tile_data tile;<o:p></o:p></p>
<p class="MsoNormal">And the LLVM IR may look like
this.<o:p></o:p></p>
<p class="MsoNormal">@tile = dso_local
local_unnamed_addr global <256 x i32>
zeroinitializer, align 64<o:p></o:p></p>
<p class="MsoNormal">For LLVM IR, it would be nice to have
a new type, x86_amxtile, that can be mapped to AMX
registers.<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo4">
<!--[if !supportLists]--><span
style="mso-list:Ignore">2.<span
style="font:7.0pt "Times New Roman"">
</span></span><!--[endif]-->AMX Intrinsics. <o:p></o:p></p>
<p class="MsoNormal">The internal intrinsics map 1:1
to AMX instructions. The parameters m, n, and k
identify the shape of the tile. The shape can be
variable, but it cannot exceed the size that the AMX
HW supports. The compiler can deduce the shape of the
tile from the AMX intrinsics.<o:p></o:p></p>
<p class="MsoNormal" style="text-indent:5.5pt">_tile_data
_tile_loadd_internal(char m, short n, const void
*base, int stride);<o:p></o:p></p>
<p class="MsoNormal">_tile_data
_tile_dpbssd_internal(char m, short n, short k,
_tile_data dst, _tile_data src1, _tile_data src2);<o:p></o:p></p>
<p class="MsoNormal">_tile_data
_tile_dpbf16ps_internal(char m, short n, short k,
_tile_data dst, _tile_data src1, _tile_data src2);<o:p></o:p></p>
<p class="MsoNormal">void _tile_stored_internal(char
m, short n, void *base, int stride, _tile_data
tile);<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo4">
<!--[if !supportLists]--><span
style="mso-list:Ignore">3.<span
style="font:7.0pt "Times New Roman"">
</span></span><!--[endif]-->User interfaces.<o:p></o:p></p>
<p class="MsoNormal">The tile shape and tile data
are combined into a struct in C. The
shape of the tile may be
initialized only once. The user interface looks like
this.<o:p></o:p></p>
<p class="MsoNormal">#define
__DEFAULT_FN_AMX \<o:p></o:p></p>
<p class="MsoNormal">
__attribute__((__always_inline__, __nodebug__,
__target__("amx-int8")))<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">typedef struct __tile_str
{<o:p></o:p></p>
<p class="MsoNormal">const char row;<o:p></o:p></p>
<p class="MsoNormal">const short col;<o:p></o:p></p>
<p class="MsoNormal">_tile_data tile;<o:p></o:p></p>
<p class="MsoNormal">}__tile;<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">__DEFAULT_FN_AMX<o:p></o:p></p>
<p class="MsoNormal">void __tile_loadd(__tile
*dst, const void *base, long stride) {<o:p></o:p></p>
<p class="MsoNormal">dst->tile =
_tile_loadd_internal(dst->row, dst->col,
base, stride);<o:p></o:p></p>
<p class="MsoNormal">}<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">__DEFAULT_FN_AMX<o:p></o:p></p>
<p class="MsoNormal">void __tile_dpbsud(__tile
*dst, __tile src1, __tile src2) {<o:p></o:p></p>
<p class="MsoNormal">dst->tile =
_tile_dpbssd_internal(src1.row, src2.col,
src1.col, dst->tile, src1.tile, src2.tile);<o:p></o:p></p>
<p class="MsoNormal">}<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">__DEFAULT_FN_AMX<o:p></o:p></p>
<p class="MsoNormal">void __tile_stored(void
*base, long stride, __tile src) {<o:p></o:p></p>
<p class="MsoNormal">
_tile_stored_internal(src.row, src.col, base,
stride, src.tile);<o:p></o:p></p>
<p class="MsoNormal">}<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo4">
<!--[if !supportLists]--><span
style="mso-list:Ignore">4.<span
style="font:7.0pt "Times New Roman"">
</span></span><!--[endif]-->Example code<o:p></o:p></p>
<p class="MsoNormal">The example shows how to use
the user interface in a function.
<o:p></o:p></p>
<p class="MsoNormal">void api(int cond, short
row, short col) {<o:p></o:p></p>
<p class="MsoNormal">__tile a = {row, col};<o:p></o:p></p>
<p class="MsoNormal">__tile b = {row, col};<o:p></o:p></p>
<p class="MsoNormal">__tile c = {row, col};<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">if(cond) {<o:p></o:p></p>
<p class="MsoNormal">__tile_loadd(&a,
buf, STRIDE);<o:p></o:p></p>
<p class="MsoNormal">__tile_loadd(&b,
buf, STRIDE);<o:p></o:p></p>
<p class="MsoNormal">__tile_loadd(&c,
buf, STRIDE);<o:p></o:p></p>
<p class="MsoNormal">} else {<o:p></o:p></p>
<p class="MsoNormal">__tile_loadd(&a,
buf2, STRIDE);<o:p></o:p></p>
<p class="MsoNormal">__tile_loadd(&b,
buf2, STRIDE);<o:p></o:p></p>
<p class="MsoNormal">__tile_loadd(&c,
buf2, STRIDE);<o:p></o:p></p>
<p class="MsoNormal">}<o:p></o:p></p>
<p class="MsoNormal"><span lang="IT">
__tile_dpbsud(&c, a, b);</span><o:p></o:p></p>
<p class="MsoNormal">__tile_stored(buf, STRIDE,
c);<o:p></o:p></p>
<p class="MsoNormal">}<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo4">
<!--[if !supportLists]--><span
style="mso-list:Ignore">5.<span
style="font:7.0pt "Times New Roman"">
</span></span><!--[endif]-->LLVM IR<o:p></o:p></p>
<p class="MsoNormal">The LLVM IR intrinsics take the
row and column information as input parameters,
so that the compiler can deduce the shape of the tile
data. The remaining parameters are what the AMX
instructions require. This is the LLVM IR
corresponding to the example code.<o:p></o:p></p>
<p class="MsoNormal">define dso_local void
@api(i32 %cond, i16 signext %row, i16 signext
%col) local_unnamed_addr #2 {<o:p></o:p></p>
<p class="MsoNormal">entry:<o:p></o:p></p>
<p class="MsoNormal">%tobool = icmp eq i32
%cond, 0<o:p></o:p></p>
<p class="MsoNormal">%sext = shl i16 %col, 8<o:p></o:p></p>
<p class="MsoNormal">%conv.i31 = ashr exact i16
%sext, 8<o:p></o:p></p>
<p class="MsoNormal">br i1 %tobool, label
%if.else, label %if.then<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">if.then:
; preds = %entry<o:p></o:p></p>
<p class="MsoNormal">%0 = tail call <256 x
i32> @llvm.x86.tileloadd64(i16 %row, i16
%conv.i31, i8* getelementptr inbounds ([1024 x
i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3<o:p></o:p></p>
<p class="MsoNormal">%1 = tail call <256 x
i32> @llvm.x86.tileloadd64(i16 %row, i16
%conv.i31, i8* getelementptr inbounds ([1024 x
i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3<o:p></o:p></p>
<p class="MsoNormal">%2 = tail call <256 x
i32> @llvm.x86.tileloadd64(i16 %row, i16
%conv.i31, i8* getelementptr inbounds ([1024 x
i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3<o:p></o:p></p>
<p class="MsoNormal">br label %if.end<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">if.else:
; preds = %entry<o:p></o:p></p>
<p class="MsoNormal">%3 = tail call <256 x
i32> @llvm.x86.tileloadd64(i16 %row, i16
%conv.i31, i8* getelementptr inbounds ([1024 x
i8], [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3<o:p></o:p></p>
<p class="MsoNormal">%4 = tail call <256 x
i32> @llvm.x86.tileloadd64(i16 %row, i16
%conv.i31, i8* getelementptr inbounds ([1024 x
i8], [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3<o:p></o:p></p>
<p class="MsoNormal">%5 = tail call <256 x
i32> @llvm.x86.tileloadd64(i16 %row, i16
%conv.i31, i8* getelementptr inbounds ([1024 x
i8], [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3<o:p></o:p></p>
<p class="MsoNormal">br label %if.end<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">if.end:
; preds = %if.else, %if.then<o:p></o:p></p>
<p class="MsoNormal">%a.sroa.1186.0 = phi
<256 x i32> [ %3, %if.else ], [ %0, %if.then
]<o:p></o:p></p>
<p class="MsoNormal">%b.sroa.1068.0 = phi
<256 x i32> [ %4, %if.else ], [ %1, %if.then
]<o:p></o:p></p>
<p class="MsoNormal">%c.sroa.1149.0 = phi
<256 x i32> [ %5, %if.else ], [ %2, %if.then
]<o:p></o:p></p>
<p class="MsoNormal">%6 = tail call <256 x
i32> @llvm.x86.tdpbssd(i16 %row, i16 %conv.i31,
i16 %conv.i31, <256 x i32> %c.sroa.1149.0,
<256 x i32> %a.sroa.1186.0, <256 x
i32> %b.sroa.1068.0) #3<o:p></o:p></p>
<p class="MsoNormal">tail call void
@llvm.x86.tilestored64(i16 %row, i16 %conv.i31,
i8* getelementptr inbounds ([1024 x i8], [1024 x
i8]* @buf, i64 0, i64 0), i64 32, <256 x
i32> %6) #3<o:p></o:p></p>
<p class="MsoNormal">37 ret void<o:p></o:p></p>
<p class="MsoNormal">38 }<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo4">
<!--[if !supportLists]--><span
style="mso-list:Ignore">6.<span
style="font:7.0pt "Times New Roman"">
</span></span><!--[endif]-->Shape propagation<o:p></o:p></p>
<p class="MsoNormal">In an -O0 build, the front-end
generates some general loads and stores of the tile
vector. We need to start from the AMX intrinsics and
propagate the shape information to the virtual
tile registers. If an AMX intrinsic uses the
result of a load instruction, the shape is
propagated to the load and the load is transformed
into a tile load intrinsic. If a store instruction
stores the result of an AMX intrinsic, the shape is
propagated to the store and the store is
transformed into a tile store intrinsic.<o:p></o:p></p>
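<p class="MsoNormal">The propagation described above might be sketched roughly as follows. This is only a toy model over an array of instructions, not the actual pass: the <code>inst</code> type, the opcode names and <code>propagate_shapes</code> are all hypothetical, and it assumes for simplicity that every operand of an AMX intrinsic has the same shape as the intrinsic itself.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes for a toy IR; none of these names come from LLVM. */
typedef enum { OP_LOAD, OP_STORE, OP_TILE_LOAD, OP_TILE_STORE, OP_AMX } opcode;

typedef struct {
    opcode op;
    int src[3];   /* indices of the defining instructions, -1 if unused */
    int row, col; /* tile shape; 0 while still unknown */
} inst;

/* Walk the instructions once. Every generic load feeding an AMX intrinsic and
 * every generic store of an AMX result inherits the intrinsic's shape and is
 * rewritten to its tile form. Simplifying assumption: one shape per intrinsic. */
void propagate_shapes(inst *code, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (code[i].op == OP_AMX) {
            for (int k = 0; k < 3; k++) {
                int d = code[i].src[k];
                if (d >= 0 && code[d].op == OP_LOAD) {
                    code[d].op = OP_TILE_LOAD;
                    code[d].row = code[i].row;
                    code[d].col = code[i].col;
                }
            }
        } else if (code[i].op == OP_STORE) {
            int d = code[i].src[0];
            assert(d < (int)n);
            if (d >= 0 && code[d].op == OP_AMX) {
                code[i].op = OP_TILE_STORE;
                code[i].row = code[d].row;
                code[i].col = code[d].col;
            }
        }
    }
}
```

<p class="MsoNormal">Loads and stores that are not connected to an AMX intrinsic are left untouched, which matches the idea of rooting the propagation at the intrinsics.</p>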
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo4">
<!--[if !supportLists]--><span
style="mso-list:Ignore">7.<span
style="font:7.0pt "Times New Roman"">
</span></span><!--[endif]-->Machine IR<o:p></o:p></p>
<p class="MsoNormal">Since the AMX intrinsics take
the row and column as input parameters, we can
create a pseudo instruction corresponding to each of them.
The AMX intrinsics are lowered to pseudo AMX
instructions, which carry extra row and column
operands corresponding to the intrinsic's parameters. The real
AMX instructions don’t need the row and column
operands; the row and column information must be
configured by ldtilecfg before any AMX
instruction executes.<o:p></o:p></p>
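<p class="MsoNormal">To make the relationship between a pseudo instruction and the real one concrete, here is a minimal sketch. The <code>minst</code> type, the opcode names and the operand order (dst, row, col, base, stride) are assumptions made for illustration only; they are not taken from the proposal or from LLVM.</p>

```c
#include <assert.h>
#include <string.h>

enum { MAX_OPS = 8 };

/* Hypothetical opcodes: a pseudo tile load and the real instruction it becomes. */
typedef enum { PSEUDO_TILELOADD, TILELOADD } mopcode;

typedef struct {
    mopcode op;
    int nops;
    int ops[MAX_OPS]; /* pseudo: dst, row, col, base, stride; real: dst, base, stride */
} minst;

/* Rewrite a pseudo instruction in place by dropping its row and column operands. */
void lower_pseudo(minst *mi)
{
    assert(mi->op == PSEUDO_TILELOADD && mi->nops == 5);
    /* Move base and stride (ops[3], ops[4]) down over row and col (ops[1], ops[2]). */
    memmove(&mi->ops[1], &mi->ops[3], 2 * sizeof mi->ops[0]);
    mi->nops = 3;
    mi->op = TILELOADD;
}
```

<p class="MsoNormal">The shape operands only exist so that the compiler can track them; once they are dropped, the instruction relies entirely on the configuration loaded by ldtilecfg.</p>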
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo4">
<!--[if !supportLists]--><span
style="mso-list:Ignore">8.<span
style="font:7.0pt "Times New Roman"">
</span></span><!--[endif]-->Register allocation<o:p></o:p></p>
<p class="MsoNormal">AMX registers are special. They
need to be configured before use, and the config
instruction is expensive. To avoid unnecessary
tile configuration, we collect as much tile shape
information as possible and combine it
into one ldtilecfg instruction. The ldtilecfg
instruction should dominate every AMX instruction
that accesses a tile register. On the other hand, the
ldtilecfg should post-dominate the instructions
that define the tile shape. For tile register
spills, to avoid a re-config caused by a
different tile shape, a spilled register should
be reloaded into a register that has the same
tile shape. Since tile register allocation is
special and may need to allocate general virtual
registers to configure the tile registers, we can add a
separate pass to do it before the general register
allocation pass. After register allocation, the
tile shape information is no longer needed, so
we can transform each pseudo AMX instruction into the
real AMX instruction by removing the row and
column operands.<o:p></o:p></p>
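<p class="MsoNormal">As a rough illustration of combining several tile shapes into one configuration, the sketch below fills the 64-byte memory image that a single ldtilecfg would read. It assumes the palette 1 layout as described in Intel's AMX documentation (palette in byte 0, 16-bit bytes-per-row fields starting at byte 16, 8-bit row counts starting at byte 48). The <code>tile_shape</code> type and <code>build_tile_config</code> are hypothetical helpers, not part of the proposal.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Shape of one tile register: number of rows and bytes per row.
 * Hypothetical helper type used only for this sketch. */
typedef struct {
    uint8_t rows;
    uint16_t colsb;
} tile_shape;

/* Fill the 64-byte image for ldtilecfg, assuming the palette 1 layout:
 * byte 0 = palette id, bytes 16.. = colsb (2 bytes per tile, little-endian),
 * bytes 48.. = rows (1 byte per tile). All other bytes stay zero. */
void build_tile_config(uint8_t cfg[64], const tile_shape *shapes, int n)
{
    assert(n >= 0 && n <= 8); /* palette 1 has 8 tile registers */
    memset(cfg, 0, 64);
    cfg[0] = 1;
    for (int i = 0; i < n; i++) {
        assert(shapes[i].rows <= 16 && shapes[i].colsb <= 64);
        cfg[16 + 2 * i] = (uint8_t)(shapes[i].colsb & 0xff);
        cfg[16 + 2 * i + 1] = (uint8_t)(shapes[i].colsb >> 8);
        cfg[48 + i] = shapes[i].rows;
    }
}
```

<p class="MsoNormal">Because one image describes all tile registers at once, every shape must be known at the point where the image is built, which is why the placement of ldtilecfg relative to the shape definitions matters.</p>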
</blockquote>
<p>This seems complicated.<o:p></o:p></p>
<p>Reading through the documentation, there appears to
be a single global tile config for all tile
registers at any time.<o:p></o:p></p>
<p>Why not simply model this tile config as a
designated special register and the tile
instructions as having an implicit use of this
register? That would seem to ensure that the
register allocator has all the constraints needed.
You'd need to teach it how to spill the special
registers with the appropriate instructions, but
that seems a lot more straightforward?<o:p></o:p></p>
<blockquote
style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo4">
<!--[if !supportLists]--><span
style="mso-list:Ignore">9.<span
style="font:7.0pt "Times New Roman"">
</span></span><!--[endif]-->Usage recommendations<o:p></o:p></p>
<p class="MsoNormal">Due to the shape configuration
issue, we recommend that users define the tile shape
at function entry and inline
functions as much as possible. The AMX instructions
focus on computation rather than storage, so global
variables for tile data are not recommended.<o:p></o:p></p>
<p class="MsoNormal"><span
style="font-size:10.5pt;line-height:105%"> </span><o:p></o:p></p>
<p class="MsoNormal"><span
style="font-size:10.5pt;line-height:105%">Thanks</span><o:p></o:p></p>
<p class="MsoNormal"><span
style="font-size:10.5pt;line-height:105%">Yuanke</span><o:p></o:p></p>
<pre>_______________________________________________<o:p></o:p></pre>
<pre>LLVM Developers mailing list<o:p></o:p></pre>
<pre><a href="mailto:llvm-dev@lists.llvm.org" moz-do-not-send="true">llvm-dev@lists.llvm.org</a><o:p></o:p></pre>
<pre><a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" moz-do-not-send="true">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><o:p></o:p></pre>
</blockquote>
</blockquote>
</blockquote>
<pre>-- <o:p></o:p></pre>
<pre>Hal Finkel<o:p></o:p></pre>
<pre>Lead, Compiler Technology and Programming Languages<o:p></o:p></pre>
<pre>Leadership Computing Facility<o:p></o:p></pre>
<pre>Argonne National Laboratory<o:p></o:p></pre>
</blockquote>
</blockquote>
</blockquote>
</div>
</blockquote>
<pre class="moz-signature" cols="72">--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
</body>
</html>