<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
</head>
<body>
<p><br>
</p>
<div class="moz-cite-prefix">On 8/20/20 3:50 PM, Topper, Craig
wrote:<br>
</div>
<blockquote type="cite" cite="mid:MWHPR11MB004601229570D3828D7FBCC6935A0@MWHPR11MB0046.namprd11.prod.outlook.com">
<div class="WordSection1">
<p class="MsoNormal">Ignore my spill comment for now. That’s
more of an optimization.<o:p></o:p></p>
<p class="MsoNormal">Let’s say I have a 2x3 tile and a 3x2 tile,
and I multiply them to make a 2x2 tile. I have 3 different sizes
of tiles, so my instruction uses 3 different register classes for
its virtual registers.<o:p></o:p></p>
<p class="MsoNormal">The pass that inserts the ldtilecfg needs
to configure the physical tiles, so let’s say it configures tmm0
to 2x3, tmm1 to 3x2 and tmm2 to 2x2.<o:p></o:p></p>
<p class="MsoNormal">Register classes as I know them in LLVM
have a static list of physical registers in them. So do all 3 of
the register classes for my virtual registers contain all 8
physical tmm registers? How does the register allocator know
to use tmm0 for the 2x3 virtual register, tmm1 for the 3x2
virtual register, and tmm2 for the 2x2 virtual register?<o:p></o:p></p>
<p class="MsoNormal">~Craig</p>
</div>
</blockquote>
<p><br>
</p>
<p>Ah, okay. I think I see why we're not on the same page. The
architectural definition has 8 tile registers, tmm0-tmm7, but I
was thinking that you would not model it that way. Instead, we
could have registers:</p>
<p>tmm0_1x1 ... tmm7_1x1</p>
<p>...</p>
<p>tmm0_16x16 ... tmm7_16x16</p>
<p>where tmm0_1x1 aliases tmm0_1x2, ..., tmm0_16x16, and so on.<br>
</p>
<p>and corresponding register classes RegClassTmm1x1, ...,
RegClassTmm16x16 (I don't mean to imply this exact naming
convention). So, within each region, you assign the relevant
virtual registers a register class of RegClassTmm1x1, or
whatever, and then once register allocation is done, you adjust
the ldtilecfg data for each region so that it actually gives
whatever registers were assigned the right tile sizes.</p>
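<p>To make the naming scheme above concrete, here is a minimal C
sketch of how the shaped registers, their classes, and the aliasing
rule could be modeled. This is not LLVM code: the type
ShapedTileReg and the helpers shape_class_index, tiles_alias and
num_shaped_regs are hypothetical names used only for illustration,
and a real backend would presumably express this in TableGen.</p>
<pre>
```c
#include <assert.h>
#include <stdbool.h>

enum { NUM_PHYS_TILES = 8, MAX_DIM = 16 };

/* Hypothetical shaped register, e.g. tmm3_4x2 is {3, 4, 2}. */
typedef struct {
    int phys;  /* underlying architectural tile, 0..7 */
    int rows;  /* 1..16 */
    int cols;  /* 1..16 */
} ShapedTileReg;

/* Index of the register class for a shape: 0 is 1x1, 255 is 16x16. */
int shape_class_index(int rows, int cols) {
    assert(rows >= 1 && rows <= MAX_DIM);
    assert(cols >= 1 && cols <= MAX_DIM);
    return (rows - 1) * MAX_DIM + (cols - 1);
}

/* Two shaped registers alias exactly when they name the same physical tile. */
bool tiles_alias(ShapedTileReg a, ShapedTileReg b) {
    return a.phys == b.phys;
}

/* Total number of shaped registers across all classes. */
int num_shaped_regs(void) {
    return NUM_PHYS_TILES * MAX_DIM * MAX_DIM;
}
```
</pre>
<p>Under these assumptions there are 256 classes of 8 registers
each, and allocating tmm0_2x3 makes every other tmm0_* register
unavailable, which is the property the aliasing is meant to
provide.</p>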
<p>You would not want to have N^2 versions of all of the instructions
either, but I think you can just have the instructions defined to
take some overall register class (containing all of the registers)
and then call constrainRegClass in the
configuration-placement pass.</p>
<p>Thinking about it more, however, maybe having the different physical
registers isn't actually needed. If you know which tile config
each register needs based on the instructions, maybe you can have
only 8 of them and just update the ldtilecfg based on the usage
information after allocation regardless.</p>
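<p>A rough C sketch of that last idea: once allocation has decided
which shape each of the 8 physical tiles holds, the 64-byte block
passed to ldtilecfg can be filled in afterwards. The struct layout
below follows my understanding of the palette 1 format in the Intel
manual and should be checked against it; the names TileConfig,
AssignedShape and build_tile_config are hypothetical, and the
allocator's result is simply assumed to be available as an
array.</p>
<pre>
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { NUM_TMM = 8 };

/* 64-byte ldtilecfg memory image (assumed palette 1 layout). */
typedef struct {
    uint8_t  palette_id;    /* 1 selects the 8-tile palette */
    uint8_t  start_row;     /* 0 when not restarting an interrupted op */
    uint8_t  reserved[14];  /* must be zero */
    uint16_t colsb[16];     /* bytes per row; entries 0..7 are used */
    uint8_t  rows[16];      /* row count; entries 0..7 are used */
} TileConfig;

/* Shape the (hypothetical) allocation result assigned to one physical tile. */
typedef struct {
    uint8_t  rows;   /* 0 means the tile is unused */
    uint16_t colsb;  /* row width in bytes */
} AssignedShape;

/* Fill the config block from the per-tile shapes chosen after allocation. */
void build_tile_config(TileConfig *cfg, const AssignedShape shapes[NUM_TMM]) {
    memset(cfg, 0, sizeof *cfg);
    cfg->palette_id = 1;
    for (int i = 0; i < NUM_TMM; ++i) {
        assert(shapes[i].rows <= 16 && shapes[i].colsb <= 64);
        cfg->rows[i] = shapes[i].rows;
        cfg->colsb[i] = shapes[i].colsb;
    }
}
```
</pre>
<p>For the 2x3, 3x2, 2x2 example, and assuming 4-byte elements,
tmm0 would get rows 2 and colsb 12, tmm1 rows 3 and colsb 8, and
tmm2 rows 2 and colsb 8, with the remaining tiles left at
zero.</p>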
<p> -Hal<br>
</p>
<p><br>
</p>
<blockquote type="cite"
cite="mid:MWHPR11MB004601229570D3828D7FBCC6935A0@MWHPR11MB0046.namprd11.prod.outlook.com">
<div class="WordSection1">
<p class="MsoNormal"><o:p></o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<b>From:</b> Hal Finkel <a class="moz-txt-link-rfc2396E" href="mailto:hfinkel@anl.gov"><hfinkel@anl.gov></a> <br>
<b>Sent:</b> Thursday, August 20, 2020 1:27 PM<br>
<b>To:</b> Topper, Craig <a class="moz-txt-link-rfc2396E" href="mailto:craig.topper@intel.com"><craig.topper@intel.com></a>;
Kaylor, Andrew <a class="moz-txt-link-rfc2396E" href="mailto:andrew.kaylor@intel.com"><andrew.kaylor@intel.com></a>; Luo,
Yuanke <a class="moz-txt-link-rfc2396E" href="mailto:yuanke.luo@intel.com"><yuanke.luo@intel.com></a>; Philip Reames
<a class="moz-txt-link-rfc2396E" href="mailto:listmail@philipreames.com"><listmail@philipreames.com></a>;
<a class="moz-txt-link-abbreviated" href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>; <a class="moz-txt-link-abbreviated" href="mailto:florian_hahn@apple.com">florian_hahn@apple.com</a>; Lu,
Hongjiu <a class="moz-txt-link-rfc2396E" href="mailto:hongjiu.lu@intel.com"><hongjiu.lu@intel.com></a><br>
<b>Subject:</b> Re: [llvm-dev] Intel AMX programming model
discussion.<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<p><o:p> </o:p></p>
<div>
<p class="MsoNormal">On 8/20/20 2:47 PM, Topper, Craig wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">I think I’m still missing something here.
The configuration is per tile. The multiply instructions
take an MxK tile and multiply it by a KxN tile and accumulate
into an MxN tile. So the configuration needs to know how
many of each size of tile it needs to avoid a spill.
Wouldn’t the register allocator then need to know which
physical tiles have been configured to which sizes so that
it only chooses those tiles for an operand that needs that
size?<o:p></o:p></p>
</blockquote>
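<p>As a small illustration of the shape bookkeeping in the quoted
question, the C sketch below derives the three operand shapes of one
multiply from M, K and N and counts how many distinct shapes the
configuration would need to provide. It treats shapes as abstract
rows x columns and ignores any element packing the real
instructions use; TileShape, MatmulShapes, matmul_shapes and
count_distinct_shapes are hypothetical names.</p>
<pre>
```c
#include <stdbool.h>

typedef struct { int rows; int cols; } TileShape;

/* Shapes of the two sources and the accumulator of one multiply. */
typedef struct { TileShape lhs; TileShape rhs; TileShape acc; } MatmulShapes;

static bool same_shape(TileShape a, TileShape b) {
    return a.rows == b.rows && a.cols == b.cols;
}

/* An MxK tile times a KxN tile accumulates into an MxN tile. */
MatmulShapes matmul_shapes(int m, int k, int n) {
    MatmulShapes s = { {m, k}, {k, n}, {m, n} };
    return s;
}

/* Number of different tile shapes one multiply needs configured (1 to 3). */
int count_distinct_shapes(MatmulShapes s) {
    int count = 1;  /* lhs always contributes one shape */
    if (!same_shape(s.rhs, s.lhs)) count++;
    if (!same_shape(s.acc, s.lhs) && !same_shape(s.acc, s.rhs)) count++;
    return count;
}
```
</pre>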
<p><o:p> </o:p></p>
<p>Yes, I think so. But it will, because that information is
essentially encoded in the virtual register classes. I
certainly could be missing something. It seems like you first
figure that out, and then you assign virtual tile registers
corresponding to the correct tile sizes. Perhaps this comes
down to what you mean by "avoid a spill." We still might
spill, and I assume that the infrastructure always needs to
deal with that. We should continue to do instruction
scheduling in order to minimize register pressure. Once we
assign the right virtual register classes to the AMX
instructions, shouldn't this automatically happen? If we do
spill, since none of the original live ranges cross the
ldtilecfg, then there shouldn't be any fundamental issue with
using a regular load/store spill implementation.<o:p></o:p></p>
<p>I'm definitely not an expert in this instruction set, so I
may just not understand some aspect of this. If there's
something I'm overlooking, a little example would be helpful.<o:p></o:p></p>
<p>Thanks again,<o:p></o:p></p>
<p>Hal<o:p></o:p></p>
<p><o:p> </o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">~Craig<o:p></o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<b>From:</b> Hal Finkel <a
href="mailto:hfinkel@anl.gov" moz-do-not-send="true"><hfinkel@anl.gov></a>
<br>
<b>Sent:</b> Thursday, August 20, 2020 12:35 PM<br>
<b>To:</b> Topper, Craig <a
href="mailto:craig.topper@intel.com"
moz-do-not-send="true"><craig.topper@intel.com></a>;
Kaylor, Andrew
<a href="mailto:andrew.kaylor@intel.com"
moz-do-not-send="true"><andrew.kaylor@intel.com></a>;
Luo, Yuanke
<a href="mailto:yuanke.luo@intel.com"
moz-do-not-send="true"><yuanke.luo@intel.com></a>;
Philip Reames <a
href="mailto:listmail@philipreames.com"
moz-do-not-send="true">
<listmail@philipreames.com></a>; <a
href="mailto:llvm-dev@lists.llvm.org"
moz-do-not-send="true">llvm-dev@lists.llvm.org</a>;
<a href="mailto:florian_hahn@apple.com"
moz-do-not-send="true">florian_hahn@apple.com</a>; Lu,
Hongjiu <a href="mailto:hongjiu.lu@intel.com"
moz-do-not-send="true">
<hongjiu.lu@intel.com></a><br>
<b>Subject:</b> Re: [llvm-dev] Intel AMX programming
model discussion.<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<p> <o:p></o:p></p>
<div>
<p class="MsoNormal">On 8/19/20 3:09 PM, Topper, Craig
wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">The width and height can be runtime
values that we would just copy into the 64-byte configuration
block we pass to ldtilecfg. So the code doesn’t need to be
multiversioned. The user code would also use those values
to update pointers in the loops they write using the
tiles. If we can’t determine that two tiles were defined
with the same width and height, we need to assume the shapes
are different and try to avoid ever assigning them the same tile.<o:p></o:p></p>
<p class="MsoNormal">Hal, with your suggestion, would the
physical registers in each register class be defined
dynamically before register allocation?<o:p></o:p></p>
</blockquote>
<p> <o:p></o:p></p>
<p>Here's my thought:<o:p></o:p></p>
<p>First, you have a set of intrinsics that take tile values
along with tile configuration parameters (which, presently,
seem just to be the sizes). These get lowered into
pseudo-instructions that do the same. Thus, you have some
register class that represents these arbitrarily-sized tile
registers that you'll assign to these pseudo-instruction
operands (i.e., they take virtual tile registers right after
instruction selection). You might use the 16x16 tile
register class for this purpose, but it shouldn't really
matter.<o:p></o:p></p>
<p>Second, you run this configuration-placement pass. This
pass looks at all of the AMX pseudo-instructions and
identifies regions in which the pseudo-instructions use the
same configuration parameters (i.e., the same SSA values
and/or constants). This pass might reorder the
pseudo-instructions when legal in order to form larger
regions. Then it places the ldtilecfg at the start of each
region (in some common dominating position). ldtilecfg
implicitly defines all of the tile registers in every
concrete class of tile registers (all 256 of them, or
whatever). The pseudo-instructions are replaced by real MI
instructions taking a tile register class appropriate for
the configuration (which will default to the 16x16 class for
cases where the configuration is not a compile-time-known
constant). When the configuration is a known constant, the
instructions take operands with a register class appropriate
for that configuration (e.g., 1x1, 4x4).<o:p></o:p></p>
<p>Third, the rest of the framework runs as usual. Tile
registers from the appropriate class are allocated by the
register allocator. No live range of any virtual tile
register can pass through the ldtilecfg (because it defines
them all), but that's okay: none of the live ranges will, by
construction (the configuration-placement pass ensures
this).<o:p></o:p></p>
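<p>The region-forming part of the second step can be sketched in C
under heavy simplifying assumptions: a single straight-line sequence
of AMX pseudo-instructions, no reordering, and each instruction's
configuration parameters already reduced to one integer key (equal
keys meaning the same SSA values and/or constants). The function
place_configs and its parameters are hypothetical; a real pass would
work on the CFG and pick a common dominating position.</p>
<pre>
```c
#include <stddef.h>

/*
 * config_key[i] identifies the configuration used by instruction i.
 * needs_cfg[i] is set to 1 if an ldtilecfg would be placed right
 * before instruction i, i.e. where a new region starts.
 * Returns the number of regions (and thus of ldtilecfg instructions).
 */
size_t place_configs(const int *config_key, size_t n,
                     unsigned char *needs_cfg) {
    size_t regions = 0;
    for (size_t i = 0; i < n; ++i) {
        int starts = (i == 0) || (config_key[i] != config_key[i - 1]);
        needs_cfg[i] = (unsigned char)starts;
        regions += (size_t)starts;
    }
    return regions;
}
```
</pre>
<p>With keys 7, 7, 9, 9, 9, 7 this yields three regions. Reordering
the last instruction next to the first two, when legal, would reduce
that to two, which is the motivation for letting the pass reorder
pseudo-instructions.</p>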
<p>Thanks again,<o:p></o:p></p>
<p>Hal<o:p></o:p></p>
<p> <o:p></o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<b>From:</b> Hal Finkel <a
href="mailto:hfinkel@anl.gov" moz-do-not-send="true"><hfinkel@anl.gov></a>
<br>
<b>Sent:</b> Wednesday, August 19, 2020 12:52 PM<br>
<b>To:</b> Kaylor, Andrew <a
href="mailto:andrew.kaylor@intel.com"
moz-do-not-send="true"><andrew.kaylor@intel.com></a>;
Luo, Yuanke
<a href="mailto:yuanke.luo@intel.com"
moz-do-not-send="true"><yuanke.luo@intel.com></a>;
Philip Reames <a
href="mailto:listmail@philipreames.com"
moz-do-not-send="true">
<listmail@philipreames.com></a>; <a
href="mailto:llvm-dev@lists.llvm.org"
moz-do-not-send="true">llvm-dev@lists.llvm.org</a>;
<a href="mailto:florian_hahn@apple.com"
moz-do-not-send="true">florian_hahn@apple.com</a>;
Topper, Craig
<a href="mailto:craig.topper@intel.com"
moz-do-not-send="true"><craig.topper@intel.com></a>;
Lu, Hongjiu
<a href="mailto:hongjiu.lu@intel.com"
moz-do-not-send="true"><hongjiu.lu@intel.com></a><br>
<b>Subject:</b> Re: [llvm-dev] Intel AMX programming
model discussion.<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<p> <o:p></o:p></p>
<div>
<p class="MsoNormal">On 8/19/20 10:24 AM, Kaylor, Andrew
wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p>> When the tile shape is unknown at compile time,
how do you plan to do the register allocation of the
tiles? My question is: do you do the allocation for this
case in the same way as you would if you knew the size
was 16x16 (i.e., conservatively assume the largest
size)?<o:p></o:p></p>
<p class="MsoNormal">I think what will happen is that the
registers are allocated based on a number of runtime
values that are assumed to be different from one another
but less than or equal to 16. So, for example, we’ll
allocate registers for MxN tiles, NxM tiles and MxM
tiles without knowing what M and N are. Then at runtime
the values of these variables will be used to create the
actual tile configuration. The instructions that need to
know the shape take these runtime values as operands.<o:p></o:p></p>
</blockquote>
<p> <o:p></o:p></p>
<p>So you're going to multiversion the code?<o:p></o:p></p>
<p>In any case, my point is that you probably don't need a
custom register allocator. If you just define the tile
registers and make sure that the ldtilecfg instructions implicitly
define them all, then the regular infrastructure likely
works. You'll have a bunch of register classes, but that's
not necessarily a problem. I recommend trying this, and
letting us know what you discover, before we go down the road
of a new, dedicated allocator just for these registers.<o:p></o:p></p>
<p> -Hal<o:p></o:p></p>
<p> <o:p></o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">There may be some artifacts coming
from the front end that conservatively assume a 16x16
tile, but I think those generally go away in SROA or
later specialized passes. Yuanke can confirm or correct
my understanding of this.<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<b>From:</b> Hal Finkel <a
href="mailto:hfinkel@anl.gov"
moz-do-not-send="true"><hfinkel@anl.gov></a>
<br>
<b>Sent:</b> Wednesday, August 19, 2020 5:14 AM<br>
<b>To:</b> Luo, Yuanke <a
href="mailto:yuanke.luo@intel.com"
moz-do-not-send="true"><yuanke.luo@intel.com></a>;
Kaylor, Andrew
<a href="mailto:andrew.kaylor@intel.com"
moz-do-not-send="true"><andrew.kaylor@intel.com></a>;
Philip Reames
<a href="mailto:listmail@philipreames.com"
moz-do-not-send="true"><listmail@philipreames.com></a>;
<a href="mailto:llvm-dev@lists.llvm.org"
moz-do-not-send="true">
llvm-dev@lists.llvm.org</a>; <a
href="mailto:florian_hahn@apple.com"
moz-do-not-send="true">florian_hahn@apple.com</a>;
Topper, Craig
<a href="mailto:craig.topper@intel.com"
moz-do-not-send="true"><craig.topper@intel.com></a>;
Lu, Hongjiu
<a href="mailto:hongjiu.lu@intel.com"
moz-do-not-send="true"><hongjiu.lu@intel.com></a><br>
<b>Subject:</b> Re: [llvm-dev] Intel AMX programming
model discussion.<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<p> <o:p></o:p></p>
<div>
<p class="MsoNormal">On 8/19/20 5:34 AM, Luo, Yuanke
wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">There is no problem with having 256
register classes; it just seems like a lot of register classes to
me.<o:p></o:p></p>
<p class="MsoNormal">We don’t assume the shape of each
physical register is 16x16; it is defined by the user. By
variable shape, I mean the shape is known at run time
and unknown at compile time. Take the code below
as an example: %row and %col are variables
rather than constants. The compiler recognizes
llvm.x86.tileloadd64 and deduces that the shape of %0 is
%row x %col.<o:p></o:p></p>
<p class="MsoNormal">%0 = tail call <256 x i32>
@llvm.x86.tileloadd64(i16 %row, i16 %col, i8*
getelementptr inbounds ([1024 x i8], [1024 x i8]*
@buf, i64 0, i64 0), i64 32)<o:p></o:p></p>
</blockquote>
<p> <o:p></o:p></p>
<p>When the tile shape is unknown at compile time, how do
you plan to do the register allocation of the tiles? My
question is: do you do the allocation for this case in
the same way as you would if you knew the size was 16x16
(i.e., conservatively assume the largest size)?<o:p></o:p></p>
<p>Thanks again,<o:p></o:p></p>
<p>Hal<o:p></o:p></p>
<p> <o:p></o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<b>From:</b> Hal Finkel <a
href="mailto:hfinkel@anl.gov"
moz-do-not-send="true"><hfinkel@anl.gov></a>
<br>
<b>Sent:</b> Wednesday, August 19, 2020 4:58 PM<br>
<b>To:</b> Luo, Yuanke <a
href="mailto:yuanke.luo@intel.com"
moz-do-not-send="true"><yuanke.luo@intel.com></a>;
Kaylor, Andrew
<a href="mailto:andrew.kaylor@intel.com"
moz-do-not-send="true"><andrew.kaylor@intel.com></a>;
Philip Reames
<a href="mailto:listmail@philipreames.com"
moz-do-not-send="true"><listmail@philipreames.com></a>;
<a href="mailto:llvm-dev@lists.llvm.org"
moz-do-not-send="true">
llvm-dev@lists.llvm.org</a>; <a
href="mailto:florian_hahn@apple.com"
moz-do-not-send="true">florian_hahn@apple.com</a>;
Topper, Craig
<a href="mailto:craig.topper@intel.com"
moz-do-not-send="true"><craig.topper@intel.com></a>;
Lu, Hongjiu
<a href="mailto:hongjiu.lu@intel.com"
moz-do-not-send="true"><hongjiu.lu@intel.com></a><br>
<b>Subject:</b> Re: [llvm-dev] Intel AMX
programming model discussion.<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<p> <o:p></o:p></p>
<div>
<p class="MsoNormal">On 8/19/20 2:21 AM, Luo, Yuanke
wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">Hi Hal,<o:p></o:p></p>
<p class="MsoNormal">There are 3 aspects to be solved. <o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l1
level1 lfo2">
<!--[if !supportLists]--><span
style="mso-list:Ignore">1.<span style="font:7.0pt
"Times New Roman"">
</span></span><!--[endif]-->The HW supports a max
shape of 16x16, so there are many register classes from
1x1 to 16x16. We need 256 register classes.
<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l1
level1 lfo2">
<!--[if !supportLists]--><span
style="mso-list:Ignore">2.<span style="font:7.0pt
"Times New Roman"">
</span></span><!--[endif]-->We want to support
variable shapes, so the compiler doesn’t know which register
class fits the tile shape, as it is only known at
run time.<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l1
level1 lfo2">
<!--[if !supportLists]--><span
style="mso-list:Ignore">3.<span style="font:7.0pt
"Times New Roman"">
</span></span><!--[endif]-->The tile configuration
configures physical tile registers, so we need to
allocate registers first; then we know the shape of each
physical tile register and can configure the tile
registers.<o:p></o:p></p>
<p class="MsoNormal">I think your suggestion helps
reduce the complexity if we only support
fixed (constant) tile shapes.<o:p></o:p></p>
<p class="MsoNormal">-Yuanke<o:p></o:p></p>
</blockquote>
<p> <o:p></o:p></p>
<p>Thanks, Yuanke.<o:p></o:p></p>
<p>It's not clear to me that having 256 register classes
is, in itself, a problem. Is it?<o:p></o:p></p>
<p>What does it mean to support variable-shape tiles in
this context? Do you do something other than
conservatively assume that they are 16x16 for
register-allocation purposes?<o:p></o:p></p>
<p> -Hal<o:p></o:p></p>
<p> <o:p></o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<b>From:</b> Hal Finkel <a
href="mailto:hfinkel@anl.gov"
moz-do-not-send="true"><hfinkel@anl.gov></a>
<br>
<b>Sent:</b> Wednesday, August 19, 2020 8:20 AM<br>
<b>To:</b> Kaylor, Andrew <a
href="mailto:andrew.kaylor@intel.com"
moz-do-not-send="true"><andrew.kaylor@intel.com></a>;
Philip Reames
<a href="mailto:listmail@philipreames.com"
moz-do-not-send="true"><listmail@philipreames.com></a>;
Luo, Yuanke
<a href="mailto:yuanke.luo@intel.com"
moz-do-not-send="true"><yuanke.luo@intel.com></a>;
<a href="mailto:llvm-dev@lists.llvm.org"
moz-do-not-send="true">
llvm-dev@lists.llvm.org</a>; <a
href="mailto:florian_hahn@apple.com"
moz-do-not-send="true">florian_hahn@apple.com</a>;
Topper, Craig
<a href="mailto:craig.topper@intel.com"
moz-do-not-send="true"><craig.topper@intel.com></a>;
Lu, Hongjiu
<a href="mailto:hongjiu.lu@intel.com"
moz-do-not-send="true"><hongjiu.lu@intel.com></a><br>
<b>Subject:</b> Re: [llvm-dev] Intel AMX
programming model discussion.<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<p>Hi, Andy,<o:p></o:p></p>
<p>I don't quite understand everything that's going on
here. Could we model this as:<o:p></o:p></p>
<p> 1. Define a collection of register classes, one
for 2x4 tiles, one for 4x2 tiles, etc. each
populated with a set of tile registers. Registers
can have aliasing relationships (instead of worrying
about any kind of subregister/superregister
relationships -- these won't be useful anyway).<o:p></o:p></p>
<p> 2. Define the tile-configuration instructions so
that they implicitly define all of the registers in
all of the classes.<o:p></o:p></p>
<p>Then you would still need to pre-schedule the tile
operations as you've described, and collect the
configuration information in order to add the
ldtilecfgs, but the regular register allocator can
handle the allocation itself in the usual way. What
do you think?<o:p></o:p></p>
<p> -Hal<o:p></o:p></p>
<div>
<p class="MsoNormal">On 8/18/20 6:58 PM, Kaylor,
Andrew via llvm-dev wrote:<o:p></o:p></p>
</div>
<blockquote
style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
The AMX registers are complicated. The single
configuration register (which is mostly used
implicitly, similar to MXCSR for floating point)
controls the shape of all the tile registers, and
if you change the tile configuration every single
tile register is cleared. In practice, if we have
to change the configuration while any of the
tile registers are live, performance is going to
be terrible. We need to handle this case for
correctness, but users of this programming
interface will need to have enough awareness of
the performance issues and the hardware details to
prevent this. We’ll also want a diagnostic that
lets the user know when this has happened.<o:p></o:p></p>
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<o:p></o:p></p>
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
When the tile configuration is set, the shape of
each tile is locked in, so the individual tile
registers aren’t interchangeable at that point. If
a function needs 2x4 tiles, 4x2 tiles, and 4x4
tiles, the configuration needs to be set with this
in mind. The shape isn’t explicit in every
instruction and intrinsic. It must be deduced. And
again, we’ll need a way to tell the user when
efficient allocation can’t be done. In practice, I
don’t expect any function to be using more than
three tile shapes.<o:p></o:p></p>
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<o:p></o:p></p>
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
The implication of all this is that I don’t think
the greedy register allocator is well suited to
figure all of this out. We need a special pass to
pre-allocate these registers. If the function is
written in a way that makes good performance
possible, it should be a relatively simple task to
allocate everything with minimal spilling. If it
isn’t possible to get good performance, we don’t
need to do anything especially clever. We can just
do something straightforward that is correct and
let the user know that they aren’t going to be
happy with the results.<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">-Andy<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<b>From:</b> Philip Reames <a
href="mailto:listmail@philipreames.com"
moz-do-not-send="true"><listmail@philipreames.com></a>
<br>
<b>Sent:</b> Friday, August 14, 2020 8:29 PM<br>
<b>To:</b> Luo, Yuanke <a
href="mailto:yuanke.luo@intel.com"
moz-do-not-send="true"><yuanke.luo@intel.com></a>;
<a href="mailto:llvm-dev@lists.llvm.org"
moz-do-not-send="true">llvm-dev@lists.llvm.org</a>;
<a href="mailto:florian_hahn@apple.com"
moz-do-not-send="true">
florian_hahn@apple.com</a>; Kaylor, Andrew <a
href="mailto:andrew.kaylor@intel.com"
moz-do-not-send="true">
<andrew.kaylor@intel.com></a>; Topper,
Craig <a href="mailto:craig.topper@intel.com"
moz-do-not-send="true">
<craig.topper@intel.com></a>; Lu,
Hongjiu <a href="mailto:hongjiu.lu@intel.com"
moz-do-not-send="true"><hongjiu.lu@intel.com></a><br>
<b>Subject:</b> Re: [llvm-dev] Intel AMX
programming model discussion.<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<p>I find your answer unconvincing. I'm not going
to debate it as I don't wish to take the time to
build the appropriate context, but my initial
response is skepticism.<o:p></o:p></p>
<p>Philip<o:p></o:p></p>
<div>
<p class="MsoNormal">On 8/14/20 4:49 PM, Luo,
Yuanke wrote:<o:p></o:p></p>
</div>
<blockquote
style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">[Yuanke] The AMX registers are
special. They need to be configured before use,
and the config instruction is expensive. To
avoid unnecessary tile configuration, we collect as much
tile shape information as possible and
combine it into one ldtilecfg instruction. The
ldtilecfg instruction should dominate any AMX
instruction that accesses a tile register. On the
other side, the ldtilecfg should post-dominate
the instructions that define the tile shape. A
tile register spill should avoid a re-config
due to a different tile shape: the spilled
register should be reloaded into a register that
shares the same tile shape. Since tile register
allocation is special and may allocate
general virtual registers to configure the tile
registers, we can add a separate pass to do it
before the general register allocation pass. After
register allocation, the tile shape information
is not needed anymore, so we can transform the
pseudo AMX instructions into real AMX instructions
by removing the row and column operands.<o:p></o:p></p>
<p>[Philip]<o:p></o:p></p>
<p>This seems complicated.<o:p></o:p></p>
<p>Reading through the documentation, there
appears to be a single global tile config for
all tile registers at any time.<o:p></o:p></p>
<p>Why not simply model this tile config as a
designated special register and the tile
instructions as having an implicit use of this
register? That would seem to ensure that the
register allocator has all the constraints
needed. You'd need to teach it how to spill the
special registers with the appropriate
instructions, but that seems a lot more
straightforward?<o:p></o:p></p>
<p class="MsoNormal"><span
style="font-size:10.5pt;line-height:105%">[Yuanke]
In that case the user needs to configure the tile
registers themselves. Spilling the configuration
register is very expensive, because it clears
all the tile data registers to zero. In our
proposal, the compiler is responsible for deducing
the shape of each virtual tile data register,
allocating physical registers for them and then
configuring those physical registers. We may
build the dependency as you proposed, and it
can be used for a machine IR check to ensure that each
tile data register is configured before use. </span><o:p></o:p></p>
<p class="MsoNormal"><span
style="font-size:10.5pt;line-height:105%"> </span><o:p></o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<b>From:</b> Philip Reames <a
href="mailto:listmail@philipreames.com"
moz-do-not-send="true"><listmail@philipreames.com></a>
<br>
<b>Sent:</b> Saturday, August 15, 2020 1:17
AM<br>
<b>To:</b> Luo, Yuanke <a
href="mailto:yuanke.luo@intel.com"
moz-do-not-send="true"><yuanke.luo@intel.com></a>;
<a href="mailto:llvm-dev@lists.llvm.org"
moz-do-not-send="true">llvm-dev@lists.llvm.org</a>;
<a href="mailto:florian_hahn@apple.com"
moz-do-not-send="true">
florian_hahn@apple.com</a>; Kaylor, Andrew
<a href="mailto:andrew.kaylor@intel.com"
moz-do-not-send="true">
<andrew.kaylor@intel.com></a>;
Topper, Craig <a
href="mailto:craig.topper@intel.com"
moz-do-not-send="true">
<craig.topper@intel.com></a>; Lu,
Hongjiu <a
href="mailto:hongjiu.lu@intel.com"
moz-do-not-send="true"><hongjiu.lu@intel.com></a><br>
<b>Subject:</b> Re: [llvm-dev] Intel AMX
programming model discussion.<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<p> <o:p></o:p></p>
<div>
<p class="MsoNormal">On 8/14/20 6:27 AM, Luo,
Yuanke via llvm-dev wrote:<o:p></o:p></p>
</div>
<blockquote
style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">Hi,<o:p></o:p></p>
<p class="MsoNormal">Intel Advanced Matrix
Extensions (Intel AMX) is a new programming
paradigm consisting of two components: a set
of 2-dimensional registers (tiles)
representing sub-arrays from a larger
2-dimensional memory image, and accelerators
able to operate on tiles. The capabilities of an Intel
AMX implementation are enumerated by palettes.
Two palettes are supported: palette 0
represents the initialized state and palette 1
consists of 8 tile registers of up to 1 KB
each, which are controlled by a tile control
register.<o:p></o:p></p>
<p class="MsoNormal">The instruction manual is
posted at <a
href="https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html"
moz-do-not-send="true">
https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html</a>.<o:p></o:p></p>
<p class="MsoNormal">The AMX ABI proposal is
posted at <a
href="https://groups.google.com/g/x86-64-abi/c/NRejFm7pwb4"
moz-do-not-send="true">
https://groups.google.com/g/x86-64-abi/c/NRejFm7pwb4</a>.<o:p></o:p></p>
<p class="MsoNormal">This email is to discuss
the programming model for AMX. Florian has
introduced the matrix type and intrinsics to
the LLVM community. We’d like to adopt some ideas
from that work.<o:p></o:p></p>
<p class="MsoNormal">Here is what we propose for
the AMX programming model.<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo4">
<!--[if !supportLists]--><span
style="mso-list:Ignore">1.<span
style="font:7.0pt "Times New
Roman"">
</span></span><!--[endif]--> Data type. <o:p></o:p></p>
<p class="MsoNormal">We’d like to have a fixed
vector type for AMX. Since the shape of an AMX
register is configurable, the vector size
is the maximum size of an AMX register, that
is, 1024 bytes.<o:p></o:p></p>
<p class="MsoNormal">The C code may look like
this.<o:p></o:p></p>
<p class="MsoNormal">typedef int _tile_data
__attribute__((__vector_size__(1024),
__aligned__(64)));<o:p></o:p></p>
<p class="MsoNormal">_tile_data tile;<o:p></o:p></p>
<p class="MsoNormal">And the LLVM IR may look
like this.<o:p></o:p></p>
<p class="MsoNormal">@tile = dso_local
local_unnamed_addr global <256 x i32>
zeroinitializer, align 64<o:p></o:p></p>
<p class="MsoNormal">For LLVM IR, it would be nice to
have a new type, x86_amxtile, that can be mapped
to AMX registers.<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo4">
<!--[if !supportLists]--><span
style="mso-list:Ignore">2.<span
style="font:7.0pt "Times New
Roman"">
</span></span><!--[endif]-->AMX Intrinsics.
<o:p></o:p></p>
<p class="MsoNormal">The internal intrinsics are
mapped 1:1 to AMX instructions. The parameters
m, n, and k identify the shape of the tile. The
shape can be variable, but it cannot exceed
the size that the AMX hardware supports. The compiler
can deduce the shape of a tile from the AMX
intrinsics.<o:p></o:p></p>
<p class="MsoNormal" style="text-indent:5.5pt">_tile_data
_tile_loadd_internal(char m, short n, const
void *base, int stride);<o:p></o:p></p>
<p class="MsoNormal">_tile_data
_tile_dpbssd_internal(char m, short n, short
k, _tile_data dst, _tile_data src1, _tile_data
src2);<o:p></o:p></p>
<p class="MsoNormal">_tile_data
_tile_dpbf16ps_internal(char m, short n, short
k, _tile_data dst, _tile_data src1, _tile_data
src2);<o:p></o:p></p>
<p class="MsoNormal">void
_tile_stored_internal(char m, short n, void
*base, int stride, _tile_data tile);<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo4">
<!--[if !supportLists]--><span
style="mso-list:Ignore">3.<span
style="font:7.0pt "Times New
Roman"">
</span></span><!--[endif]-->User interfaces.<o:p></o:p></p>
<p class="MsoNormal">The tile shape and tile
data are combined into a struct in C.
The shape of a tile may be initialized
only once. The user interface looks like
this.<o:p></o:p></p>
<p class="MsoNormal">#define
__DEFAULT_FN_AMX \<o:p></o:p></p>
<p class="MsoNormal">
__attribute__((__always_inline__, __nodebug__,
__target__("amx-int8")))<o:p></o:p></p>
<p class="MsoNormal">typedef struct
__tile_str {<o:p></o:p></p>
<p class="MsoNormal">const char row;<o:p></o:p></p>
<p class="MsoNormal">const short col;<o:p></o:p></p>
<p class="MsoNormal">_tile_data tile;<o:p></o:p></p>
<p class="MsoNormal">}__tile;<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">__DEFAULT_FN_AMX<o:p></o:p></p>
<p class="MsoNormal">void __tile_loadd(__tile
*dst, const void *base, long stride) {<o:p></o:p></p>
<p class="MsoNormal">dst->tile =
_tile_loadd_internal(dst->row, dst->col,
base, stride);<o:p></o:p></p>
<p class="MsoNormal">}<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">__DEFAULT_FN_AMX<o:p></o:p></p>
<p class="MsoNormal">void
__tile_dpbsud(__tile *dst, __tile src1, __tile
src2) {<o:p></o:p></p>
<p class="MsoNormal">dst->tile =
_tile_dpbssd_internal(src1.row, src2.col,
src1.col, dst->tile, src1.tile, src2.tile);<o:p></o:p></p>
<p class="MsoNormal">}<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">__DEFAULT_FN_AMX<o:p></o:p></p>
<p class="MsoNormal">void __tile_stored(void
*base, long stride, __tile src) {<o:p></o:p></p>
<p class="MsoNormal">
_tile_stored_internal(src.row, src.col, base,
stride, src.tile);<o:p></o:p></p>
<p class="MsoNormal">}<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo4">
<!--[if !supportLists]--><span
style="mso-list:Ignore">4.<span
style="font:7.0pt "Times New
Roman"">
</span></span><!--[endif]-->Example code<o:p></o:p></p>
<p class="MsoNormal">The example shows how to
use the user interface in a function.
<o:p></o:p></p>
<p class="MsoNormal">void api(int cond,
short row, short col) {<o:p></o:p></p>
<p class="MsoNormal">__tile a = {row, col};<o:p></o:p></p>
<p class="MsoNormal">__tile b = {row, col};<o:p></o:p></p>
<p class="MsoNormal">__tile c = {row, col};<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">if(cond) {<o:p></o:p></p>
<p class="MsoNormal">__tile_loadd(&a,
buf, STRIDE);<o:p></o:p></p>
<p class="MsoNormal">__tile_loadd(&b,
buf, STRIDE);<o:p></o:p></p>
<p class="MsoNormal">__tile_loadd(&c,
buf, STRIDE);<o:p></o:p></p>
<p class="MsoNormal">} else {<o:p></o:p></p>
<p class="MsoNormal">__tile_loadd(&a,
buf2, STRIDE);<o:p></o:p></p>
<p class="MsoNormal">__tile_loadd(&b,
buf2, STRIDE);<o:p></o:p></p>
<p class="MsoNormal">__tile_loadd(&c,
buf2, STRIDE);<o:p></o:p></p>
<p class="MsoNormal">}<o:p></o:p></p>
<p class="MsoNormal"><span lang="IT">__tile_dpbsud(&c, a, b);</span><o:p></o:p></p>
<p class="MsoNormal">__tile_stored(buf,
STRIDE, c);<o:p></o:p></p>
<p class="MsoNormal">}<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo4">
<!--[if !supportLists]--><span
style="mso-list:Ignore">5.<span
style="font:7.0pt "Times New
Roman"">
</span></span><!--[endif]-->LLVM IR<o:p></o:p></p>
<p class="MsoNormal">The LLVM IR intrinsics take
the row and column information as input
parameters, so that the compiler can deduce the
shape of the tile data. The remaining parameters
are what the AMX instructions require. This is the
LLVM IR corresponding to the example code.<o:p></o:p></p>
<p class="MsoNormal">define dso_local void
@api(i32 %cond, i16 signext %row, i16 signext
%col) local_unnamed_addr #2 {<o:p></o:p></p>
<p class="MsoNormal">entry:<o:p></o:p></p>
<p class="MsoNormal">%tobool = icmp eq i32
%cond, 0<o:p></o:p></p>
<p class="MsoNormal">%sext = shl i16 %col,
8<o:p></o:p></p>
<p class="MsoNormal">%conv.i31 = ashr exact
i16 %sext, 8<o:p></o:p></p>
<p class="MsoNormal">br i1 %tobool, label
%if.else, label %if.then<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">if.then:
; preds = %entry<o:p></o:p></p>
<p class="MsoNormal">%0 = tail call <256
x i32> @llvm.x86.tileloadd64(i16 %row, i16
%conv.i31, i8* getelementptr inbounds ([1024 x
i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32)
#3<o:p></o:p></p>
<p class="MsoNormal">%1 = tail call <256
x i32> @llvm.x86.tileloadd64(i16 %row, i16
%conv.i31, i8* getelementptr inbounds ([1024 x
i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32)
#3<o:p></o:p></p>
<p class="MsoNormal">%2 = tail call <256
x i32> @llvm.x86.tileloadd64(i16 %row, i16
%conv.i31, i8* getelementptr inbounds ([1024 x
i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32)
#3<o:p></o:p></p>
<p class="MsoNormal">br label %if.end<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">if.else:
; preds = %entry<o:p></o:p></p>
<p class="MsoNormal">%3 = tail call <256
x i32> @llvm.x86.tileloadd64(i16 %row, i16
%conv.i31, i8* getelementptr inbounds ([1024 x
i8], [1024 x i8]* @buf2, i64 0, i64 0), i64
32) #3<o:p></o:p></p>
<p class="MsoNormal">%4 = tail call <256
x i32> @llvm.x86.tileloadd64(i16 %row, i16
%conv.i31, i8* getelementptr inbounds ([1024 x
i8], [1024 x i8]* @buf2, i64 0, i64 0), i64
32) #3<o:p></o:p></p>
<p class="MsoNormal">%5 = tail call <256
x i32> @llvm.x86.tileloadd64(i16 %row, i16
%conv.i31, i8* getelementptr inbounds ([1024 x
i8], [1024 x i8]* @buf2, i64 0, i64 0), i64
32) #3<o:p></o:p></p>
<p class="MsoNormal">br label %if.end<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">if.end:
; preds = %if.else, %if.then<o:p></o:p></p>
<p class="MsoNormal">%a.sroa.1186.0 = phi
<256 x i32> [ %3, %if.else ], [ %0,
%if.then ]<o:p></o:p></p>
<p class="MsoNormal">%b.sroa.1068.0 = phi
<256 x i32> [ %4, %if.else ], [ %1,
%if.then ]<o:p></o:p></p>
<p class="MsoNormal">%c.sroa.1149.0 = phi
<256 x i32> [ %5, %if.else ], [ %2,
%if.then ]<o:p></o:p></p>
<p class="MsoNormal">%6 = tail call <256
x i32> @llvm.x86.tdpbssd(i16 %row, i16
%conv.i31, i16 %conv.i31, <256 x i32>
%c.sroa.1149.0, <256 x i32>
%a.sroa.1186.0, <256 x i32>
%b.sroa.1068.0) #3<o:p></o:p></p>
<p class="MsoNormal">tail call void
@llvm.x86.tilestored64(i16 %row, i16
%conv.i31, i8* getelementptr inbounds ([1024 x
i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32,
<256 x i32> %6) #3<o:p></o:p></p>
<p class="MsoNormal">ret void<o:p></o:p></p>
<p class="MsoNormal">}<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo4">
<!--[if !supportLists]--><span
style="mso-list:Ignore">6.<span
style="font:7.0pt "Times New
Roman"">
</span></span><!--[endif]-->Shape
propagation<o:p></o:p></p>
<p class="MsoNormal">In an -O0 build, the front
end generates some general loads and stores of
the tile vector. We need to start from the
AMX intrinsics and propagate the shape
information to the virtual tile registers. If
an AMX intrinsic uses the result of a load
instruction, the shape is propagated to the
load and the load is transformed into a tile load
intrinsic. If a store instruction uses the
result of an AMX intrinsic, the shape is
propagated to the store and the store
is transformed into a tile store intrinsic.<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo4">
<!--[if !supportLists]--><span
style="mso-list:Ignore">7.<span
style="font:7.0pt "Times New
Roman"">
</span></span><!--[endif]-->Machine IR<o:p></o:p></p>
<p class="MsoNormal">Since the AMX intrinsics
take the row and column as input
parameters, we can create a pseudo instruction
corresponding to each of them. The AMX intrinsics
are lowered to pseudo AMX instructions that
carry extra row and column operands
corresponding to the intrinsic parameters. The real AMX
instructions don’t need the row and column
operands; the row and column information
must be configured by ldtilecfg before
executing any AMX instruction.<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo4">
<!--[if !supportLists]--><span
style="mso-list:Ignore">8.<span
style="font:7.0pt "Times New
Roman"">
</span></span><!--[endif]-->Register
allocation<o:p></o:p></p>
<p class="MsoNormal">AMX registers are special. They
need to be configured before use, and the
config instruction is expensive. To avoid
unnecessary tile configuration, we collect as
much tile shape information as possible and
combine it into one ldtilecfg instruction.
The ldtilecfg instruction should dominate any
AMX instruction that accesses a tile register. On
the other hand, the ldtilecfg should
post-dominate the instructions that define the
tile shape. Tile register spills should
avoid re-configuration due to a different tile
shape: a spilled register should be reloaded
into a register that shares the same tile
shape. Since tile register allocation is
special and may need to allocate general virtual
registers to configure the tile registers, we can
add a separate pass to do it before the general
register allocation pass. After register
allocation, the tile shape information is no
longer needed, so we can transform the pseudo
AMX instructions into real AMX instructions by
removing the row and column operands.<o:p></o:p></p>
</blockquote>
<p>This seems complicated.<o:p></o:p></p>
<p>Reading through the documentation, there
appears to be a single global tile config for
all tile registers at any time.<o:p></o:p></p>
<p>Why not simply model this tile config as a
designated special register and the tile
instructions as having an implicit use of this
register? That would seem to ensure that the
register allocator has all the constraints
needed. You'd need to teach it how to spill the
special registers with the appropriate
instructions, but that seems a lot more
straightforward?<o:p></o:p></p>
<blockquote
style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo4">
<!--[if !supportLists]--><span
style="mso-list:Ignore">9.<span
style="font:7.0pt "Times New
Roman"">
</span></span><!--[endif]-->Usage
recommendation <o:p></o:p></p>
<p class="MsoNormal">Due to the shape configuration
issue, we recommend that users define the tile
shape at function entry and
inline functions as much as possible. The AMX
instructions focus on computation rather than
storage, so global variables for tile data are
not recommended.<o:p></o:p></p>
<p class="MsoNormal"><span
style="font-size:10.5pt;line-height:105%"> </span><o:p></o:p></p>
<p class="MsoNormal"><span
style="font-size:10.5pt;line-height:105%">Thanks</span><o:p></o:p></p>
<p class="MsoNormal"><span
style="font-size:10.5pt;line-height:105%">Yuanke</span><o:p></o:p></p>
<pre>_______________________________________________<o:p></o:p></pre>
<pre>LLVM Developers mailing list<o:p></o:p></pre>
<pre><a href="mailto:llvm-dev@lists.llvm.org" moz-do-not-send="true">llvm-dev@lists.llvm.org</a><o:p></o:p></pre>
<pre><a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" moz-do-not-send="true">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><o:p></o:p></pre>
</blockquote>
</blockquote>
</blockquote>
<pre>-- <o:p></o:p></pre>
<pre>Hal Finkel<o:p></o:p></pre>
<pre>Lead, Compiler Technology and Programming Languages<o:p></o:p></pre>
<pre>Leadership Computing Facility<o:p></o:p></pre>
<pre>Argonne National Laboratory<o:p></o:p></pre>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
</div>
</blockquote>
<pre class="moz-signature" cols="72">--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
</body>
</html>