<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
</head>
<body>
<p><br>
</p>
<div class="moz-cite-prefix">On 9/4/20 8:50 AM, Luo, Yuanke wrote:<br>
</div>
<blockquote type="cite" cite="mid:SN6PR11MB3135FD8ECCEAE494295CE9759A2D0@SN6PR11MB3135.namprd11.prod.outlook.com">
<div class="WordSection1">
<p class="MsoNormal">Fix typo<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<b>From:</b> Luo, Yuanke <br>
<b>Sent:</b> Friday, September 4, 2020 9:47 PM<br>
<b>To:</b> 'Hal Finkel' <a class="moz-txt-link-rfc2396E" href="mailto:hfinkel@anl.gov">&lt;hfinkel@anl.gov&gt;</a>; Topper,
Craig <a class="moz-txt-link-rfc2396E" href="mailto:craig.topper@intel.com">&lt;craig.topper@intel.com&gt;</a>; Kaylor, Andrew
<a class="moz-txt-link-rfc2396E" href="mailto:andrew.kaylor@intel.com">&lt;andrew.kaylor@intel.com&gt;</a>; Philip Reames
<a class="moz-txt-link-rfc2396E" href="mailto:listmail@philipreames.com">&lt;listmail@philipreames.com&gt;</a>;
<a class="moz-txt-link-abbreviated" href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>; <a class="moz-txt-link-abbreviated" href="mailto:florian_hahn@apple.com">florian_hahn@apple.com</a>; Lu,
Hongjiu <a class="moz-txt-link-rfc2396E" href="mailto:hongjiu.lu@intel.com">&lt;hongjiu.lu@intel.com&gt;</a><br>
<b>Subject:</b> RE: [llvm-dev] Intel AMX programming model
discussion.<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Hi Hal,<o:p></o:p></p>
<p class="MsoNormal">Generally, your proposal to adapt tile RA
to Greedy RA looks good to me. Thank you! I plan to prototype
the proposal. Since there are three register allocators in the LLVM
infrastructure, we need three schemes to adapt tile RA to each
existing RA. Would you like to finalize the three schemes first, or
would you like to review the remaining part of the AMX programming
model? We have some limitations in supporting dynamic shapes, and
I’d like to hear your advice. A dynamic shape requires that
ldtilecfg post-dominate the points that define the shapes, so we
encourage users to define their shapes at the entry of the
function. Take the code below as an example. Ideally, we hope to
insert ldtilecfg at line 57 to configure a, b, and c, but in this
function c’s shape {row, col} is defined in each if/else
clause, so at line 57 the shape of c is unknown. Do you have
any advice for this problem?</p>
</div>
</blockquote>
<p>In the example below, I'm going to assume that the function calls
are actually to get_row1() and get_row2(), neither of which can be
hoisted.</p>
<p>Just to think about this: First, we're starting the MIR with
intrinsics that take the shape parameters directly. Now you need
to:</p>
<p> 1. Identify "configuration regions". Because reconfiguring must
be done for all registers at once, and because reconfiguring zeros
all of the tile registers, each configuration region is a
connected component in the union of the live ranges of all virtual
tile registers. Thus, first collect the configuration regions via
trivial clustering (two instructions are part of the same
configuration region if they share any live range of a tile
register).</p>
<p> 2. If the region will require more than eight distinct shapes,
then you'll need to calculate a min cut of the region and split the
region by inserting spills/restores, so that each resulting region
requires at most eight shapes.<br>
</p>
<p> 3. If you do it this way, all of the instructions in your code
below will be part of one, big configuration region. Generally,
you want to put the ldtilecfg at the common dominating point of
all of the tile instructions in the region. Now, as you point out
in your example below, we can't simply put the ldtilecfg at the
common dominating point: that point might not actually be
dominated by the definitions of all of the shape inputs needed.<br>
</p>
<p> 4. One thing that you might do is iterative splitting. If not
all of the definitions of the shape inputs dominate the desired
insertion point, first you might try iteratively hoisting the
defining instructions to make it so the definitions do dominate.
If they still don't, then split the ldtilecfg into each successor
of the desired insertion point. Do this recursively until, for
each ldtilecfg, the inputs for each dynamic-shape tile register
size dominate the insertion point.</p>
<p> 5. This procedure, alone, might fail in the case where the
ldtilecfg is sunk past the point of definition of one of the tile
registers. Imagine, in your example below, that there was some use
of the tile registers a and b before the if. In that case, you'll
need to split those live ranges by spilling into memory around the
desired ldtilecfg insertion point. That creates a new
configuration region that you'll insert into the queue of
configuration regions to process.<br>
</p>
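<p>As a rough illustration of steps 1 and 2, here is a minimal Python
sketch (a model only, not LLVM code). It assumes each virtual tile
register's live range is given as a set of instruction indices and that
each register has a known shape; the names
<code>configuration_regions</code> and <code>shapes_per_region</code>
and the input encoding are hypothetical.</p>
<pre>
```python
from collections import defaultdict


def configuration_regions(live_ranges):
    """Group instructions into configuration regions.

    live_ranges maps a virtual tile register name to the set of
    instruction indices its live range covers (hypothetical encoding).
    Two instructions end up in the same region when some tile register
    is live across both, directly or through a chain of registers.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Every instruction covered by one live range joins one cluster.
    for instrs in live_ranges.values():
        ordered = sorted(instrs)
        for instr in ordered:
            union(ordered[0], instr)

    regions = defaultdict(set)
    for instr in list(parent):
        regions[find(instr)].add(instr)
    return sorted(sorted(region) for region in regions.values())


def shapes_per_region(live_ranges, shapes):
    """Return (region, distinct shapes used in it) pairs."""
    result = []
    for region in configuration_regions(live_ranges):
        members = set(region)
        used = {
            shapes[vreg]
            for vreg, instrs in live_ranges.items()
            if members.intersection(instrs)
        }
        result.append((region, used))
    return result
```
</pre>
<p>Any region whose shape set has more than eight entries is one that
step 2 would have to split. In the quoted kernel, a, b, and c are all
live at the final multiply, so a model like this would place every tile
instruction in a single region, as step 3 says.</p>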
<p>I'm sure that this is not the only possible heuristic. This would
be easier, I think, if the hardware did not zero all of the
registers when you reconfigured any of them, but I suppose that it
is what it is at this point.</p>
<p> -Hal<br>
</p>
<p><br>
</p>
<blockquote type="cite"
cite="mid:SN6PR11MB3135FD8ECCEAE494295CE9759A2D0@SN6PR11MB3135.namprd11.prod.outlook.com">
<div class="WordSection1">
<p class="MsoNormal"><o:p></o:p></p>
<pre>52 void kernel(int cond) {
53   _tile a = {row, 8};
54   _tile b = {8, col};
55
56   // copy shape to stack slot
57   // ldtilecfg a, b, c
58   if(cond) {
59     short row = get_row();
60     short col = get_row();
61     _tile c = {row, col};
62     __tile_loadd(&amp;a, buf, STRIDE);
63     __tile_loadd(&amp;b, buf, STRIDE);
64     __tile_loadd(&amp;c, buf, STRIDE);
65   } else {
66     short row = get_row();
67     short col = get_row();
68     _tile c = {row, col};
69     __tile_loadd(&amp;a, buf2, STRIDE);
70     __tile_loadd(&amp;b, buf2, STRIDE);
71     __tile_loadd(&amp;c, buf2, STRIDE);
72   }
73   __tile_dpbsud(&amp;c, a, b);
74   __tile_stored(buf, STRIDE, c);
75 }</pre>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Thanks<o:p></o:p></p>
<p class="MsoNormal">Yuanke<o:p></o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<b>From:</b> Hal Finkel &lt;<a
href="mailto:hfinkel@anl.gov" moz-do-not-send="true">hfinkel@anl.gov</a>&gt;
<br>
<b>Sent:</b> Friday, September 4, 2020 5:59 PM<br>
<b>To:</b> Luo, Yuanke &lt;<a
href="mailto:yuanke.luo@intel.com"
moz-do-not-send="true">yuanke.luo@intel.com</a>&gt;;
Topper, Craig &lt;<a href="mailto:craig.topper@intel.com"
moz-do-not-send="true">craig.topper@intel.com</a>&gt;;
Kaylor, Andrew &lt;<a
href="mailto:andrew.kaylor@intel.com"
moz-do-not-send="true">andrew.kaylor@intel.com</a>&gt;;
Philip Reames &lt;<a
href="mailto:listmail@philipreames.com"
moz-do-not-send="true">listmail@philipreames.com</a>&gt;;
<a href="mailto:llvm-dev@lists.llvm.org"
moz-do-not-send="true">llvm-dev@lists.llvm.org</a>; <a
href="mailto:florian_hahn@apple.com"
moz-do-not-send="true">
florian_hahn@apple.com</a>; Lu, Hongjiu &lt;<a
href="mailto:hongjiu.lu@intel.com"
moz-do-not-send="true">hongjiu.lu@intel.com</a>&gt;<br>
<b>Subject:</b> Re: [llvm-dev] Intel AMX programming model
discussion.<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<p><o:p> </o:p></p>
<div>
<p class="MsoNormal">On 9/4/20 3:37 AM, Luo, Yuanke wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">Hi Hal,<o:p></o:p></p>
<p class="MsoNormal">Thank you for the ideas that help us
improve the design, and sorry for the late reply. There are
some things I am not able to figure out, and there are some special
traits of tile RA.<o:p></o:p></p>
</blockquote>
<p><o:p> </o:p></p>
<p>You're quite welcome.<o:p></o:p></p>
<p><o:p> </o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo2">
<!--[if !supportLists]--><span style="mso-list:Ignore">1.<span
style="font:7.0pt &quot;Times New Roman&quot;">
</span></span><!--[endif]-->X86RegisterInfo::getRegAllocationHints
can tell RA which physical register is preferred, but it
can’t force RA to allocate only the hinted register. If the
hint cannot be met, RA will allocate another
register.<o:p></o:p></p>
</blockquote>
<p><o:p> </o:p></p>
<p>I addressed this below, but I could have been clearer. Like
SystemZRegisterInfo::getRegAllocationHints does sometimes,
when hinting the tile registers, the function will return
true. This turns the preference into a hard constraint, and
the allocator will not allocate any other register. That's my
understanding from reading the code.<o:p></o:p></p>
<p><o:p> </o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo2">
<!--[if !supportLists]--><span style="mso-list:Ignore">2.<span
style="font:7.0pt &quot;Times New Roman&quot;">
</span></span><!--[endif]-->The shape information should
be attached to each virtual register and to each physical register
that is allocated. How can we store and retrieve the shape
information with limited code changes to the existing RA?<o:p></o:p></p>
</blockquote>
<p><o:p> </o:p></p>
<p>For each virtual register, getRegAllocationHints could just
recompute the shape information. If this isn't a constant-time
operation, however, you'll probably want to cache the computed
shape requirements in X86MachineFunctionInfo. You can add a
map from registers to shape information in that class, and
access it from getRegAllocationHints. You can store
information about the physical registers there too.<o:p></o:p></p>
<p>Regarding the physical registers, you can grab this
information in the pre-rewrite phase. Override addPreRewrite
in X86TargetMachine.cpp. You'll need a small pass that records
relevant information about the assignments (which, I imagine,
is the same small pass that updates the LDTILECFG
instructions). For an example of such a pass, see
AMDGPU/GCNNSAReassign.cpp<o:p></o:p></p>
<p><o:p> </o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo2">
<!--[if !supportLists]--><span style="mso-list:Ignore">3.<span
style="font:7.0pt &quot;Times New Roman&quot;">
</span></span><!--[endif]-->When a tile register is
spilled, the shape should also be bound to the corresponding
spill stack slot, so that it can be reloaded into a physical
tile register with the same shape.<o:p></o:p></p>
</blockquote>
<p><o:p> </o:p></p>
<p>I'm not sure what you mean. If you don't want to just be
conservative about the spill size allocation, you do need to
know the shape in order to compute the spill-location size. I
assume that you can grab that out of X86MachineFunctionInfo
from storeRegToStackSlot/loadRegFromStackSlot or
eliminateFrameIndex (or copyPhysReg) as needed.<o:p></o:p></p>
<p><o:p> </o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo2">
<!--[if !supportLists]--><span style="mso-list:Ignore">4.<span
style="font:7.0pt &quot;Times New Roman&quot;">
</span></span><!--[endif]-->There is no mov/copy
instruction for tile registers. To copy a tile register, we
need to store it to memory and load the data
from memory into another register. So much of the live-interval
splitting code in Greedy RA is unnecessary for tile register
allocation.<o:p></o:p></p>
</blockquote>
<p><o:p> </o:p></p>
<p>Yes, but this just means that you need to support copying
through memory. Setting CopyCost = -1 in X86RegisterInfo.td
might help as well.<o:p></o:p></p>
<p><o:p> </o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo2">
<!--[if !supportLists]--><span style="mso-list:Ignore">5.<span
style="font:7.0pt &quot;Times New Roman&quot;">
</span></span><!--[endif]-->The compiler can support register
spills, but spills should be avoided for performance reasons.
We prefer to report a warning on a register spill, so that users
notice it and can adjust their code to avoid the
spill.<o:p></o:p></p>
</blockquote>
<p><o:p> </o:p></p>
<p>If you want to emit a diagnostic, you may be able to do that
from storeRegToStackSlot. In any case, please make use of the
optimization-remark infrastructure. For an example of how to
do this, see RAGreedy::reportNumberOfSplillsReloads in
RegAllocGreedy.cpp.<o:p></o:p></p>
<p><o:p> </o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">If there is no easy way to take
advantage of the current RA infrastructure, there are some
advantages to having a separate RA for tile registers.<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l1
level1 lfo4">
<!--[if !supportLists]--><span style="mso-list:Ignore">1.<span
style="font:7.0pt &quot;Times New Roman&quot;">
</span></span><!--[endif]-->We can limit the risk of breaking
RA for general registers on each architecture. If there are bugs
in tile RA, only applications that use AMX are affected.<o:p></o:p></p>
</blockquote>
<p><o:p> </o:p></p>
<p>That's true. But I also worry about that. Any time you need
to write non-trivial code that will be used relatively rarely,
it's likely to have bugs that take a long time to show up. If
you can plug into the generic infrastructure, you benefit from
the fact that it's highly-covered, often-used code. Not that
you might not run into bugs, of course, especially if you're
using it in a new way, but the base logic is likely to already
be robust.<o:p></o:p></p>
<p><o:p> </o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l1
level1 lfo4">
<!--[if !supportLists]--><span style="mso-list:Ignore">2.<span
style="font:7.0pt &quot;Times New Roman&quot;">
</span></span><!--[endif]-->We can customize the special
traits (config, split, spill) of tile registers more freely in the
separate RA.<o:p></o:p></p>
</blockquote>
<p><o:p> </o:p></p>
<p>True.<o:p></o:p></p>
<p> -Hal<o:p></o:p></p>
<p><o:p> </o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">For RegAllocFast, I agree with you. Each
register region is small, and since performance is
not the first priority, we can insert multiple configs, one for
each small region.<o:p></o:p></p>
<p class="MsoNormal">As you recommend looking at the PBQP
solver, I’ll take some time to investigate it and get back to
you.<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">Thanks<o:p></o:p></p>
<p class="MsoNormal">-Yuanke<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<b>From:</b> Hal Finkel <a
href="mailto:hfinkel@anl.gov" moz-do-not-send="true">&lt;hfinkel@anl.gov&gt;</a>
<br>
<b>Sent:</b> Monday, August 24, 2020 5:03 PM<br>
<b>To:</b> Luo, Yuanke <a
href="mailto:yuanke.luo@intel.com"
moz-do-not-send="true">&lt;yuanke.luo@intel.com&gt;</a>;
Topper, Craig
<a href="mailto:craig.topper@intel.com"
moz-do-not-send="true">&lt;craig.topper@intel.com&gt;</a>;
Kaylor, Andrew
<a href="mailto:andrew.kaylor@intel.com"
moz-do-not-send="true">&lt;andrew.kaylor@intel.com&gt;</a>;
Philip Reames
<a href="mailto:listmail@philipreames.com"
moz-do-not-send="true">&lt;listmail@philipreames.com&gt;</a>;
<a href="mailto:llvm-dev@lists.llvm.org"
moz-do-not-send="true">
llvm-dev@lists.llvm.org</a>; <a
href="mailto:florian_hahn@apple.com"
moz-do-not-send="true">florian_hahn@apple.com</a>; Lu,
Hongjiu
<a href="mailto:hongjiu.lu@intel.com"
moz-do-not-send="true">&lt;hongjiu.lu@intel.com&gt;</a><br>
<b>Subject:</b> Re: [llvm-dev] Intel AMX programming
model discussion.<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<p>Hi, Yuanke,<o:p></o:p></p>
<p>Thanks for writing this up. Let me back up a bit because
the scheme I proposed last week doesn't work without further
modification: within a particular "configuration region"
(i.e., the code in between the LDTILECFG and the TILERELEASE
(or next LDTILECFG)), each tile register can only be used
with one shape, and in addition, no register can have its
shape changed without zeroing out all of the tile registers.
Thus, just using different register classes for the
different shapes, as I had suggested, isn't sufficient to
model the allocation requirements. That would not prevent
the same register from essentially being assigned to
differently-shaped virtual registers with non-overlapping
live ranges within one configuration region.<o:p></o:p></p>
<p>Also, as you point out, when multiple non-static tile
shapes are in use, if you use one register class for each
shape, you would need different register classes for these
too. Luckily, I don't think that using the separate register
classes actually buys us anything, so please disregard that
suggestion of mine. Use only one register class.<o:p></o:p></p>
<p>Once the configuration regions are identified, you'll know
how many tile register shapes are required. If this number
is greater than eight, then you'll need to cut the region
(requiring all live tiles to be spilled and restored around
each re-configuration point). After that, we'll assume that
we have eight or fewer distinct shapes.<o:p></o:p></p>
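<p>One simple way to pick cut points, shown as a hedged Python sketch
(a model only, not LLVM code), is a greedy linear scan over
straight-line code. This is a stand-in for illustration, not a min cut,
and it ignores control flow. The function name
<code>split_region</code> and the list-of-pairs input are
hypothetical.</p>
<pre>
```python
MAX_SHAPES = 8  # distinct shapes one configuration can hold, per the text


def split_region(tile_uses, max_shapes=MAX_SHAPES):
    """Greedily cut an ordered region into sub-regions.

    tile_uses is a list of (instruction, shape) pairs in program order
    (hypothetical encoding). A new sub-region starts whenever the next
    instruction would introduce one shape too many. In a real compiler,
    every live tile would be spilled before each cut and reloaded after
    the reconfiguration.
    """
    regions, current, shapes = [], [], set()
    for instr, shape in tile_uses:
        if shape not in shapes and len(shapes) == max_shapes:
            regions.append(current)
            current, shapes = [], set()
        current.append(instr)
        shapes.add(shape)
    if current:
        regions.append(current)
    return regions
```
</pre>
<p>With ten instructions that each use a different shape, this yields
one sub-region of eight instructions followed by one of two. A real
implementation would also weigh how many tiles are live at each
candidate cut, since those are the ones that must be spilled.</p>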
<p>Now the problem is that you need to allocate registers,
satisfying all of the usual constraints (non-overlapping
live ranges, etc.), but with an additional constraint: once
a physical register has been used with some particular tile
shape, it cannot be assigned to any other tile shape.<o:p></o:p></p>
<p>I think that the current infrastructure can support this as
follows:<o:p></o:p></p>
<p> 1. Add an override X86RegisterInfo::getRegAllocationHints.
Like SystemZRegisterInfo::getRegAllocationHints does
sometimes, when hinting the tile registers, the function
will return true (to indicate a hard constraint). As
registers are assigned in RegAllocGreedy,
getRegAllocationHints is called for each virtual register.
For virtual tile registers, look at the passed VirtRegMap,
etc. for already-assigned tile virtual registers with
different shape requirements as the current virtual register
(you'll need to cache the shape requirements in
X86MachineFunctionInfo for this to be efficient), and return
a hints list consisting of all other non-reserved tile
registers.<o:p></o:p></p>
<p> 2. To support RegAllocFast, which doesn't use
getRegAllocationHints, you would need to make the
configuration regions small enough that it doesn't matter
(and if you're doing this around every tile instruction,
this is automatically true).<o:p></o:p></p>
<p> 3. To support RegAllocPBQP (which is likely a good thing
to do, but probably not required), I believe you can support
this by adding custom constraints to the solver (kind of
like what AArch64PBQPRegAlloc.cpp does).<o:p></o:p></p>
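<p>To make the extra constraint concrete, here is a small Python model
of the hint filtering in item 1 (a sketch under simplifying
assumptions, not the LLVM API). It assumes eight physical tile
registers, live ranges given as closed intervals, and a fixed
allocation order; <code>hints</code> and <code>allocate</code> are
hypothetical names.</p>
<pre>
```python
TILE_REGS = [f"tmm{i}" for i in range(8)]


def hints(shape, bound_shape):
    """Physical registers a virtual register of `shape` may still use.

    A register is excluded once it has been bound to a different shape
    anywhere in the configuration region.
    """
    return [r for r in TILE_REGS if bound_shape.get(r, shape) == shape]


def allocate(vregs):
    """Assign registers; vregs is a list of (name, shape, start, end)."""
    bound_shape = {}  # physical register -> shape it is configured with
    occupied = {}     # physical register -> list of (start, end) intervals
    assignment = {}
    for name, shape, start, end in vregs:
        for reg in hints(shape, bound_shape):
            busy = any(
                e >= start and end >= s for s, e in occupied.get(reg, [])
            )
            if not busy:
                bound_shape[reg] = shape
                occupied.setdefault(reg, []).append((start, end))
                assignment[name] = reg
                break
        else:
            # No register left: a real allocator would spill or the
            # configuration region would have to be split.
            assignment[name] = None
    return assignment
```
</pre>
<p>Given a 1x1 register %2 live over [0, 1] and a 1x2 register %3 live
over [2, 3], the model keeps %3 off tmm0 even though tmm0 is free by
then, which is exactly the reuse that separate register classes alone
would not prevent.</p>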
<p>Once the allocation process is complete, you'll need to go
back and update the LDTILECFG data to reflect the chosen
shape -> register mapping.<o:p></o:p></p>
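<p>For illustration, the data that such an update has to produce can be
sketched in Python. This assumes the 64-byte palette-1 layout as I
understand it from Intel's documentation (palette id in byte 0,
bytes-per-row as 16-bit little-endian values starting at byte 16, row
counts as single bytes starting at byte 48); please check it against
the ISA reference before relying on it. The function name
<code>build_tile_config</code> is hypothetical.</p>
<pre>
```python
def build_tile_config(reg_shapes, palette=1):
    """Pack a 64-byte LDTILECFG memory operand (assumed layout).

    reg_shapes maps a physical tile register index (0 to 7) to a
    (rows, bytes_per_row) pair chosen after register allocation.
    Registers that are not listed stay zeroed, i.e. unconfigured.
    """
    cfg = bytearray(64)
    cfg[0] = palette  # byte 0: palette id; byte 1 (start_row) stays 0
    for reg, (rows, colsb) in reg_shapes.items():
        if reg not in range(8):
            raise ValueError(f"no such tile register: tmm{reg}")
        if rows not in range(1, 17) or colsb not in range(1, 65):
            raise ValueError(f"shape out of range for tmm{reg}")
        offset = 16 + 2 * reg
        cfg[offset:offset + 2] = colsb.to_bytes(2, "little")
        cfg[48 + reg] = rows
    return bytes(cfg)
```
</pre>
<p>Note that the hardware field is bytes per row, not elements per row,
so a {row, col} shape from the source level has to be scaled by the
element size before it is stored.</p>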
<p>What I don't know, however, is how well the
getRegAllocationHints method will work. The benefit is that
you don't need to write a custom pre-allocator allocator. On
the other hand, it might visit the virtual registers to
assign in a suboptimal order because it doesn't really
understand the constraint being imposed (generally, we just
assign larger live ranges first). Then again, it is a
greedy algorithm and if you want something systematically
closer to optimal, maybe you should be using PBQP anyway. If
you do end up needing a custom allocator for these, I
recommend looking at the PBQP solver (which, as I recall, is
independently reusable).<o:p></o:p></p>
<p>Hopefully, this is more-helpful advice.<o:p></o:p></p>
<p> -Hal<o:p></o:p></p>
<div>
<p class="MsoNormal">On 8/21/20 9:54 PM, Luo, Yuanke wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<div>
<p class="MsoNormal">It seems I made a mistake about sharing
register units. Can we share a register unit among tile
registers that are in different tile register classes
(each register class has a different tile shape)?
Think about two virtual tile registers,
<i>%2:vtile1x1 </i>and <i>%3:vtile1x2</i>. First %2 is
allocated to $tmm0; after that, %2 is killed and %3 is
allocated to $tmm0. This is not allowed, because when
$tmm0 is allocated to %2, its shape is configured to
1x1. If we reallocate $tmm0 to %3, then we need to
reconfigure $tmm0 to 1x2, which causes $tmm0~$tmm7 to be
clobbered.<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">Yuanke<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<b>From:</b> Luo, Yuanke <br>
<b>Sent:</b> Friday, August 21, 2020 2:12 PM<br>
<b>To:</b> Hal Finkel <a
href="mailto:hfinkel@anl.gov"
moz-do-not-send="true">&lt;hfinkel@anl.gov&gt;</a>;
Topper, Craig
<a href="mailto:craig.topper@intel.com"
moz-do-not-send="true">&lt;craig.topper@intel.com&gt;</a>;
Kaylor, Andrew
<a href="mailto:andrew.kaylor@intel.com"
moz-do-not-send="true">&lt;andrew.kaylor@intel.com&gt;</a>;
Philip Reames
<a href="mailto:listmail@philipreames.com"
moz-do-not-send="true">&lt;listmail@philipreames.com&gt;</a>;
<a href="mailto:llvm-dev@lists.llvm.org"
moz-do-not-send="true">
llvm-dev@lists.llvm.org</a>; <a
href="mailto:florian_hahn@apple.com"
moz-do-not-send="true">florian_hahn@apple.com</a>;
Lu, Hongjiu
<a href="mailto:hongjiu.lu@intel.com"
moz-do-not-send="true">&lt;hongjiu.lu@intel.com&gt;</a><br>
<b>Subject:</b> RE: [llvm-dev] Intel AMX programming
model discussion.<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">Hi Hal,<o:p></o:p></p>
<p class="MsoNormal">The proposal is attractive to me, but
there is something I still can’t figure out. Let’s take
the MIR below as an example. We assume we have 256 register
classes (vtile1x1, vtile1x2, …, vtile16x16).<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l3
level1 lfo6">
<!--[if !supportLists]--><span style="mso-list:Ignore">1.<span
style="font:7.0pt &quot;Times New Roman&quot;">
</span></span><!--[endif]-->After instruction
selection, the pseudo AMX instructions are generated. The
names of the pseudo instructions have a ‘P’ prefix. At this point all the
AMX pseudo instructions take vtile as their register class.
Let’s assume %13 is constant 3, %10 is constant 4 and
%14 is variable.<o:p></o:p></p>
<p class="MsoNormal"><i> %1:vtile = <b><span
style="color:red">P</span></b>TILELOADDV %13:gr16,
%10:gr16, %17:gr64, 1, %18:gr64_nosp, 0, $noreg</i><o:p></o:p></p>
<p class="MsoNormal"><i> %2:vtile = <b>P</b>TILELOADDV
%10:gr16, %14:gr16, %17:gr64, 1, %18:gr64_nosp, 0,
$noreg</i><o:p></o:p></p>
<p class="MsoNormal"><i> %3:vtile = <b>P</b>TILELOADDV
%13:gr16, %14:gr16, %17:gr64, 1, %18:gr64_nosp, 0,
$noreg</i><o:p></o:p></p>
<p class="MsoNormal"><i>%21:vtile = <b>P</b>TDPBSSDV
%13:gr16, %10:gr16, %14:gr16, %3:vtile(tied-def 0),
%1:vtile, %2:vtile
</i><o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l3
level1 lfo6">
<!--[if !supportLists]--><span style="mso-list:Ignore">2.<span
style="font:7.0pt &quot;Times New Roman&quot;">
</span></span><!--[endif]-->The
configuration-placement pass looks at all of the AMX
pseudo-instructions and identifies regions in which the
pseudo-instructions use the same configuration
parameters. It first replaces the register class for all
tile registers whose shape is known at compile time.
Since the shape of %1 is constant, it replaces
%1:vtile with %1:vtile3x4, which changes the register
class and morphs the pseudo instruction into a real AMX
instruction. The shapes of %2 and %3 are unknown at
compile time, so it arbitrarily picks tile register
classes that have not been assigned before and assigns
them to %2 and %3. After register class
allocation, the code is transformed as follows, with
the register classes %2:vtile1x1 and %3:vtile1x2
allocated.
<o:p></o:p></p>
<p class="MsoNormal"><i> <b>P</b>LDTILECFG</i><o:p></o:p></p>
<p class="MsoNormal"><i> %1:vtile3x4 = TILELOADDV
%17:gr64, 1, %18:gr64_nosp, 0, $noreg</i><o:p></o:p></p>
<p class="MsoNormal"><i> %2:vtile1x1 = TILELOADDV
%17:gr64, 1, %18:gr64_nosp, 0, $noreg</i><o:p></o:p></p>
<p class="MsoNormal"><i> %3:vtile1x2 = TILELOADDV
%17:gr64, 1, %18:gr64_nosp, 0, $noreg</i><o:p></o:p></p>
<p class="MsoNormal"><i>%21:vtile1x2 = TDPBSSDV
%3:vtile1x2(tied-def 0), %1:vtile3x4, %2:vtile1x1
</i><o:p></o:p></p>
<p class="MsoNormal">Some things I have not figured out:<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l2
level1 lfo8">
<!--[if !supportLists]--><span style="mso-list:Ignore">1.<span
style="font:7.0pt &quot;Times New Roman&quot;">
</span></span><!--[endif]-->I am not sure whether an
AMX instruction’s inputs and outputs can fit multiple
register classes (vtile1x1, …, vtile16x16); otherwise we
need 256 pseudo instructions.<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l2
level1 lfo8">
<!--[if !supportLists]--><span style="mso-list:Ignore">2.<span
style="font:7.0pt &quot;Times New Roman&quot;">
</span></span><!--[endif]-->Whether 256 register classes
are enough to allocate from. There may be more than 256
tile registers of unknown shape.<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l2
level1 lfo8">
<!--[if !supportLists]--><span style="mso-list:Ignore">3.<span
style="font:7.0pt &quot;Times New Roman&quot;">
</span></span><!--[endif]-->In this pass we also find
the proper point (common dominator) to insert
ldtilecfg, but at this time the registers are not yet allocated,
so we don’t know the shape of each physical tile register.
We therefore just insert a pseudo tile config instruction.<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l3
level1 lfo6">
<!--[if !supportLists]--><span style="mso-list:Ignore">3.<span
style="font:7.0pt &quot;Times New Roman&quot;">
</span></span><!--[endif]-->All tile register classes
share the same register unit. We do register allocation
with the framework, and the code is transformed as follows.<o:p></o:p></p>
<p class="MsoNormal"><i> $tmm0 = TILELOADDV %17:gr64, 1,
%18:gr64_nosp, 0, $noreg</i><o:p></o:p></p>
<p class="MsoNormal"><i> $tmm1 = TILELOADDV %17:gr64, 1,
%18:gr64_nosp, 0, $noreg</i><o:p></o:p></p>
<p class="MsoNormal"><i> $tmm2 = TILELOADDV %17:gr64, 1,
%18:gr64_nosp, 0, $noreg</i><o:p></o:p></p>
<p class="MsoNormal"><i>$tmm2 = TDPBSSDV $tmm2(tied-def
0), $tmm0, $tmm1</i><o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l3
level1 lfo6">
<!--[if !supportLists]--><span style="mso-list:Ignore">4.<span
style="font:7.0pt &quot;Times New Roman&quot;">
</span></span><!--[endif]-->Run the config pass to collect
the shape of each physical tile register and configure
them. The code can be generated as below. Here is the
problem: how can we know the shape of each physical tile
register?<o:p></o:p></p>
<p class="MsoNormal"><b><i> MOV row, col info to
%stack.0 for each physical tile register ??????</i></b><o:p></o:p></p>
<p class="MsoNormal"><b><i> LDTILECFG %stack.0, 1,
$noreg, 0, $noreg, implicit-def $tmm0, implicit-def
$tmm1, implicit-def $tmm2, implicit-def $tmm3,
implicit-def $tmm4, implicit-def $tmm5, implicit-def
$tmm6, implicit-def $tmm7</i></b><o:p></o:p></p>
<p class="MsoNormal"><i> $tmm0 = TILELOADDV %17:gr64, 1,
%18:gr64_nosp, 0, $noreg</i><o:p></o:p></p>
<p class="MsoNormal"><i> $tmm1 = TILELOADDV %17:gr64, 1,
%18:gr64_nosp, 0, $noreg</i><o:p></o:p></p>
<p class="MsoNormal"><i> $tmm2 = TILELOADDV %17:gr64, 1,
%18:gr64_nosp, 0, $noreg</i><o:p></o:p></p>
<p class="MsoNormal"><i>$tmm2 = TDPBSSDV $tmm2(tied-def
0), $tmm0, $tmm1</i><o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">Thanks<o:p></o:p></p>
<p class="MsoNormal">Yuanke<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
... <o:p></o:p></p>
</div>
</blockquote>
<pre>-- <o:p></o:p></pre>
<pre>Hal Finkel<o:p></o:p></pre>
<pre>Lead, Compiler Technology and Programming Languages<o:p></o:p></pre>
<pre>Leadership Computing Facility<o:p></o:p></pre>
<pre>Argonne National Laboratory<o:p></o:p></pre>
</blockquote>
</div>
</blockquote>
<pre class="moz-signature" cols="72">--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
</body>
</html>