<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
</head>
<body>
<p>Hi, Andy,</p>
<p>I don't quite understand everything that's going on here. Could
we model this as:</p>
<p> 1. Define a collection of register classes, one for 2x4 tiles,
one for 4x2 tiles, etc., each populated with a set of tile
registers. Registers can have aliasing relationships (instead of
worrying about any kind of subregister/superregister relationships --
these won't be useful anyway).</p>
<p> 2. Define the tile-configuration instructions so that they
implicitly define all of the registers in all of the classes.</p>
<p>Then you would still need to pre-schedule the tile operations as
you've described, and collect the configuration information in
order to add the ldtilecfgs, but the regular register allocator
can handle the allocation itself in the usual way. What do you
think?<br>
</p>
<p> -Hal<br>
</p>
<div class="moz-cite-prefix">On 8/18/20 6:58 PM, Kaylor, Andrew via
llvm-dev wrote:<br>
</div>
<blockquote type="cite" cite="mid:BY5PR11MB4370829FF4A7BDBF90DE7F52E85C0@BY5PR11MB4370.namprd11.prod.outlook.com">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin-top:0in;
margin-right:0in;
margin-bottom:8.0pt;
margin-left:0in;
line-height:105%;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:#0563C1;
text-decoration:underline;}
pre
{mso-style-priority:99;
mso-style-link:"HTML Preformatted Char";
margin:0in;
margin-bottom:.0001pt;
font-size:10.0pt;
font-family:"Courier New";}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
{mso-style-priority:34;
margin-top:0in;
margin-right:0in;
margin-bottom:8.0pt;
margin-left:0in;
text-indent:21.0pt;
line-height:105%;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
span.HTMLPreformattedChar
{mso-style-name:"HTML Preformatted Char";
mso-style-priority:99;
mso-style-link:"HTML Preformatted";
font-family:"Courier New";}
span.EmailStyle23
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
ol
{margin-bottom:0in;}
ul
{margin-bottom:0in;}</style>
<div class="WordSection1">
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
The AMX registers are complicated. The single configuration
register (which is mostly used implicitly, similar to MXCSR
for floating point) controls the shape of all the tile
registers, and if you change the tile configuration, every
single tile register is cleared. In practice, if we have to
change the configuration while any of the tile registers
are live, performance is going to be terrible. We need to
handle this case for correctness, but users of this
programming interface will need to have enough awareness of
the performance issues and the hardware details to prevent
this. We'll also want a diagnostic that lets the user know
when this has happened.<o:p></o:p></p>
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<o:p> </o:p></p>
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
When the tile configuration is set, the shape of each tile is
locked in, so the individual tile registers aren't
interchangeable at that point. If a function needs 2x4 tiles,
4x2 tiles, and 4x4 tiles, the configuration needs to be set
with this in mind. The shape isn't explicit in every
instruction and intrinsic. It must be deduced. And again,
we'll need a way to tell the user when efficient allocation
can't be done. In practice, I don't expect any function to be
using more than three tile shapes.<o:p></o:p></p>
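<p class="MsoNormal">To make the "locked in" point concrete, here is a minimal sketch of the 64-byte configuration block that ldtilecfg reads, as I understand the palette 1 layout from the instruction set reference: palette in byte 0, bytes per row starting at offset 16, row counts starting at offset 48. The names tile_config, tile_shape, and build_tile_config are made up for illustration, and the code only fills the buffer; it does not execute any AMX instruction.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 64-byte memory operand for ldtilecfg (palette 1 layout, as assumed above). */
typedef struct {
    uint8_t  palette_id;    /* byte 0: 1 selects the 8-tile palette */
    uint8_t  start_row;     /* byte 1: 0 for a fresh configuration */
    uint8_t  reserved0[14]; /* bytes 2-15: must be zero */
    uint16_t colsb[16];     /* bytes 16-47: bytes per row; entries 0-7 used */
    uint8_t  rows[16];      /* bytes 48-63: row count; entries 0-7 used */
} tile_config;

_Static_assert(sizeof(tile_config) == 64, "config block must be 64 bytes");
_Static_assert(offsetof(tile_config, colsb) == 16, "colsb starts at byte 16");
_Static_assert(offsetof(tile_config, rows) == 48, "rows start at byte 48");

/* Hypothetical helper type: the shape chosen for one physical tile. */
typedef struct {
    uint8_t  rows;   /* 1..16 */
    uint16_t colsb;  /* 1..64 bytes per row */
} tile_shape;

/* Fill cfg so that physical tile i gets shapes[i], for n <= 8 tiles.
 * Returns 0 on success and -1 if n or any shape is out of range
 * (cfg contents are unspecified after a failure). */
int build_tile_config(tile_config *cfg, const tile_shape *shapes, int n)
{
    if (n < 0 || n > 8)
        return -1;
    memset(cfg, 0, sizeof *cfg);   /* reserved bytes and unused tiles stay 0 */
    cfg->palette_id = 1;
    for (int i = 0; i < n; ++i) {
        if (shapes[i].rows == 0 || shapes[i].rows > 16 ||
            shapes[i].colsb == 0 || shapes[i].colsb > 64)
            return -1;
        cfg->rows[i]  = shapes[i].rows;
        cfg->colsb[i] = shapes[i].colsb;
    }
    return 0;
}
```

<p class="MsoNormal">Every tile's shape lives in this one block, so asking for a different shape in any single tile means reloading the whole configuration, which is the expensive case described above.</p>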
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<o:p> </o:p></p>
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
The implication of all this is that I don't think the greedy
register allocator is well suited to figure all of this out.
We need a special pass to pre-allocate these registers. If the
function is written in a way that makes good performance
possible, it should be a relatively simple task to allocate
everything with minimal spilling. If it isn't possible to get
good performance, we don't need to do anything especially
clever. We can just do something straightforward that is
correct and let the user know that they aren't going to be
happy with the results.<o:p></o:p></p>
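<p class="MsoNormal">As a rough illustration of what such a pre-allocation pass might do, here is a sketch under simplifying assumptions: each virtual tile has a known shape and a single live interval, and a physical tile is reused only for a virtual tile of the same shape. The vtile type and preallocate_tiles are hypothetical names, not existing LLVM code. Anything it cannot place is counted, which is where the diagnostic (and a spill or a second configuration) would come in.</p>

```c
#include <assert.h>

enum { NUM_PHYS_TILES = 8 };

/* Hypothetical description of one virtual tile register. */
typedef struct {
    int rows, colsb;  /* shape deduced for this virtual tile */
    int start, end;   /* live interval [start, end) in instruction order */
    int phys;         /* output: assigned physical tile, or -1 */
} vtile;

/* Assign each virtual tile to a physical tile whose locked-in shape matches.
 * v must be sorted by start. Returns the number of virtual tiles that could
 * not be placed under a single configuration. */
int preallocate_tiles(vtile *v, int n)
{
    int rows[NUM_PHYS_TILES], colsb[NUM_PHYS_TILES], busy_until[NUM_PHYS_TILES];
    int used = 0, failed = 0;

    for (int i = 0; i < n; ++i) {
        assert(i == 0 || v[i - 1].start <= v[i].start);
        v[i].phys = -1;

        /* First try a physical tile that already has this shape and is free. */
        for (int p = 0; p < used; ++p) {
            if (rows[p] == v[i].rows && colsb[p] == v[i].colsb &&
                busy_until[p] <= v[i].start) {
                v[i].phys = p;
                break;
            }
        }
        /* Otherwise lock a so-far-unused physical tile to this shape. */
        if (v[i].phys < 0 && used < NUM_PHYS_TILES) {
            rows[used] = v[i].rows;
            colsb[used] = v[i].colsb;
            v[i].phys = used++;
        }
        if (v[i].phys < 0) {
            ++failed;  /* would need a spill or a re-configuration */
            continue;
        }
        busy_until[v[i].phys] = v[i].end;
    }
    return failed;
}
```

<p class="MsoNormal">With three shapes and a handful of live tiles this places everything trivially; a nonzero return value is the "you aren't going to be happy" case.</p>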
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">-Andy<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<b>From:</b> Philip Reames
<a class="moz-txt-link-rfc2396E" href="mailto:listmail@philipreames.com">&lt;listmail@philipreames.com&gt;</a> <br>
<b>Sent:</b> Friday, August 14, 2020 8:29 PM<br>
<b>To:</b> Luo, Yuanke <a class="moz-txt-link-rfc2396E" href="mailto:yuanke.luo@intel.com">&lt;yuanke.luo@intel.com&gt;</a>;
<a class="moz-txt-link-abbreviated" href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>; <a class="moz-txt-link-abbreviated" href="mailto:florian_hahn@apple.com">florian_hahn@apple.com</a>; Kaylor,
Andrew <a class="moz-txt-link-rfc2396E" href="mailto:andrew.kaylor@intel.com">&lt;andrew.kaylor@intel.com&gt;</a>; Topper, Craig
<a class="moz-txt-link-rfc2396E" href="mailto:craig.topper@intel.com">&lt;craig.topper@intel.com&gt;</a>; Lu, Hongjiu
<a class="moz-txt-link-rfc2396E" href="mailto:hongjiu.lu@intel.com">&lt;hongjiu.lu@intel.com&gt;</a><br>
<b>Subject:</b> Re: [llvm-dev] Intel AMX programming model
discussion.<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<p>I find your answer unconvincing. I'm not going to debate it
as I don't wish to take the time to build the appropriate
context, but my initial response is skepticism.<o:p></o:p></p>
<p>Philip<o:p></o:p></p>
<div>
<p class="MsoNormal">On 8/14/20 4:49 PM, Luo, Yuanke wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">[Yuanke] The AMX register is special. It
needs to be configured before use, and the config instruction
is expensive. To avoid unnecessary tile configuration, we
collect as much tile shape information as possible and
combine it into one ldtilecfg instruction. The ldtilecfg
instruction should dominate any AMX instruction that accesses
a tile register. On the other hand, the ldtilecfg should
post-dominate the instructions that define the tile shape.
A tile register spill should avoid re-configuration due to
a different tile shape: the spilled register should be
reloaded into a register that shares the same tile shape.
Since tile register allocation is special and it may
allocate general virtual registers to configure the tile
registers, we can add a separate pass to do it before the
general register allocation pass. After register allocation, the
tile shape information is not needed anymore, so we can
transform the pseudo AMX instructions into real AMX instructions
by removing the row and column operands.<o:p></o:p></p>
<p>[Philip]<o:p></o:p></p>
<p>This seems complicated.<o:p></o:p></p>
<p>Reading through the documentation, there appears to be a
single global tile config for all tile registers at any
time.<o:p></o:p></p>
<p>Why not simply model this tile config as a designated
special register and the tile instructions as having an
implicit use of this register? That would seem to ensure
that the register allocator has all the constraints needed.
You'd need to teach it how to spill the special registers
with the appropriate instructions, but that seems a lot more
straightforward?<o:p></o:p></p>
<p class="MsoNormal"><span
style="font-size:10.5pt;line-height:105%">[Yuanke] In that
case users would need to configure the tile registers
themselves. Spilling the configuration register is very expensive,
because it clears all the tile data registers to zero. In
our proposal, the compiler is responsible for deducing the shape
of the virtual tile data registers, allocating physical
registers for them, and then configuring those physical
registers. We may build the dependency as you proposed, and
it can be used by a machine IR check to ensure each tile data
register is configured before use. </span><o:p></o:p></p>
<p class="MsoNormal"><span
style="font-size:10.5pt;line-height:105%"> </span><o:p></o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<b>From:</b> Philip Reames <a
href="mailto:listmail@philipreames.com"
moz-do-not-send="true">&lt;listmail@philipreames.com&gt;</a>
<br>
<b>Sent:</b> Saturday, August 15, 2020 1:17 AM<br>
<b>To:</b> Luo, Yuanke <a
href="mailto:yuanke.luo@intel.com"
moz-do-not-send="true">&lt;yuanke.luo@intel.com&gt;</a>;
<a href="mailto:llvm-dev@lists.llvm.org"
moz-do-not-send="true">llvm-dev@lists.llvm.org</a>; <a
href="mailto:florian_hahn@apple.com"
moz-do-not-send="true">
florian_hahn@apple.com</a>; Kaylor, Andrew <a
href="mailto:andrew.kaylor@intel.com"
moz-do-not-send="true">
&lt;andrew.kaylor@intel.com&gt;</a>; Topper, Craig <a
href="mailto:craig.topper@intel.com"
moz-do-not-send="true">
&lt;craig.topper@intel.com&gt;</a>; Lu, Hongjiu <a
href="mailto:hongjiu.lu@intel.com"
moz-do-not-send="true">&lt;hongjiu.lu@intel.com&gt;</a><br>
<b>Subject:</b> Re: [llvm-dev] Intel AMX programming
model discussion.<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<p> <o:p></o:p></p>
<div>
<p class="MsoNormal">On 8/14/20 6:27 AM, Luo, Yuanke via
llvm-dev wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">Hi,<o:p></o:p></p>
<p class="MsoNormal">Intel Advanced Matrix Extensions (Intel
AMX) is a new programming paradigm consisting of two
components: a set of 2-dimensional registers (tiles)
representing sub-arrays from a larger 2-dimensional memory
image, and accelerators able to operate on tiles.
The capabilities of an Intel AMX implementation are enumerated by
palettes. Two palettes are supported: palette 0 represents
the initialized state, and palette 1 consists of 8 tile
registers of up to 1 KB each, which are controlled by a
tile control register.<o:p></o:p></p>
<p class="MsoNormal">The instruction manual is posted at <a
href="https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html"
moz-do-not-send="true">
https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html</a>.<o:p></o:p></p>
<p class="MsoNormal">The AMX ABI proposal is posted at <a
href="https://groups.google.com/g/x86-64-abi/c/NRejFm7pwb4"
moz-do-not-send="true">
https://groups.google.com/g/x86-64-abi/c/NRejFm7pwb4</a>.<o:p></o:p></p>
<p class="MsoNormal">This email is to discuss the
programming model for AMX. Florian has introduced the
matrix type and intrinsics in the LLVM community. We'd like to
adopt some ideas from that work.<o:p></o:p></p>
<p class="MsoNormal">Here is what we propose for the AMX
programming model.<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo2">
<!--[if !supportLists]--><span style="mso-list:Ignore">1.<span
style="font:7.0pt &quot;Times New Roman&quot;">
</span></span><!--[endif]--> Data type. <o:p></o:p></p>
<p class="MsoNormal">We'd like to have a fixed vector type for
AMX. Since the shape of an AMX register is configurable,
the vector size is the maximum size of an AMX register. That
means the vector size is 1024 bytes.<o:p></o:p></p>
<p class="MsoNormal">The C code may look like this.<o:p></o:p></p>
<p class="MsoNormal">typedef int _tile_data
__attribute__((__vector_size__(1024), __aligned__(64)));<o:p></o:p></p>
<p class="MsoNormal">_tile_data tile;<o:p></o:p></p>
<p class="MsoNormal">And the LLVM IR may look like this.<o:p></o:p></p>
<p class="MsoNormal">@tile = dso_local local_unnamed_addr
global &lt;256 x i32&gt; zeroinitializer, align 64<o:p></o:p></p>
<p class="MsoNormal">For LLVM IR, it would be nice to have a new
type, x86_amxtile, that can be mapped to AMX registers.<o:p></o:p></p>
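<p class="MsoNormal">For reference, the 1024-byte figure is the largest palette 1 tile, 16 rows of 64 bytes, which is also where the 256 x i32 in the IR comes from. A small sketch that checks the arithmetic at compile time follows; it relies on the GCC/Clang vector_size extension used in the typedef above, and TILE_MAX_ROWS, TILE_MAX_COLSB, and tile_bytes_used are illustrative names only.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Same typedef as in the proposal (GCC/Clang vector extension). */
typedef int _tile_data __attribute__((__vector_size__(1024), __aligned__(64)));

/* Palette 1 limits: at most 16 rows of at most 64 bytes each. */
enum { TILE_MAX_ROWS = 16, TILE_MAX_COLSB = 64 };

_Static_assert(TILE_MAX_ROWS * TILE_MAX_COLSB == 1024,
               "largest tile is 1 KB");
_Static_assert(sizeof(_tile_data) == 1024,
               "vector type covers the largest tile");
_Static_assert(sizeof(_tile_data) / sizeof(int32_t) == 256,
               "matches <256 x i32> in the IR");

/* Bytes of the fixed-size vector that a rows x colsb shape actually uses. */
static inline int tile_bytes_used(int rows, int colsb)
{
    assert(rows >= 1 && rows <= TILE_MAX_ROWS);
    assert(colsb >= 1 && colsb <= TILE_MAX_COLSB);
    return rows * colsb;
}
```

<p class="MsoNormal">A smaller configured shape uses only part of the vector, which is why the shape has to travel separately from the data.</p>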
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo2">
<!--[if !supportLists]--><span style="mso-list:Ignore">2.<span
style="font:7.0pt &quot;Times New Roman&quot;">
</span></span><!--[endif]-->AMX Intrinsics. <o:p></o:p></p>
<p class="MsoNormal">The internal intrinsics are mapped 1:1
to AMX instructions. The parameters m, n, and k identify the
shape of the tile. The shape can be variable, but it
cannot exceed the size that the AMX hardware supports. The compiler
can deduce the shape of the tile from the AMX intrinsics.<o:p></o:p></p>
<p class="MsoNormal" style="text-indent:5.5pt">_tile_data
_tile_loadd_internal(char m, short n, const void *base,
int stride);<o:p></o:p></p>
<p class="MsoNormal">_tile_data _tile_dpbssd_internal(char
m, short n, short k, _tile_data dst, _tile_data src1,
_tile_data src2);<o:p></o:p></p>
<p class="MsoNormal">_tile_data _tile_dpbf16ps_internal(char
m, short n, short k, _tile_data dst, _tile_data src1,
_tile_data src2);<o:p></o:p></p>
<p class="MsoNormal">void _tile_stored_internal(char m,
short n, void *base, int stride, _tile_data tile);<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo2">
<!--[if !supportLists]--><span style="mso-list:Ignore">3.<span
style="font:7.0pt &quot;Times New Roman&quot;">
</span></span><!--[endif]-->User interfaces.<o:p></o:p></p>
<p class="MsoNormal">The tile shape and tile data are
combined into a struct in C. The shape of the
tile may only be initialized once. The user
interface looks like this.<o:p></o:p></p>
<p class="MsoNormal">#define __DEFAULT_FN_AMX \<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;__attribute__((__always_inline__,
__nodebug__, __target__("amx-int8")))<o:p></o:p></p>
<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<p class="MsoNormal">typedef struct __tile_str {<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;const char row;<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;const short col;<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;_tile_data tile;<o:p></o:p></p>
<p class="MsoNormal">}__tile;<o:p></o:p></p>
<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<p class="MsoNormal">__DEFAULT_FN_AMX<o:p></o:p></p>
<p class="MsoNormal">void __tile_loadd(__tile *dst, const
void *base, long stride) {<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;dst-&gt;tile =
_tile_loadd_internal(dst-&gt;row, dst-&gt;col, base,
stride);<o:p></o:p></p>
<p class="MsoNormal">}<o:p></o:p></p>
<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<p class="MsoNormal">__DEFAULT_FN_AMX<o:p></o:p></p>
<p class="MsoNormal">void __tile_dpbsud(__tile *dst,
__tile src1, __tile src2) {<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;dst-&gt;tile =
_tile_dpbssd_internal(src1.row, src2.col, src1.col,
dst-&gt;tile, src1.tile, src2.tile);<o:p></o:p></p>
<p class="MsoNormal">}<o:p></o:p></p>
<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<p class="MsoNormal">__DEFAULT_FN_AMX<o:p></o:p></p>
<p class="MsoNormal">void __tile_stored(void *base, long
stride, __tile src) {<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;_tile_stored_internal(src.row,
src.col, base, stride, src.tile);<o:p></o:p></p>
<p class="MsoNormal">}<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo2">
<!--[if !supportLists]--><span style="mso-list:Ignore">4.<span
style="font:7.0pt &quot;Times New Roman&quot;">
</span></span><!--[endif]-->Example code<o:p></o:p></p>
<p class="MsoNormal">The example shows how to use the user
interface in a function.
<o:p></o:p></p>
<p class="MsoNormal">void api(int cond, short row, short
col) {<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;__tile a = {row, col};<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;__tile b = {row, col};<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;__tile c = {row, col};<o:p></o:p></p>
<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;if (cond) {<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;&nbsp;&nbsp;__tile_loadd(&amp;a, buf,
STRIDE);<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;&nbsp;&nbsp;__tile_loadd(&amp;b, buf,
STRIDE);<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;&nbsp;&nbsp;__tile_loadd(&amp;c, buf,
STRIDE);<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;} else {<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;&nbsp;&nbsp;__tile_loadd(&amp;a, buf2,
STRIDE);<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;&nbsp;&nbsp;__tile_loadd(&amp;b, buf2,
STRIDE);<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;&nbsp;&nbsp;__tile_loadd(&amp;c, buf2,
STRIDE);<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;}<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;__tile_dpbsud(&amp;c, a, b);<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;__tile_stored(buf, STRIDE, c);<o:p></o:p></p>
<p class="MsoNormal">}<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo2">
<!--[if !supportLists]--><span style="mso-list:Ignore">5.<span
style="font:7.0pt &quot;Times New Roman&quot;">
</span></span><!--[endif]-->LLVM IR<o:p></o:p></p>
<p class="MsoNormal">The LLVM IR intrinsics take the row and
column information as input parameters, so that the
compiler can deduce the shape of the tile data. The remaining
parameters are what the AMX instructions require. This is the
LLVM IR corresponding to the example code.<o:p></o:p></p>
<p class="MsoNormal">define dso_local void @api(i32
%cond, i16 signext %row, i16 signext %col)
local_unnamed_addr #2 {<o:p></o:p></p>
<p class="MsoNormal">entry:<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;%tobool = icmp eq i32 %cond, 0<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;%sext = shl i16 %col, 8<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;%conv.i31 = ashr exact i16 %sext,
8<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;br i1 %tobool, label %if.else,
label %if.then<o:p></o:p></p>
<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<p class="MsoNormal">if.then: ; preds = %entry<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;%0 = tail call &lt;256 x i32&gt;
@llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8*
getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf,
i64 0, i64 0), i64 32) #3<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;%1 = tail call &lt;256 x i32&gt;
@llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8*
getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf,
i64 0, i64 0), i64 32) #3<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;%2 = tail call &lt;256 x i32&gt;
@llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8*
getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf,
i64 0, i64 0), i64 32) #3<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;br label %if.end<o:p></o:p></p>
<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<p class="MsoNormal">if.else: ; preds = %entry<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;%3 = tail call &lt;256 x i32&gt;
@llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8*
getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2,
i64 0, i64 0), i64 32) #3<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;%4 = tail call &lt;256 x i32&gt;
@llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8*
getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2,
i64 0, i64 0), i64 32) #3<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;%5 = tail call &lt;256 x i32&gt;
@llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8*
getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2,
i64 0, i64 0), i64 32) #3<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;br label %if.end<o:p></o:p></p>
<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<p class="MsoNormal">if.end: ; preds = %if.else, %if.then<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;%a.sroa.1186.0 = phi &lt;256 x
i32&gt; [ %3, %if.else ], [ %0, %if.then ]<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;%b.sroa.1068.0 = phi &lt;256 x
i32&gt; [ %4, %if.else ], [ %1, %if.then ]<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;%c.sroa.1149.0 = phi &lt;256 x
i32&gt; [ %5, %if.else ], [ %2, %if.then ]<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;%6 = tail call &lt;256 x i32&gt;
@llvm.x86.tdpbssd(i16 %row, i16 %conv.i31, i16 %conv.i31,
&lt;256 x i32&gt; %c.sroa.1149.0, &lt;256 x i32&gt;
%a.sroa.1186.0, &lt;256 x i32&gt; %b.sroa.1068.0) #3<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;tail call void
@llvm.x86.tilestored64(i16 %row, i16 %conv.i31, i8*
getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf,
i64 0, i64 0), i64 32, &lt;256 x i32&gt; %6) #3<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;ret void<o:p></o:p></p>
<p class="MsoNormal">}<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo2">
<!--[if !supportLists]--><span style="mso-list:Ignore">6.<span
style="font:7.0pt &quot;Times New Roman&quot;">
</span></span><!--[endif]-->Shape propagation<o:p></o:p></p>
<p class="MsoNormal">In an -O0 build, the front end generates
some general loads and stores of the tile vector. We
need to start from the AMX intrinsics and propagate the shape
information to the virtual tile registers. If an AMX
intrinsic uses the result of a load instruction, the shape is
propagated to the load and the load is transformed into a tile
load intrinsic. If a store instruction uses the result
of an AMX intrinsic, the shape is propagated to the store
instruction and the store is transformed into a tile store
intrinsic.<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo2">
<!--[if !supportLists]--><span style="mso-list:Ignore">7.<span
style="font:7.0pt &quot;Times New Roman&quot;">
</span></span><!--[endif]-->Machine IR<o:p></o:p></p>
<p class="MsoNormal">Since the AMX intrinsics take the row
and column as input parameters, we can create a pseudo
instruction corresponding to each of them. The AMX intrinsics are
lowered to pseudo AMX instructions, which have extra row
and column operands corresponding to the AMX intrinsic. The
real AMX instructions don't need the row and column
operands. The row and column information should be
configured by ldtilecfg before executing any AMX
instruction.<o:p></o:p></p>
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo2">
<!--[if !supportLists]--><span style="mso-list:Ignore">8.<span
style="font:7.0pt &quot;Times New Roman&quot;">
</span></span><!--[endif]-->Register allocation<o:p></o:p></p>
<p class="MsoNormal">The AMX register is special. It needs to be
configured before use, and the config instruction is
expensive. To avoid unnecessary tile configuration, we collect
as much tile shape information as possible and combine
it into one ldtilecfg instruction. The ldtilecfg
instruction should dominate any AMX instruction that
accesses a tile register. On the other hand, the ldtilecfg
should post-dominate the instructions that define the tile
shape. A tile register spill should avoid re-configuration
due to a different tile shape: the spilled register
should be reloaded into a register that shares the same
tile shape. Since tile register allocation is special and
it may allocate general virtual registers to configure the tile
registers, we can add a separate pass to do it before the
general register allocation pass. After register
allocation, the tile shape information is not needed
anymore, so we can transform the pseudo AMX instructions into
real AMX instructions by removing the row and column
operands.<o:p></o:p></p>
</blockquote>
<p>This seems complicated.<o:p></o:p></p>
<p>Reading through the documentation, there appears to be a
single global tile config for all tile registers at any
time.<o:p></o:p></p>
<p>Why not simply model this tile config as a designated
special register and the tile instructions as having an
implicit use of this register? That would seem to ensure
that the register allocator has all the constraints needed.
You'd need to teach it how to spill the special registers
with the appropriate instructions, but that seems a lot more
straightforward?<o:p></o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoListParagraph"
style="margin-left:.5in;text-indent:-.25in;mso-list:l0
level1 lfo2">
<!--[if !supportLists]--><span style="mso-list:Ignore">9.<span
style="font:7.0pt &quot;Times New Roman&quot;">
</span></span><!--[endif]-->Use recommendation <o:p></o:p></p>
<p class="MsoNormal">Because of the shape configuration issue, we
recommend that users define the tile shape at the entry of
the function and inline functions as much as
possible. The AMX instructions focus on computation
instead of storage, so using global variables for tile data is
not recommended.<o:p></o:p></p>
<p class="MsoNormal"><span
style="font-size:10.5pt;line-height:105%"> </span><o:p></o:p></p>
<p class="MsoNormal"><span
style="font-size:10.5pt;line-height:105%">Thanks</span><o:p></o:p></p>
<p class="MsoNormal"><span
style="font-size:10.5pt;line-height:105%">Yuanke</span><o:p></o:p></p>
<p class="MsoNormal"
style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<br>
<br>
<br>
<o:p></o:p></p>
<pre>_______________________________________________<o:p></o:p></pre>
<pre>LLVM Developers mailing list<o:p></o:p></pre>
<pre><a href="mailto:llvm-dev@lists.llvm.org" moz-do-not-send="true">llvm-dev@lists.llvm.org</a><o:p></o:p></pre>
<pre><a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" moz-do-not-send="true">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><o:p></o:p></pre>
</blockquote>
</blockquote>
</div>
<br>
</blockquote>
<pre class="moz-signature" cols="72">--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
</body>
</html>