<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
        {font-family:SimSun;
        panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
        {font-family:"\@SimSun";
        panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
        {font-family:"Segoe UI";
        panose-1:2 11 5 2 4 2 4 2 2 3;}
@font-face
        {font-family:Consolas;
        panose-1:2 11 6 9 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:#0563C1;
        text-decoration:underline;}
pre
        {mso-style-priority:99;
        mso-style-link:"HTML Preformatted Char";
        margin:0in;
        font-size:10.0pt;
        font-family:"Courier New";}
span.EmailStyle17
        {mso-style-type:personal-compose;
        font-family:"Calibri",sans-serif;
        color:windowtext;}
span.HTMLPreformattedChar
        {mso-style-name:"HTML Preformatted Char";
        mso-style-priority:99;
        mso-style-link:"HTML Preformatted";
        font-family:"Courier New";}
span.phui-tag-core
        {mso-style-name:phui-tag-core;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-family:"Calibri",sans-serif;}
@page WordSection1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
        {page:WordSection1;}
/* List Definitions */
@list l0
        {mso-list-id:1937517928;
        mso-list-type:hybrid;
        mso-list-template-ids:-418626020 67698705 67698713 67698715 67698703 67698713 67698715 67698703 67698713 67698715;}
@list l0:level1
        {mso-level-text:"%1\)";
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level2
        {mso-level-number-format:alpha-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level3
        {mso-level-number-format:roman-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:right;
        text-indent:-9.0pt;}
@list l0:level4
        {mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level5
        {mso-level-number-format:alpha-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level6
        {mso-level-number-format:roman-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:right;
        text-indent:-9.0pt;}
@list l0:level7
        {mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level8
        {mso-level-number-format:alpha-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level9
        {mso-level-number-format:roman-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:right;
        text-indent:-9.0pt;}
ol
        {margin-bottom:0in;}
ul
        {margin-bottom:0in;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72" style="word-wrap:break-word">
<div class="WordSection1">
<p style="mso-margin-top-alt:0in;margin-right:0in;margin-bottom:9.0pt;margin-left:0in;background:white">
<b><span style="font-size:12.0pt;font-family:"Segoe UI",sans-serif;color:black">[X86 AMX]O0: Support AMX fast register allocation<br>
</span></b><span style="font-size:12.0pt;font-family:"Segoe UI",sans-serif;color:black">The AMX programming model was discussed on llvm-dev<br>
(<a href="http://lists.llvm.org/pipermail/llvm-dev/2020-August/144302.html" target="_blank"><span style="color:#136CB2">http://lists.llvm.org/pipermail/llvm-dev/2020-August/144302.html</span></a>).<o:p></o:p></span></p>
<p style="mso-margin-top-alt:0in;margin-right:0in;margin-bottom:9.0pt;margin-left:0in;background:white">
<span style="font-size:12.0pt;font-family:"Segoe UI",sans-serif;color:black"><o:p> </o:p></span></p>
<p style="mso-margin-top-alt:0in;margin-right:0in;margin-bottom:9.0pt;margin-left:0in;background:white">
<span style="font-size:12.0pt;font-family:"Segoe UI",sans-serif;color:black">Some characteristics of building AMX code at the O0 level:<o:p></o:p></span></p>
<ol style="margin-top:0in" start="1" type="1">
<li style="color:black;margin-top:0in;margin-bottom:9.0pt;mso-list:l0 level1 lfo1;background:white">
<span style="font-size:12.0pt;font-family:"Segoe UI",sans-serif">The shapes of tiles are very hard to compare.<o:p></o:p></span></li><li style="color:black;margin-top:0in;margin-bottom:9.0pt;mso-list:l0 level1 lfo1;background:white">
<span style="font-size:12.0pt;font-family:"Segoe UI",sans-serif">The live ranges of tile registers are usually very short.<o:p></o:p></span></li><li style="color:black;margin-top:0in;margin-bottom:9.0pt;mso-list:l0 level1 lfo1;background:white">
<span style="font-size:12.0pt;font-family:"Segoe UI",sans-serif">AMX memory instructions (e.g. TILELOADD/TILESTORED) use an index register as the stride, which is troublesome for fast register allocation.<o:p></o:p></span></li></ol>
<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"Segoe UI",sans-serif;color:black">           (</span>Detail: when we spill tmm registers during fast register allocation, we need to generate TILESTORED/TILELOADD for the spilled tmm registers, just as we generate stores and loads for other spilled registers.<o:p></o:p></p>
<p class="MsoNormal">                For example:<o:p></o:p></p>
<p class="MsoNormal">                TILESTORED %stack.2, 1,  %16:gr64_nosp, 0, $noreg, killed $tmm0<o:p></o:p></p>
<p class="MsoNormal">                 …..<o:p></o:p></p>
<p class="MsoNormal">              $tmm0 = TILELOADD %stack.2, 1, %16:gr64_nosp, 0, $noreg :: (load 1024 from %stack.2)<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">               We need to make sure that a usable index register is available for the tile memory operation,<o:p></o:p></p>
<p class="MsoNormal">              and that this index register holds the stride<o:p></o:p></p>
<p class="MsoNormal">              (but the registers have already been allocated!)      <span style="font-size:12.0pt;font-family:"Segoe UI",sans-serif;color:black">
)<o:p></o:p></span></p>
<p style="mso-margin-top-alt:0in;margin-right:0in;margin-bottom:9.0pt;margin-left:0in;background:white">
<span style="font-size:12.0pt;font-family:"Segoe UI",sans-serif;color:black"><br>
<strong><span style="font-family:"Segoe UI",sans-serif">1></span></strong><br>
For customers, the O0 level usually means clang -O0 -S/-c (both the front end and the back end compile at O0):<br>
The tile data of an AMX intrinsic must be loaded before it is used and stored to memory right after a tile register is defined.<br>
Something like:<o:p></o:p></span></p>
<pre style="mso-line-height-alt:11.25pt;background:white rgba(71, 87, 120, 0.08);border-radius: 3px;white-space:pre-wrap;overflow:auto"><span style="font-size:12.0pt;font-family:Consolas;color:black">----------------------------------------------------------------------<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">%t1 = call x86_amx @llvm.x86.tileloadd64.internal(m, k, ...)<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">%t2 = call x86_amx @llvm.x86.tileloadd64.internal(k, n, ...)<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">%t3 = call x86_amx @llvm.x86.tileloadd64.internal(m, n, ...)<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">%td = call x86_amx @llvm.x86.tdpbssd.internal(m, n, k, t1, t2, t3)    // key amx intrinsic<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">call void @llvm.x86.tilestored64.internal(... td)<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">----------------------------------------------------------------------<o:p></o:p></span></pre>
<p style="mso-margin-top-alt:0in;margin-right:0in;margin-bottom:9.0pt;margin-left:0in;background:white;font-variant-ligatures: normal;font-variant-caps: normal;orphans: 2;widows: 2;-webkit-text-stroke-width: 0px;text-decoration-thickness: initial;text-decoration-style: initial;text-decoration-color: initial;word-spacing:0px">
<span style="font-size:12.0pt;font-family:"Segoe UI",sans-serif;color:black">Because the live range of a tile register is very short (from tileload to tilestore, so no spill can happen), we let fast register allocation allocate tile registers for them directly.<o:p></o:p></span></p>
<p style="mso-margin-top-alt:0in;margin-right:0in;margin-bottom:9.0pt;margin-left:0in;background:white;font-variant-ligatures: normal;font-variant-caps: normal;orphans: 2;widows: 2;-webkit-text-stroke-width: 0px;text-decoration-thickness: initial;text-decoration-style: initial;text-decoration-color: initial;word-spacing:0px">
<span style="font-size:12.0pt;font-family:"Segoe UI",sans-serif;color:black">As the AMX programming model above shows, each tile register must be configured by ldtilecfg before it is used.<br>
So we insert an ldtilecfg for every key AMX intrinsic. There are two reasons for doing so:<br>
1. We do not care much about performance at O0. 2. The shapes are very hard to compare at the O0 level.<br>
e.g.<o:p></o:p></span></p>
<pre style="mso-line-height-alt:11.25pt;background:white rgba(71, 87, 120, 0.08);border-radius: 3px;white-space:pre-wrap;overflow:auto"><span style="font-size:12.0pt;font-family:Consolas;color:black">----------------------------------------------------------------------<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">%cfgmem = alloca <16 x i32>, align 4<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">store <16 x i32> zeroinitializer, <16 x i32>* %cfgmem<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">call void @llvm.x86.ldtilecfg.internal(i8* %cfgmem)<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">---------------------------------------------------------------------<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">%t1 = call x86_amx @llvm.x86.tileloadd64.internal(m, k, ...)<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">%t2 = call x86_amx @llvm.x86.tileloadd64.internal(k, n, ...)<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">%t3 = call x86_amx @llvm.x86.tileloadd64.internal(m, n, ...)<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">%td = call x86_amx @llvm.x86.tdpbssd.internal(m, n, k, t1, t2, t3)    // key amx intrinsic<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">call void @llvm.x86.tilestored64.internal(... td)<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">-------------------------------------------------------------------------<o:p></o:p></span></pre>
<p style="mso-margin-top-alt:0in;margin-right:0in;margin-bottom:9.0pt;margin-left:0in;background:white;font-variant-ligatures: normal;font-variant-caps: normal;orphans: 2;widows: 2;-webkit-text-stroke-width: 0px;text-decoration-thickness: initial;text-decoration-style: initial;text-decoration-color: initial;word-spacing:0px">
<span style="font-size:12.0pt;font-family:"Segoe UI",sans-serif;color:black">But ldtilecfg needs the shapes of the tile registers to be written into its config memory, so we write the shapes before fast register allocation. (Doing it after register allocation is troublesome,
 because the registers that hold the shapes for the AMX intrinsics may no longer be live at the position of the write.) At this point, however, we do not know which physical tile register each virtual shape register belongs to, because register allocation has not run yet. So we simply
 write the shapes into the config memory in order:<br>
e.g.<o:p></o:p></span></p>
<pre style="mso-line-height-alt:11.25pt;background:white rgba(71, 87, 120, 0.08);border-radius: 3px;white-space:pre-wrap;overflow:auto"><span style="font-size:12.0pt;font-family:Consolas;color:black">----------------------------------------------------------------------<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">%cfgmem = alloca <16 x i32>, align 4                                 * allocate mem<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">store <16 x i32> zeroinitializer, <16 x i32>* %cfgmem                * zero init<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">...<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">//pre-config shape of %t1                                            *<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">store volatile i8 %m, i8* %amx.tmm.0.shape.row, align 1              *<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">store volatile i16 %k, i16* %amx.tmm.0.shape.col, align 2            * pre-config<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">// pre-config shape of %t2                                           * shapes<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">store volatile i8 %k, i8* %amx.tmm.1.shape.row, align 1              *<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">store volatile i16 %n, i16* %amx.tmm.1.shape.col, align 2            *<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">// pre-config shape of %t3, %td                                      *<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">            ….<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">call void @llvm.x86.ldtilecfg.internal(i8* %cfgmem)                  * tile config<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">-------------------------------------------------------------------------<o:p></o:p></span></pre>
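<p style="mso-margin-top-alt:0in;margin-right:0in;margin-bottom:9.0pt;margin-left:0in;background:white">
<span style="font-size:12.0pt;font-family:"Segoe UI",sans-serif;color:black">As a rough illustration of the in-order pre-configuration, the Python sketch below models the config memory as a 64-byte buffer. It assumes the AMX tile configuration layout in which the 16-bit column entries start at byte 16 and the 8-bit row entries start at byte 48. The helper name preconfig_shapes and the sample shape values are hypothetical and are not part of the patch.<o:p></o:p></span></p>
<pre style="mso-line-height-alt:11.25pt;background:white">
```python
CFG_SIZE = 64   # bytes in the config memory (the 16 x i32 alloca)
COL_BASE = 16   # 16-bit column entry of slot i is at COL_BASE + 2 * i
ROW_BASE = 48   # 8-bit row entry of slot i is at ROW_BASE + i


def preconfig_shapes(shapes):
    """Write (row, col) pairs into consecutive slots of a zeroed buffer.

    Hypothetical helper: slot i is only a provisional choice, because the
    physical tile register is unknown before register allocation.
    """
    cfg = bytearray(CFG_SIZE)  # zero init
    for slot, (row, col) in enumerate(shapes):
        cfg[ROW_BASE + slot] = row  # models the i8 store
        col_off = COL_BASE + 2 * slot
        cfg[col_off:col_off + 2] = col.to_bytes(2, "little")  # models the i16 store
    return cfg


# Made-up sample values m=8, k=16, n=32 for %t1 (m, k) and %t2 (k, n).
cfg = preconfig_shapes([(8, 16), (16, 32)])
print(list(cfg[48:50]), list(cfg[16:20]))
```
</pre>
<p style="mso-margin-top-alt:0in;margin-right:0in;margin-bottom:9.0pt;margin-left:0in;background:white">
<span style="font-size:12.0pt;font-family:"Segoe UI",sans-serif;color:black">The sketch only fills the shape slots; every other byte of the buffer stays zero, as after the zeroinitializer store.<o:p></o:p></span></p>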
<p style="mso-margin-top-alt:0in;margin-right:0in;margin-bottom:9.0pt;margin-left:0in;background:white;font-variant-ligatures: normal;font-variant-caps: normal;orphans: 2;widows: 2;-webkit-text-stroke-width: 0px;text-decoration-thickness: initial;text-decoration-style: initial;text-decoration-color: initial;word-spacing:0px">
<span style="font-size:12.0pt;font-family:"Segoe UI",sans-serif;color:black">These shape stores are then adjusted after fast register allocation.<br>
e.g.<br>
Suppose we wrote the first shape into %amx.tmm.0.shape.row (base + 48). If, after fast register allocation, we find that the first shape does not correspond to the first tile register (tmm0) but to the second one (tmm1), we redirect
 the store to %amx.tmm.1.shape.row (base + 48 + 1).<o:p></o:p></span></p>
<pre style="mso-line-height-alt:11.25pt;background:white rgba(71, 87, 120, 0.08);border-radius: 3px;white-space:pre-wrap;overflow:auto"><span style="font-size:12.0pt;font-family:Consolas;color:black">---------------------------------------------------------------------------<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">MOV8mi %stack.5, 1, $noreg, 49, $noreg, 8 :: (volatile store 1 into %ir.amx.tmm.0.shape.row)<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">MOV16mr %stack.5, 1, $noreg, 18, $noreg, renamable $cx :: (volatile store 2 into %ir.amx.tmm.0.shape.col)<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">     …<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">PLDTILECFGV killed renamable $rsi, 1, $noreg, 0, $noreg<o:p></o:p></span></pre>
<pre style="mso-line-height-alt:11.25pt;background:white"><span style="font-size:12.0pt;font-family:Consolas;color:black">--------------------------------------------------------------------------<o:p></o:p></span></pre>
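<p style="mso-margin-top-alt:0in;margin-right:0in;margin-bottom:9.0pt;margin-left:0in;background:white">
<span style="font-size:12.0pt;font-family:"Segoe UI",sans-serif;color:black">The adjustment can be pictured with the following Python sketch. It is only a model under stated assumptions, not the code in the patch: given the slot a shape was provisionally written to and the tile register chosen by fast register allocation, it recomputes the store displacements. The function names shape_offsets and adjust_store are hypothetical; the bases 48 (rows) and 16 (columns) are taken from the displacements in the MIR above.<o:p></o:p></span></p>
<pre style="mso-line-height-alt:11.25pt;background:white">
```python
COL_BASE = 16   # 16-bit column entries start here, 2 bytes per tile register
ROW_BASE = 48   # 8-bit row entries start here, 1 byte per tile register


def shape_offsets(tmm):
    """Return (row_offset, col_offset) in the config memory for tile register number tmm."""
    return ROW_BASE + tmm, COL_BASE + 2 * tmm


def adjust_store(provisional_slot, allocated_tmm):
    """Model of the post-allocation fix-up (hypothetical helper).

    Returns the displacements used before allocation and the ones that
    match the physical tile register picked by the allocator.
    """
    old = shape_offsets(provisional_slot)
    new = shape_offsets(allocated_tmm)
    return old, new


# The first shape was written to slot 0, but the allocator picked tmm1.
old, new = adjust_store(provisional_slot=0, allocated_tmm=1)
print(old, new)
```
</pre>
<p style="mso-margin-top-alt:0in;margin-right:0in;margin-bottom:9.0pt;margin-left:0in;background:white">
<span style="font-size:12.0pt;font-family:"Segoe UI",sans-serif;color:black">With the first shape provisionally in slot 0 and tmm1 allocated, the displacements move from 48 and 16 to 49 and 18, which are the values in the MOV8mi and MOV16mr instructions above.<o:p></o:p></span></p>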
<p style="mso-margin-top-alt:0in;margin-right:0in;margin-bottom:9.0pt;margin-left:0in;background:white;font-variant-ligatures: normal;font-variant-caps: normal;orphans: 2;widows: 2;-webkit-text-stroke-width: 0px;text-decoration-thickness: initial;text-decoration-style: initial;text-decoration-color: initial;word-spacing:0px">
<strong><span style="font-size:12.0pt;font-family:"Segoe UI",sans-serif;color:black">2></span></strong><span style="font-size:12.0pt;font-family:"Segoe UI",sans-serif;color:black"><br>
Customers usually use clang -O0 -S/-c (both the front end and the back end compile at O0).<br>
But LLVM developers often build the front end at O1/O2/…
 and the back end at O0 (e.g. clang -O2 -S -emit-llvm + llc -O0).<o:p></o:p></span></p>
<p style="margin:0in;background:white;font-variant-ligatures: normal;font-variant-caps: normal;orphans: 2;widows: 2;-webkit-text-stroke-width: 0px;text-decoration-thickness: initial;text-decoration-style: initial;text-decoration-color: initial;word-spacing:0px">
<span style="font-size:12.0pt;font-family:"Segoe UI",sans-serif;color:black">Since this is not the main way of building programs, and so that the algorithm above still works, I “volatilize” the tile data of each key AMX intrinsic in the pass “Lower AMX type for load/store”.
 This makes it behave as under clang -O0: all tile data of a key AMX intrinsic is loaded before it is used and stored to memory right after the tile register is defined. Because the back end builds at O0, we do not consider performance here, only correctness.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:12.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:12.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><b><span style="font-size:12.0pt">I first implemented the design at</span></b><span style="font-size:12.0pt">
<a href="https://reviews.llvm.org/D100026">https://reviews.llvm.org/D100026</a><o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:12.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:12.0pt">BR!<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:12.0pt">Thank you!<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:12.0pt">Xiang<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:12.0pt"><o:p> </o:p></span></p>
</div>
</body>
</html>