<html><head>

<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">

  </head>

  <body>

    <p><br>

    </p>

    <div class="moz-cite-prefix">On 8/20/20 2:47 PM, Topper, Craig

      wrote:<br>

    </div>

    <blockquote type="cite" cite="mid:MWHPR11MB0046DC94CD931CD57620FD42935A0@MWHPR11MB0046.namprd11.prod.outlook.com">

      <meta name="Generator" content="Microsoft Word 15 (filtered

        medium)">

      <style><!--

/* Font Definitions */

@font-face

        {font-family:"Cambria Math";

        panose-1:2 4 5 3 5 4 6 3 2 4;}

@font-face

        {font-family:Calibri;

        panose-1:2 15 5 2 2 2 4 3 2 4;}

/* Style Definitions */

p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin-top:0in;

        margin-right:0in;

        margin-bottom:8.0pt;

        margin-left:0in;

        line-height:105%;

        font-size:11.0pt;

        font-family:"Calibri",sans-serif;}

a:link, span.MsoHyperlink

        {mso-style-priority:99;

        color:#0563C1;

        text-decoration:underline;}

pre

        {mso-style-priority:99;

        mso-style-link:"HTML Preformatted Char";

        margin:0in;

        margin-bottom:.0001pt;

        font-size:10.0pt;

        font-family:"Courier New";}

p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph

        {mso-style-priority:34;

        margin-top:0in;

        margin-right:0in;

        margin-bottom:8.0pt;

        margin-left:0in;

        text-indent:21.0pt;

        line-height:105%;

        font-size:11.0pt;

        font-family:"Calibri",sans-serif;}

span.HTMLPreformattedChar

        {mso-style-name:"HTML Preformatted Char";

        mso-style-priority:99;

        mso-style-link:"HTML Preformatted";

        font-family:"Courier New";}

span.EmailStyle23

        {mso-style-type:personal-reply;

        font-family:"Calibri",sans-serif;

        color:windowtext;}

.MsoChpDefault

        {mso-style-type:export-only;

        font-size:10.0pt;}

@page WordSection1

        {size:8.5in 11.0in;

        margin:1.0in 1.0in 1.0in 1.0in;}

div.WordSection1

        {page:WordSection1;}

ol

        {margin-bottom:0in;}

ul

        {margin-bottom:0in;}</style>

      <div class="WordSection1">

        <p class="MsoNormal">I think I’m still missing something here.

          The configuration is per tile. The multiply instructions take

          an MxK tile, multiply it by a KxN tile, and accumulate into

          an MxN tile. So the configuration needs to know how many of

          each size of tile it needs to avoid a spill. Wouldn’t the

          register allocator then need to know which physical tiles have

          been configured to which sizes so that it only chooses those

          tiles for an operand that needs that size?</p>
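        <p class="MsoNormal">To make the shape constraint concrete, here is a minimal Python sketch of an MxK by KxN multiply that accumulates into an MxN tile. It is a toy model only: tiles are plain nested lists, the name <code>tile_matmul_accumulate</code> is made up for illustration, and the element packing that real AMX instructions use is ignored. The 16 limit is the maximum dimension mentioned elsewhere in this thread.</p>
<pre>
```python
MAX_DIM = 16  # largest tile dimension discussed in this thread


def tile_matmul_accumulate(acc, a, b):
    """Return acc + a @ b for nested-list tiles, checking the shapes first."""
    m, k = len(a), len(a[0])
    k2, n = len(b), len(b[0])
    if k != k2:
        raise ValueError("inner dimensions differ")
    if (len(acc), len(acc[0])) != (m, n):
        raise ValueError("accumulator must be MxN")
    if max(m, k, n) > MAX_DIM:
        raise ValueError("tile dimension exceeds limit")
    return [
        [acc[i][j] + sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]


a = [[1, 2, 3], [4, 5, 6]]      # 2x3 tile (M=2, K=3)
b = [[1, 0], [0, 1], [1, 1]]    # 3x2 tile (K=3, N=2)
acc = [[10, 10], [10, 10]]      # 2x2 accumulator (M=2, N=2)
result = tile_matmul_accumulate(acc, a, b)
```
</pre>
        <p class="MsoNormal">Three different shapes take part in a single multiply here, which is why the configuration has to account for how many tiles of each shape are needed.</p>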

      </div>

    </blockquote>

    <p><br>

    </p>

    <p>Yes, I think so. But it will know that, because that information is

      essentially encoded in the virtual register classes. I certainly

      could be missing something. It seems like you first figure that

      out, and then you assign virtual tile registers corresponding to

      the correct tile sizes. Perhaps this comes down to what you mean

      by "avoid a spill." We still might spill, and I assume that the

      infrastructure always needs to deal with that. We should continue

      to do instruction scheduling in order to minimize register

      pressure. Once we assign the right virtual register classes to the

      AMX instructions, shouldn't this automatically happen? If we do

      spill, since none of the original live ranges cross the ldtilecfg,

      then there shouldn't be any fundamental issue with using a regular

      load/store spill implementation.</p>
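    <p>As a toy illustration of that point, the sketch below runs a shape-restricted linear scan. It is only a sketch under stated assumptions: the shape of every physical tile in the region is taken as already fixed, and the function name, tile names and interval format are made up here rather than taken from LLVM. A virtual register is only ever given a physical tile of its own shape, and anything that does not fit is marked for an ordinary spill.</p>
<pre>
```python
def allocate_tiles(intervals, tile_shapes):
    """Toy linear scan over (vreg, shape, start, end) live intervals.

    tile_shapes maps a physical tile name to its configured shape.
    Returns (assignment, spilled); a vreg is only ever assigned a
    physical tile whose configured shape equals the vreg's shape.
    """
    assignment, spilled = {}, []
    busy_until = {tile: -1 for tile in tile_shapes}  # last occupied point
    for vreg, shape, start, end in sorted(intervals, key=lambda iv: iv[2]):
        candidates = [
            tile for tile, tile_shape in tile_shapes.items()
            if tile_shape == shape and start > busy_until[tile]
        ]
        if candidates:
            tile = candidates[0]
            assignment[vreg] = tile
            busy_until[tile] = end
        else:
            spilled.append(vreg)  # would become a regular tile store/load
    return assignment, spilled


# Hypothetical region: two 2x4 tiles and one 4x2 tile are configured.
tile_shapes = {"tmm0": (2, 4), "tmm1": (4, 2), "tmm2": (2, 4)}
intervals = [
    ("%a", (2, 4), 0, 5),
    ("%b", (4, 2), 1, 6),
    ("%c", (2, 4), 2, 7),
    ("%d", (2, 4), 3, 4),
]
assignment, spilled = allocate_tiles(intervals, tile_shapes)
```
</pre>
    <p>In this made-up region the fourth value needs a third 2x4 tile while both are occupied, so it is the one that spills; it is never placed in a tile of the wrong shape.</p>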

    <p>I'm definitely not an expert in this instruction set, so I may

      just not understand some aspect of this. If there's something I'm

      overlooking, a little example would be helpful.<br>

    </p>

    <p>Thanks again,</p>

    <p>Hal</p>

    <p><br>

    </p>

    <blockquote type="cite"

cite="mid:MWHPR11MB0046DC94CD931CD57620FD42935A0@MWHPR11MB0046.namprd11.prod.outlook.com">

      <div class="WordSection1">

        <p class="MsoNormal"><o:p></o:p></p>

        <p class="MsoNormal">~Craig<o:p></o:p></p>

        <div>

          <div style="border:none;border-top:solid #E1E1E1

            1.0pt;padding:3.0pt 0in 0in 0in">

            <p class="MsoNormal"

              style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">

              <b>From:</b> Hal Finkel <a class="moz-txt-link-rfc2396E" href="mailto:hfinkel@anl.gov">&lt;hfinkel@anl.gov&gt;</a> <br>
              <b>Sent:</b> Thursday, August 20, 2020 12:35 PM<br>
              <b>To:</b> Topper, Craig <a class="moz-txt-link-rfc2396E" href="mailto:craig.topper@intel.com">&lt;craig.topper@intel.com&gt;</a>;
              Kaylor, Andrew <a class="moz-txt-link-rfc2396E" href="mailto:andrew.kaylor@intel.com">&lt;andrew.kaylor@intel.com&gt;</a>; Luo,
              Yuanke <a class="moz-txt-link-rfc2396E" href="mailto:yuanke.luo@intel.com">&lt;yuanke.luo@intel.com&gt;</a>; Philip Reames
              <a class="moz-txt-link-rfc2396E" href="mailto:listmail@philipreames.com">&lt;listmail@philipreames.com&gt;</a>;
              <a class="moz-txt-link-abbreviated" href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>; <a class="moz-txt-link-abbreviated" href="mailto:florian_hahn@apple.com">florian_hahn@apple.com</a>; Lu,
              Hongjiu <a class="moz-txt-link-rfc2396E" href="mailto:hongjiu.lu@intel.com">&lt;hongjiu.lu@intel.com&gt;</a><br>

              <b>Subject:</b> Re: [llvm-dev] Intel AMX programming model

              discussion.<o:p></o:p></p>

          </div>

        </div>

        <p class="MsoNormal"><o:p> </o:p></p>

        <p><o:p> </o:p></p>

        <div>

          <p class="MsoNormal">On 8/19/20 3:09 PM, Topper, Craig wrote:<o:p></o:p></p>

        </div>

        <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

          <p class="MsoNormal">The width and height can be runtime

            values that we would just copy into the 64-byte configuration

            block we pass to ldtilecfg. So the code doesn’t need to be

            multiversioned. The user code would also use those values to

            update pointers in the loops they write using the tiles. If

            we can’t determine that two tiles were defined with the same

            width and height, we need to assume the shape is different

            and try to avoid ever giving the same tile.<o:p></o:p></p>
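          <p class="MsoNormal">For illustration, here is a small Python sketch of copying run-time shapes into such a 64-byte block. The field offsets are an assumption based on Intel's published palette 1 layout (byte 0 is the palette, byte 1 the start row, bytes 16 to 31 hold eight 16-bit bytes-per-row values, bytes 48 to 55 hold eight 8-bit row counts) and should be checked against the architecture manual; the helper name <code>build_tile_config</code> is made up.</p>
<pre>
```python
CONFIG_SIZE = 64      # bytes in the block handed to ldtilecfg
MAX_TILES = 8         # tiles described by the assumed layout
MAX_ROWS = 16
MAX_ROW_BYTES = 64


def build_tile_config(shapes, palette=1):
    """Pack (rows, row_bytes) pairs into a 64-byte block (assumed layout)."""
    if len(shapes) > MAX_TILES:
        raise ValueError("too many tiles for one configuration")
    cfg = bytearray(CONFIG_SIZE)
    cfg[0] = palette  # byte 1 (start row) and reserved bytes stay zero
    for index, (rows, row_bytes) in enumerate(shapes):
        if not (0 <= rows <= MAX_ROWS and 0 <= row_bytes <= MAX_ROW_BYTES):
            raise ValueError("tile shape out of range")
        offset = 16 + 2 * index
        cfg[offset:offset + 2] = row_bytes.to_bytes(2, "little")
        cfg[48 + index] = rows
    return bytes(cfg)


# rows and row_bytes stand in for values that are only known at run time
rows, row_bytes = 8, 32
config = build_tile_config([(rows, row_bytes), (rows, row_bytes), (16, 64)])
```
</pre>
          <p class="MsoNormal">Because the shapes are ordinary values written into the block, the same code path serves every shape, which is why no multiversioning is needed.</p>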

          <p class="MsoNormal">Hal, with your suggestion, would the
            assignment of physical registers to register classes be
            defined dynamically before register allocation?<o:p></o:p></p>

        </blockquote>

        <p><o:p> </o:p></p>

        <p>Here's my thought:<o:p></o:p></p>

        <p>First, you have a set of intrinsics that take tile values

          along with tile configuration parameters (which, presently,

          seem just to be the sizes). These get lowered into

          pseudo-instructions that do the same. Thus, you have some

          register class that represents these arbitrarily-sized tile

          registers that you'll assign to these pseudo-instruction

          operands (i.e., they take virtual tile registers right after

          instruction selection). You might use the 16x16 tile register

          class for this purpose, but it shouldn't really matter.<o:p></o:p></p>

        <p>Second, you run this configuration-placement pass. This pass

          looks at all of the AMX pseudo-instructions and identifies

          regions in which the pseudo-instructions use the same

          configuration parameters (i.e., the same SSA values and/or

          constants). This pass might reorder the pseudo-instructions

          when legal in order to form larger regions. Then it places the

          ldtilecfg at the start of each region (in some common

          dominating position). ldtilecfg implicitly defines all of the

          tile registers in every concrete class of tile registers (all

          256 of them, or whatever). The pseudo-instructions are

          replaced by real MI instructions taking a tile register class

          appropriate for the configuration (which will default to the

          16x16 class for cases where the configuration is not a

          compile-time-known constant). When the configuration is a

          known constant, the instructions take operands with a register

          class appropriate for that configuration (e.g., 1x1, 4x4).<o:p></o:p></p>

        <p>Third, the rest of the framework runs as usual. Tile

          registers from the appropriate class are allocated by the

          register allocator. No live range of any virtual tile register

          can pass through the ldtilecfg (because it defines them all),

          but that's okay: none of the live ranges will, by construction (the

          configuration-placement pass ensures this).<o:p></o:p></p>
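        <p>The second step can be sketched roughly as follows. This is a deliberately simplified Python model, not LLVM code: each pseudo-instruction's (rows, cols) pair is treated as its whole configuration, no reordering or dominance analysis is done, and the opcode and register-class names (<code>TILE4x4</code>, <code>TILE16x16</code>) are hypothetical.</p>
<pre>
```python
def place_configs(pseudo_instrs):
    """Insert an 'ldtilecfg' before each run of same-configuration pseudos.

    pseudo_instrs is a list of (opcode, rows, cols); rows and cols are
    either ints (compile-time constants) or strings naming SSA values.
    Each pseudo is rewritten to (opcode, register_class).
    """
    output, current = [], None
    for opcode, rows, cols in pseudo_instrs:
        config = (rows, cols)
        if config != current:
            # A new region starts here; ldtilecfg redefines every tile.
            output.append(("ldtilecfg", config))
            current = config
        known = isinstance(rows, int) and isinstance(cols, int)
        # Unknown shapes fall back to the largest (16x16) class.
        reg_class = f"TILE{rows}x{cols}" if known else "TILE16x16"
        output.append((opcode, reg_class))
    return output


pseudos = [
    ("tileload", 4, 4),
    ("tdp", 4, 4),
    ("tileload", "%row", "%col"),
    ("tilestore", "%row", "%col"),
]
lowered = place_configs(pseudos)
```
</pre>
        <p>Every tile value in this model is created and consumed between two consecutive ldtilecfg entries, which is the property the third step relies on.</p>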

        <p>Thanks again,<o:p></o:p></p>

        <p>Hal<o:p></o:p></p>

        <p><o:p> </o:p></p>

        <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

          <p class="MsoNormal"> <o:p></o:p></p>

          <div>

            <div style="border:none;border-top:solid #E1E1E1

              1.0pt;padding:3.0pt 0in 0in 0in">

              <p class="MsoNormal"

                style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">

                <b>From:</b> Hal Finkel <a
                  href="mailto:hfinkel@anl.gov" moz-do-not-send="true">&lt;hfinkel@anl.gov&gt;</a>
                <br>
                <b>Sent:</b> Wednesday, August 19, 2020 12:52 PM<br>
                <b>To:</b> Kaylor, Andrew <a
                  href="mailto:andrew.kaylor@intel.com"
                  moz-do-not-send="true">&lt;andrew.kaylor@intel.com&gt;</a>;
                Luo, Yuanke
                <a href="mailto:yuanke.luo@intel.com"
                  moz-do-not-send="true">&lt;yuanke.luo@intel.com&gt;</a>;
                Philip Reames <a
                  href="mailto:listmail@philipreames.com"
                  moz-do-not-send="true">&lt;listmail@philipreames.com&gt;</a>; <a
                  href="mailto:llvm-dev@lists.llvm.org"
                  moz-do-not-send="true">llvm-dev@lists.llvm.org</a>;
                <a href="mailto:florian_hahn@apple.com"
                  moz-do-not-send="true">florian_hahn@apple.com</a>;
                Topper, Craig
                <a href="mailto:craig.topper@intel.com"
                  moz-do-not-send="true">&lt;craig.topper@intel.com&gt;</a>;
                Lu, Hongjiu
                <a href="mailto:hongjiu.lu@intel.com"
                  moz-do-not-send="true">&lt;hongjiu.lu@intel.com&gt;</a><br>

                <b>Subject:</b> Re: [llvm-dev] Intel AMX programming

                model discussion.<o:p></o:p></p>

            </div>

          </div>

          <p class="MsoNormal"> <o:p></o:p></p>

          <p> <o:p></o:p></p>

          <div>

            <p class="MsoNormal">On 8/19/20 10:24 AM, Kaylor, Andrew

              wrote:<o:p></o:p></p>

          </div>

          <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

            <p>&gt; When the tile shape is unknown at compile time, how

              do you plan to do the register allocation of the tiles? My

              question is: do you do the allocation for this case in the

              same way as you would if you knew the size was 16x16

              (i.e., conservatively assume the largest size)?<o:p></o:p></p>

            <p class="MsoNormal">I think what will happen is that the

              registers are allocated based on a number of runtime

              values that are assumed to be different from one another

              but less than or equal to 16. So, for example, we’ll

              allocate registers for MxN tiles, NxM tiles and MxM tiles

              without knowing what M and N are. Then at runtime the

              values of these variables will be used to create the

              actual tile configuration. The instructions that need to

              know the shape take these runtime values as operands.<o:p></o:p></p>

          </blockquote>

          <p> <o:p></o:p></p>

          <p>So you're going to multiversion the code?<o:p></o:p></p>

          <p>In any case, my point is that you probably don't need a

            custom register allocator. If you just define the tile

            registers and make sure that the ldtilecfgs implicitly
            define them all, then the regular infrastructure likely

            works. You'll have a bunch of register classes, but that's

            not necessarily a problem. I recommend trying this, and let

            us know what you discover, before we go down the road of a

            new, dedicated allocator just for these registers.<o:p></o:p></p>

          <p> -Hal<o:p></o:p></p>

          <p> <o:p></o:p></p>

          <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

            <p class="MsoNormal">There may be some artifacts coming from

              the front end that conservatively assume a 16x16 tile, but

              I think those generally go away in SROA or later

              specialized passes. Yuanke can confirm or correct my

              understanding of this.<o:p></o:p></p>

            <p class="MsoNormal"> <o:p></o:p></p>

            <div>

              <div style="border:none;border-top:solid #E1E1E1

                1.0pt;padding:3.0pt 0in 0in 0in">

                <p class="MsoNormal"

                  style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">

                  <b>From:</b> Hal Finkel <a
                    href="mailto:hfinkel@anl.gov" moz-do-not-send="true">&lt;hfinkel@anl.gov&gt;</a>
                  <br>
                  <b>Sent:</b> Wednesday, August 19, 2020 5:14 AM<br>
                  <b>To:</b> Luo, Yuanke <a
                    href="mailto:yuanke.luo@intel.com"
                    moz-do-not-send="true">&lt;yuanke.luo@intel.com&gt;</a>;
                  Kaylor, Andrew
                  <a href="mailto:andrew.kaylor@intel.com"
                    moz-do-not-send="true">&lt;andrew.kaylor@intel.com&gt;</a>;
                  Philip Reames
                  <a href="mailto:listmail@philipreames.com"
                    moz-do-not-send="true">&lt;listmail@philipreames.com&gt;</a>;
                  <a href="mailto:llvm-dev@lists.llvm.org"
                    moz-do-not-send="true">llvm-dev@lists.llvm.org</a>; <a
                    href="mailto:florian_hahn@apple.com"
                    moz-do-not-send="true">florian_hahn@apple.com</a>;
                  Topper, Craig
                  <a href="mailto:craig.topper@intel.com"
                    moz-do-not-send="true">&lt;craig.topper@intel.com&gt;</a>;
                  Lu, Hongjiu
                  <a href="mailto:hongjiu.lu@intel.com"
                    moz-do-not-send="true">&lt;hongjiu.lu@intel.com&gt;</a><br>

                  <b>Subject:</b> Re: [llvm-dev] Intel AMX programming

                  model discussion.<o:p></o:p></p>

              </div>

            </div>

            <p class="MsoNormal"> <o:p></o:p></p>

            <p> <o:p></o:p></p>

            <div>

              <p class="MsoNormal">On 8/19/20 5:34 AM, Luo, Yuanke

                wrote:<o:p></o:p></p>

            </div>

            <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

              <p class="MsoNormal">Having 256 register classes is not a
                problem. It just seems like a lot of register classes to me.<o:p></o:p></p>
              <p class="MsoNormal">We don’t assume the shape of each
                physical register is 16x16; it is defined by the user. By
                variable shape, I mean the shape is known only at run time
                and is unknown at compile time. Take the code below as
                an example: %row and %col are variables rather than
                constants. The compiler recognizes llvm.x86.tileloadd64 and
                deduces that the shape of %0 is %row x %col.<o:p></o:p></p>
              <p class="MsoNormal">%0 = tail call &lt;256 x i32&gt;
                @llvm.x86.tileloadd64(i16 %row, i16 %col, i8*
                getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf,
                i64 0, i64 0), i64 32)<o:p></o:p></p>

            </blockquote>

            <p> <o:p></o:p></p>

            <p>When the tile shape is unknown at compile time, how do

              you plan to do the register allocation of the tiles? My

              question is: do you do the allocation for this case in the

              same way as you would if you knew the size was 16x16

              (i.e., conservatively assume the largest size)?<o:p></o:p></p>

            <p>Thanks again,<o:p></o:p></p>

            <p>Hal<o:p></o:p></p>

            <p> <o:p></o:p></p>

            <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

              <p class="MsoNormal"> <o:p></o:p></p>

              <div>

                <div style="border:none;border-top:solid #E1E1E1

                  1.0pt;padding:3.0pt 0in 0in 0in">

                  <p class="MsoNormal"

                    style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">

                    <b>From:</b> Hal Finkel <a
                      href="mailto:hfinkel@anl.gov"
                      moz-do-not-send="true">&lt;hfinkel@anl.gov&gt;</a>
                    <br>
                    <b>Sent:</b> Wednesday, August 19, 2020 4:58 PM<br>
                    <b>To:</b> Luo, Yuanke <a
                      href="mailto:yuanke.luo@intel.com"
                      moz-do-not-send="true">&lt;yuanke.luo@intel.com&gt;</a>;
                    Kaylor, Andrew
                    <a href="mailto:andrew.kaylor@intel.com"
                      moz-do-not-send="true">&lt;andrew.kaylor@intel.com&gt;</a>;
                    Philip Reames
                    <a href="mailto:listmail@philipreames.com"
                      moz-do-not-send="true">&lt;listmail@philipreames.com&gt;</a>;
                    <a href="mailto:llvm-dev@lists.llvm.org"
                      moz-do-not-send="true">llvm-dev@lists.llvm.org</a>; <a
                      href="mailto:florian_hahn@apple.com"
                      moz-do-not-send="true">florian_hahn@apple.com</a>;
                    Topper, Craig
                    <a href="mailto:craig.topper@intel.com"
                      moz-do-not-send="true">&lt;craig.topper@intel.com&gt;</a>;
                    Lu, Hongjiu
                    <a href="mailto:hongjiu.lu@intel.com"
                      moz-do-not-send="true">&lt;hongjiu.lu@intel.com&gt;</a><br>

                    <b>Subject:</b> Re: [llvm-dev] Intel AMX programming

                    model discussion.<o:p></o:p></p>

                </div>

              </div>

              <p class="MsoNormal"> <o:p></o:p></p>

              <p> <o:p></o:p></p>

              <div>

                <p class="MsoNormal">On 8/19/20 2:21 AM, Luo, Yuanke

                  wrote:<o:p></o:p></p>

              </div>

              <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

                <p class="MsoNormal"> <o:p></o:p></p>

                <p class="MsoNormal">Hi Hal,<o:p></o:p></p>

                <p class="MsoNormal">There are 3 aspects to be solved. <o:p></o:p></p>
                <p class="MsoListParagraph"
                  style="margin-left:.5in;text-indent:-.25in;mso-list:l0
                  level1 lfo2">
                  <!--[if !supportLists]--><span style="mso-list:Ignore">1.<span
                      style="font:7.0pt &quot;Times New Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                    </span></span><!--[endif]-->The HW supports a maximum
                  shape of 16x16, so there are many register classes, from
                  1x1 to 16x16. We need 256 register classes.
                  <o:p></o:p></p>
                <p class="MsoListParagraph"
                  style="margin-left:.5in;text-indent:-.25in;mso-list:l0
                  level1 lfo2">
                  <!--[if !supportLists]--><span style="mso-list:Ignore">2.<span
                      style="font:7.0pt &quot;Times New Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                    </span></span><!--[endif]-->We want to support
                  variable shapes, so the compiler doesn’t know which
                  register class fits a tile shape, as the shape is only
                  known at run time.<o:p></o:p></p>
                <p class="MsoListParagraph"
                  style="margin-left:.5in;text-indent:-.25in;mso-list:l0
                  level1 lfo2">
                  <!--[if !supportLists]--><span style="mso-list:Ignore">3.<span
                      style="font:7.0pt &quot;Times New Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                    </span></span><!--[endif]-->The tile configuration
                  configures the physical tile registers, so we need to
                  allocate registers first; only then do we know the shape
                  of each physical tile register and can configure it.<o:p></o:p></p>
                <p class="MsoNormal">I think your suggestion helps
                  reduce the complexity if we only support fixed
                  (constant) tile shapes.<o:p></o:p></p>

                <p class="MsoNormal">-Yuanke<o:p></o:p></p>

              </blockquote>

              <p> <o:p></o:p></p>

              <p>Thanks, Yuanke.<o:p></o:p></p>

              <p>It's not clear to me that having 256 register classes

                is, in itself, a problem. Is it?<o:p></o:p></p>

              <p>What does it mean to support variable-shape tiles in

                this context? Do you do something other than

                conservatively assume that they are 16x16 for

                register-allocation purposes?<o:p></o:p></p>

              <p> -Hal<o:p></o:p></p>

              <p> <o:p></o:p></p>

              <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

                <p class="MsoNormal"> <o:p></o:p></p>

                <div>

                  <div style="border:none;border-top:solid #E1E1E1

                    1.0pt;padding:3.0pt 0in 0in 0in">

                    <p class="MsoNormal"

                      style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">

                      <b>From:</b> Hal Finkel <a
                        href="mailto:hfinkel@anl.gov"
                        moz-do-not-send="true">&lt;hfinkel@anl.gov&gt;</a>
                      <br>
                      <b>Sent:</b> Wednesday, August 19, 2020 8:20 AM<br>
                      <b>To:</b> Kaylor, Andrew <a
                        href="mailto:andrew.kaylor@intel.com"
                        moz-do-not-send="true">&lt;andrew.kaylor@intel.com&gt;</a>;
                      Philip Reames
                      <a href="mailto:listmail@philipreames.com"
                        moz-do-not-send="true">&lt;listmail@philipreames.com&gt;</a>;
                      Luo, Yuanke
                      <a href="mailto:yuanke.luo@intel.com"
                        moz-do-not-send="true">&lt;yuanke.luo@intel.com&gt;</a>;
                      <a href="mailto:llvm-dev@lists.llvm.org"
                        moz-do-not-send="true">llvm-dev@lists.llvm.org</a>; <a
                        href="mailto:florian_hahn@apple.com"
                        moz-do-not-send="true">florian_hahn@apple.com</a>;
                      Topper, Craig
                      <a href="mailto:craig.topper@intel.com"
                        moz-do-not-send="true">&lt;craig.topper@intel.com&gt;</a>;
                      Lu, Hongjiu
                      <a href="mailto:hongjiu.lu@intel.com"
                        moz-do-not-send="true">&lt;hongjiu.lu@intel.com&gt;</a><br>

                      <b>Subject:</b> Re: [llvm-dev] Intel AMX

                      programming model discussion.<o:p></o:p></p>

                  </div>

                </div>

                <p class="MsoNormal"> <o:p></o:p></p>

                <p>Hi, Andy,<o:p></o:p></p>

                <p>I don't quite understand everything that's going on

                  here. Could we model this as:<o:p></o:p></p>

                <p> 1. Define a collection of register classes, one for

                  2x4 tiles, one for 4x2 tiles, etc. each populated with

                  a set of tile registers. Registers can have aliasing

                  relationships (instead of worrying about any kind of

                  subregister/superregister relationships -- these won't

                  be useful anyway).<o:p></o:p></p>

                <p> 2. Define the tile-configuration instructions so

                  that they implicitly define all of the registers in

                  all of the classes.<o:p></o:p></p>

                <p>Then you would still need to pre-schedule the tile

                  operations as you've described, and collect the

                  configuration information in order to add the

                  ldtilecfgs, but the regular register allocator can

                  handle the allocation itself in the usual way. What do

                  you think?<o:p></o:p></p>

                <p> -Hal<o:p></o:p></p>

                <div>

                  <p class="MsoNormal">On 8/18/20 6:58 PM, Kaylor,

                    Andrew via llvm-dev wrote:<o:p></o:p></p>

                </div>

                <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

                  <p class="MsoNormal"

                    style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">

                    The AMX registers are complicated. The single

                    configuration register (which is mostly used

                    implicitly, similar to MXCSR for floating point)

                    controls the shape of all the tile registers, and if

                    you change the tile configuration every single tile

                    register is cleared. In practice, if we have to

                    change the configuration while any of the tile

                    registers are live, performance is going to be

                    terrible. We need to handle this case for

                    correctness, but users of this programming interface

                    will need to have enough awareness of the

                    performance issues and the hardware details to

                    prevent this. We’ll also want a diagnostic that lets

                    the user know when this has happened.<o:p></o:p></p>

                  <p class="MsoNormal"

                    style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">

                     <o:p></o:p></p>

                  <p class="MsoNormal"

                    style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">

                    When the tile configuration is set, the shape of

                    each tile is locked in, so the individual tile

                    registers aren’t interchangeable at that point. If a

                    function needs 2x4 tiles, 4x2 tiles, and 4x4 tiles,

                    the configuration needs to be set with this in mind.

                    The shape isn’t explicit in every instruction and

                    intrinsic. It must be deduced. And again, we’ll need

                    a way to tell the user when efficient allocation

                    can’t be done. In practice, I don’t expect any

                    function to be using more than three tile shapes.<o:p></o:p></p>

                  <p class="MsoNormal"

                    style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">

                     <o:p></o:p></p>

                  <p class="MsoNormal"

                    style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">

                    The implication of all this is that I don’t think

                    the greedy register allocator is well suited to

                    figure all of this out. We need a special pass to

                    pre-allocate these registers. If the function is

                    written in a way that makes good performance

                    possible, it should be a relatively simple task to

                    allocate everything with minimal spilling. If it

                    isn’t possible to get good performance, we don’t

                    need to do anything especially clever. We can just

                    do something straightforward that is correct and let

                    the user know that they aren’t going to be happy

                    with the results.<o:p></o:p></p>

                  <p class="MsoNormal"> <o:p></o:p></p>

                  <p class="MsoNormal">-Andy<o:p></o:p></p>

                  <p class="MsoNormal"> <o:p></o:p></p>

                  <div>

                    <div style="border:none;border-top:solid #E1E1E1

                      1.0pt;padding:3.0pt 0in 0in 0in">

                      <p class="MsoNormal"

                        style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">

                        <b>From:</b> Philip Reames <a
                          href="mailto:listmail@philipreames.com"
                          moz-do-not-send="true">&lt;listmail@philipreames.com&gt;</a>
                        <br>
                        <b>Sent:</b> Friday, August 14, 2020 8:29 PM<br>
                        <b>To:</b> Luo, Yuanke <a
                          href="mailto:yuanke.luo@intel.com"
                          moz-do-not-send="true">&lt;yuanke.luo@intel.com&gt;</a>;
                        <a href="mailto:llvm-dev@lists.llvm.org"
                          moz-do-not-send="true">llvm-dev@lists.llvm.org</a>;
                        <a href="mailto:florian_hahn@apple.com"
                          moz-do-not-send="true">florian_hahn@apple.com</a>; Kaylor, Andrew <a
                          href="mailto:andrew.kaylor@intel.com"
                          moz-do-not-send="true">&lt;andrew.kaylor@intel.com&gt;</a>; Topper,
                        Craig <a href="mailto:craig.topper@intel.com"
                          moz-do-not-send="true">&lt;craig.topper@intel.com&gt;</a>; Lu,
                        Hongjiu <a href="mailto:hongjiu.lu@intel.com"
                          moz-do-not-send="true">&lt;hongjiu.lu@intel.com&gt;</a><br>

                        <b>Subject:</b> Re: [llvm-dev] Intel AMX

                        programming model discussion.<o:p></o:p></p>

                    </div>

                  </div>

                  <p class="MsoNormal"> <o:p></o:p></p>

                  <p>I find your answer unconvincing.  I'm not going to

                    debate it as I don't wish to take the time to build

                    the appropriate context, but my initial response is

                    skepticism.<o:p></o:p></p>

                  <p>Philip<o:p></o:p></p>

                  <div>

                    <p class="MsoNormal">On 8/14/20 4:49 PM, Luo, Yuanke

                      wrote:<o:p></o:p></p>

                  </div>

                  <blockquote

                    style="margin-top:5.0pt;margin-bottom:5.0pt">

                    <p class="MsoNormal">[Yuanke] AMX registers are
                      special. They need to be configured before use, and
                      the configuration instruction is expensive. To avoid
                      unnecessary tile configuration, we collect as much
                      tile shape information as possible and combine it
                      into one ldtilecfg instruction. The ldtilecfg
                      instruction should dominate every AMX instruction
                      that accesses a tile register. Conversely, the
                      ldtilecfg should post-dominate the instructions
                      that define the tile shapes. For tile register
                      spills, re-configuration due to a different tile
                      shape should be avoided: a spilled register should
                      be reloaded into a register that has the same
                      tile shape. Since tile register allocation is
                      special and may need to allocate general virtual
                      registers to configure the tile registers, we can
                      add a separate pass to do it before the general
                      register allocation pass. After register allocation,
                      the tile shape information is no longer needed, so
                      we can transform the pseudo AMX instructions into
                      real AMX instructions by removing the row and
                      column operands.<o:p></o:p></p>

                    <p>[Philip]<o:p></o:p></p>

                    <p>This seems complicated.<o:p></o:p></p>

                    <p>Reading through the documentation, there appears

                      to be a single global tile config for all tile

                      registers at any time.<o:p></o:p></p>

                    <p>Why not simply model this tile config as a

                      designated special register and the tile

                      instructions as having an implicit use of this

                      register?  That would seem to ensure that the

                      register allocator has all the constraints

                      needed.  You'd need to teach it how to spill the

                      special registers with the appropriate

                      instructions, but that seems a lot more
                      straightforward?<o:p></o:p></p>

                    <p class="MsoNormal"><span
                        style="font-size:10.5pt;line-height:105%">[Yuanke]
                        In that case the user needs to configure the tile
                        registers themselves. Spilling the configure
                        register is very expensive, because it clears
                        all the tile data registers to zero. In our
                        proposal, the compiler is responsible for
                        deducing the shape of each virtual tile data
                        register, allocating physical registers for them,
                        and then configuring those physical registers. We
                        may build the dependency as you proposed, and it
                        can be used as a machine IR check to ensure that
                        a tile data register is configured before use. </span><o:p></o:p></p>

                    <p class="MsoNormal"><span

                        style="font-size:10.5pt;line-height:105%"> </span><o:p></o:p></p>

                    <div>

                      <div style="border:none;border-top:solid #E1E1E1

                        1.0pt;padding:3.0pt 0in 0in 0in">

                        <p class="MsoNormal"

                          style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">

                          <b>From:</b> Philip Reames <a

                            href="mailto:listmail@philipreames.com"

                            moz-do-not-send="true"><listmail@philipreames.com></a>

                          <br>

                          <b>Sent:</b> Saturday, August 15, 2020 1:17 AM<br>

                          <b>To:</b> Luo, Yuanke <a

                            href="mailto:yuanke.luo@intel.com"

                            moz-do-not-send="true"><yuanke.luo@intel.com></a>;

                          <a href="mailto:llvm-dev@lists.llvm.org"

                            moz-do-not-send="true">llvm-dev@lists.llvm.org</a>;

                          <a href="mailto:florian_hahn@apple.com"

                            moz-do-not-send="true">

                            florian_hahn@apple.com</a>; Kaylor, Andrew <a

                            href="mailto:andrew.kaylor@intel.com"

                            moz-do-not-send="true">

                            <andrew.kaylor@intel.com></a>; Topper,

                          Craig <a href="mailto:craig.topper@intel.com"

                            moz-do-not-send="true">

                            <craig.topper@intel.com></a>; Lu,

                          Hongjiu <a href="mailto:hongjiu.lu@intel.com"

                            moz-do-not-send="true"><hongjiu.lu@intel.com></a><br>

                          <b>Subject:</b> Re: [llvm-dev] Intel AMX

                          programming model discussion.<o:p></o:p></p>

                      </div>

                    </div>

                    <p class="MsoNormal"> <o:p></o:p></p>

                    <p> <o:p></o:p></p>

                    <div>

                      <p class="MsoNormal">On 8/14/20 6:27 AM, Luo,

                        Yuanke via llvm-dev wrote:<o:p></o:p></p>

                    </div>

                    <blockquote

                      style="margin-top:5.0pt;margin-bottom:5.0pt">

                      <p class="MsoNormal">Hi,<o:p></o:p></p>

                      <p class="MsoNormal">Intel Advanced Matrix
                        Extensions (Intel AMX) is a new programming
                        paradigm consisting of two components: a set of
                        2-dimensional registers (tiles) representing
                        sub-arrays from a larger 2-dimensional memory
                        image, and accelerators able to operate on
                        tiles. The capabilities of an Intel AMX
                        implementation are enumerated by palettes. Two
                        palettes are supported: palette 0 represents the
                        initialized state, and palette 1 consists of 8
                        tile registers of up to 1 KB each, which are
                        controlled by a tile control register.<o:p></o:p></p>

                      <p class="MsoNormal">The instruction manual is

                        posted at <a

href="https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html"

                          moz-do-not-send="true">

https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html</a>.<o:p></o:p></p>

                      <p class="MsoNormal">The AMX ABI proposal is

                        posted at <a

                          href="https://groups.google.com/g/x86-64-abi/c/NRejFm7pwb4"

                          moz-do-not-send="true">

https://groups.google.com/g/x86-64-abi/c/NRejFm7pwb4</a>.<o:p></o:p></p>

                      <p class="MsoNormal">This email is to discuss the
                        programming model for AMX. Florian has
                        introduced the matrix type and intrinsics to the
                        LLVM community. We’d like to adopt some ideas
                        from that work.<o:p></o:p></p>

                      <p class="MsoNormal">Here is what we propose for

                        the AMX programming model.<o:p></o:p></p>

                      <p class="MsoListParagraph"

                        style="margin-left:.5in;text-indent:-.25in;mso-list:l1

                        level1 lfo4">

                        <!--[if !supportLists]--><span

                          style="mso-list:Ignore">1.<span

                            style="font:7.0pt "Times New

                            Roman"">      

                          </span></span><!--[endif]--> Data type. <o:p></o:p></p>

                      <p class="MsoNormal">We’d like to have a fixed
                        vector type for AMX. Since the shape of an AMX
                        register is configurable, the vector size is the
                        maximum size of an AMX register. That means the
                        vector size is 1024 bytes.<o:p></o:p></p>

                      <p class="MsoNormal">The C code may look like

                        this.<o:p></o:p></p>

                      <p class="MsoNormal">typedef int _tile_data

                        __attribute__((__vector_size__(1024),

                        __aligned__(64)));<o:p></o:p></p>

                      <p class="MsoNormal">_tile_data tile;<o:p></o:p></p>

                      <p class="MsoNormal">And the LLVM IR may look like

                        this.<o:p></o:p></p>

                      <p class="MsoNormal">@tile = dso_local

                        local_unnamed_addr global <256 x i32>

                        zeroinitializer, align 64<o:p></o:p></p>

                      <p class="MsoNormal">For LLVM IR, it would be nice
                        to have a new type, x86_amxtile, that can be
                        mapped to AMX registers.<o:p></o:p></p>
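                      <p class="MsoNormal">To make the size arithmetic
                        concrete, here is a minimal sketch. It reuses the
                        typedef above and checks that 1024 bytes
                        corresponds to the 256 lanes of <256 x i32>. The
                        16-row by 64-byte split of a tile and the helper
                        names (tile_used_bytes, tile_i32_lanes) are
                        assumptions for illustration; they are not part
                        of the proposal.<o:p></o:p></p>

```c
#include <assert.h>
#include <stddef.h>

/* Assumed palette 1 limits: 16 rows of 64 bytes, 1 KB per tile. */
enum { TILE_MAX_ROWS = 16, TILE_MAX_COLSB = 64, TILE_MAX_BYTES = 1024 };

/* Same typedef as in the proposal (GCC/Clang vector extension). */
typedef int _tile_data __attribute__((__vector_size__(1024), __aligned__(64)));

/* The fixed vector type always covers the largest possible tile. */
_Static_assert(sizeof(_tile_data) == TILE_MAX_BYTES, "tile type must be 1 KB");

/* Bytes actually used by a tile configured as rows x colsb
   (colsb is the row width in bytes). Hypothetical helper. */
static size_t tile_used_bytes(unsigned rows, unsigned colsb) {
  assert(rows <= TILE_MAX_ROWS && colsb <= TILE_MAX_COLSB);
  return (size_t)rows * colsb;
}

/* Number of 32-bit lanes in the fixed vector type. Hypothetical helper. */
static size_t tile_i32_lanes(void) {
  return sizeof(_tile_data) / sizeof(int);
}
```

                      <p class="MsoNormal">A tile configured with a
                        smaller shape still occupies the full 1024-byte
                        vector type; only part of it carries data.<o:p></o:p></p>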

                      <p class="MsoListParagraph"

                        style="margin-left:.5in;text-indent:-.25in;mso-list:l1

                        level1 lfo4">

                        <!--[if !supportLists]--><span

                          style="mso-list:Ignore">2.<span

                            style="font:7.0pt "Times New

                            Roman"">      

                          </span></span><!--[endif]-->AMX Intrinsics. <o:p></o:p></p>

                      <p class="MsoNormal">The internal intrinsics are
                        mapped 1:1 to AMX instructions. The parameters m,
                        n, and k identify the shape of the tile. The
                        shape can be variable, but it cannot exceed the
                        size that the AMX hardware supports. The compiler
                        can deduce the shape of a tile from the AMX
                        intrinsics.<o:p></o:p></p>

                      <p class="MsoNormal" style="text-indent:5.5pt">_tile_data

                        _tile_loadd_internal(char m, short n, const void

                        *base, int stride);<o:p></o:p></p>

                      <p class="MsoNormal">_tile_data

                        _tile_dpbssd_internal(char m, short n, short k,

                        _tile_data dst, _tile_data src1, _tile_data

                        src2);<o:p></o:p></p>

                      <p class="MsoNormal">_tile_data

                        _tile_dpbf16ps_internal(char m, short n, short

                        k, _tile_data dst, _tile_data src1, _tile_data

                        src2);<o:p></o:p></p>

                      <p class="MsoNormal">void

                        _tile_stored_internal(char m, short n, void

                        *base, int stride, _tile_data tile);<o:p></o:p></p>
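                      <p class="MsoNormal">As a rough illustration of
                        the shape limit, the sketch below checks whether
                        a given (m, n, k) fits the hardware for a dot
                        product. It assumes that n and k are row widths
                        in bytes, that a tile has at most 16 rows of 64
                        bytes, and that the operands are dst = m x n,
                        src1 = m x k, src2 = (k/4) x n, which is my
                        reading of the instruction manual rather than
                        something stated in this proposal. The names
                        tile_shape, shape_fits, dp_shapes_ok and
                        dp_dst_shape are hypothetical.<o:p></o:p></p>

```c
#include <assert.h>
#include <stdbool.h>

/* rows and row width in bytes of one tile (hypothetical type). */
typedef struct {
  unsigned rows;
  unsigned colsb;
} tile_shape;

/* Assumed hardware limit: at most 16 rows of at most 64 bytes. */
static bool shape_fits(tile_shape s) {
  return s.rows >= 1 && s.rows <= 16 && s.colsb >= 1 && s.colsb <= 64;
}

/* Check the three operand shapes implied by (m, n, k):
   dst is m x n, src1 is m x k, src2 is (k/4) x n. */
static bool dp_shapes_ok(unsigned m, unsigned n, unsigned k) {
  if (n % 4 != 0 || k % 4 != 0)
    return false; /* rows are made of 4-byte groups */
  tile_shape dst = {m, n};
  tile_shape src1 = {m, k};
  tile_shape src2 = {k / 4, n};
  return shape_fits(dst) && shape_fits(src1) && shape_fits(src2);
}

/* Shape of the result tile; the caller must pass a valid (m, n, k). */
static tile_shape dp_dst_shape(unsigned m, unsigned n, unsigned k) {
  assert(dp_shapes_ok(m, n, k));
  tile_shape dst = {m, n};
  return dst;
}
```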

                      <p class="MsoListParagraph"

                        style="margin-left:.5in;text-indent:-.25in;mso-list:l1

                        level1 lfo4">

                        <!--[if !supportLists]--><span

                          style="mso-list:Ignore">3.<span

                            style="font:7.0pt "Times New

                            Roman"">      

                          </span></span><!--[endif]-->User interfaces.<o:p></o:p></p>

                      <p class="MsoNormal">The tile shape and tile data
                        are combined into a struct in C. The shape of a
                        tile may be initialized only once. The user
                        interface looks like this.<o:p></o:p></p>

                      <p class="MsoNormal">#define __DEFAULT_FN_AMX \<o:p></o:p></p>
                      <p class="MsoNormal">  __attribute__((__always_inline__, __nodebug__, __target__("amx-int8")))<o:p></o:p></p>
                      <p class="MsoNormal"> <o:p></o:p></p>
                      <p class="MsoNormal">/* The shape is const, so it can only be set when the struct is initialized. */<o:p></o:p></p>
                      <p class="MsoNormal">typedef struct __tile_str {<o:p></o:p></p>
                      <p class="MsoNormal">  const char row;<o:p></o:p></p>
                      <p class="MsoNormal">  const short col;<o:p></o:p></p>
                      <p class="MsoNormal">  _tile_data tile;<o:p></o:p></p>
                      <p class="MsoNormal">}__tile;<o:p></o:p></p>
                      <p class="MsoNormal"> <o:p></o:p></p>
                      <p class="MsoNormal">__DEFAULT_FN_AMX<o:p></o:p></p>
                      <p class="MsoNormal">void __tile_loadd(__tile *dst, const void *base, long stride) {<o:p></o:p></p>
                      <p class="MsoNormal">  dst->tile = _tile_loadd_internal(dst->row, dst->col, base, stride);<o:p></o:p></p>
                      <p class="MsoNormal">}<o:p></o:p></p>
                      <p class="MsoNormal"> <o:p></o:p></p>
                      <p class="MsoNormal">/* m = src1.row, n = src2.col, k = src1.col */<o:p></o:p></p>
                      <p class="MsoNormal">__DEFAULT_FN_AMX<o:p></o:p></p>
                      <p class="MsoNormal">void __tile_dpbsud(__tile *dst, __tile src1, __tile src2) {<o:p></o:p></p>
                      <p class="MsoNormal">  dst->tile = _tile_dpbssd_internal(src1.row, src2.col, src1.col, dst->tile, src1.tile, src2.tile);<o:p></o:p></p>
                      <p class="MsoNormal">}<o:p></o:p></p>
                      <p class="MsoNormal"> <o:p></o:p></p>
                      <p class="MsoNormal">__DEFAULT_FN_AMX<o:p></o:p></p>
                      <p class="MsoNormal">void __tile_stored(void *base, long stride, __tile src) {<o:p></o:p></p>
                      <p class="MsoNormal">  _tile_stored_internal(src.row, src.col, base, stride, src.tile);<o:p></o:p></p>
                      <p class="MsoNormal">}<o:p></o:p></p>

                      <p class="MsoNormal"> <o:p></o:p></p>

                      <p class="MsoListParagraph"

                        style="margin-left:.5in;text-indent:-.25in;mso-list:l1

                        level1 lfo4">

                        <!--[if !supportLists]--><span

                          style="mso-list:Ignore">4.<span

                            style="font:7.0pt "Times New

                            Roman"">      

                          </span></span><!--[endif]-->Example code<o:p></o:p></p>

                      <p class="MsoNormal">The example shows how to use

                        the user interface in a function.

                        <o:p></o:p></p>

                      <p class="MsoNormal">void api(int cond, short row, short col) {<o:p></o:p></p>
                      <p class="MsoNormal">  __tile a = {row, col};<o:p></o:p></p>
                      <p class="MsoNormal">  __tile b = {row, col};<o:p></o:p></p>
                      <p class="MsoNormal">  __tile c = {row, col};<o:p></o:p></p>
                      <p class="MsoNormal"> <o:p></o:p></p>
                      <p class="MsoNormal">  if(cond) {<o:p></o:p></p>
                      <p class="MsoNormal">    __tile_loadd(&a, buf, STRIDE);<o:p></o:p></p>
                      <p class="MsoNormal">    __tile_loadd(&b, buf, STRIDE);<o:p></o:p></p>
                      <p class="MsoNormal">    __tile_loadd(&c, buf, STRIDE);<o:p></o:p></p>
                      <p class="MsoNormal">  } else {<o:p></o:p></p>
                      <p class="MsoNormal">    __tile_loadd(&a, buf2, STRIDE);<o:p></o:p></p>
                      <p class="MsoNormal">    __tile_loadd(&b, buf2, STRIDE);<o:p></o:p></p>
                      <p class="MsoNormal">    __tile_loadd(&c, buf2, STRIDE);<o:p></o:p></p>
                      <p class="MsoNormal">  }<o:p></o:p></p>
                      <p class="MsoNormal">  __tile_dpbsud(&c, a, b);<o:p></o:p></p>
                      <p class="MsoNormal">  __tile_stored(buf, STRIDE, c);<o:p></o:p></p>
                      <p class="MsoNormal">}<o:p></o:p></p>

                      <p class="MsoListParagraph"

                        style="margin-left:.5in;text-indent:-.25in;mso-list:l1

                        level1 lfo4">

                        <!--[if !supportLists]--><span

                          style="mso-list:Ignore">5.<span

                            style="font:7.0pt "Times New

                            Roman"">      

                          </span></span><!--[endif]-->LLVM IR<o:p></o:p></p>

                      <p class="MsoNormal">The LLVM IR intrinsics take
                        the row and column information as input
                        parameters, so that the compiler can deduce the
                        shape of the tile data. The remaining parameters
                        are what the AMX instructions require. This is
                        the LLVM IR corresponding to the example code.<o:p></o:p></p>

                      <p class="MsoNormal">define dso_local void @api(i32 %cond, i16 signext %row, i16 signext %col) local_unnamed_addr #2 {<o:p></o:p></p>
                      <p class="MsoNormal">entry:<o:p></o:p></p>
                      <p class="MsoNormal">  %tobool = icmp eq i32 %cond, 0<o:p></o:p></p>
                      <p class="MsoNormal">  %sext = shl i16 %col, 8<o:p></o:p></p>
                      <p class="MsoNormal">  %conv.i31 = ashr exact i16 %sext, 8<o:p></o:p></p>
                      <p class="MsoNormal">  br i1 %tobool, label %if.else, label %if.then<o:p></o:p></p>
                      <p class="MsoNormal"> <o:p></o:p></p>
                      <p class="MsoNormal">if.then:                                          ; preds = %entry<o:p></o:p></p>
                      <p class="MsoNormal">  %0 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3<o:p></o:p></p>
                      <p class="MsoNormal">  %1 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3<o:p></o:p></p>
                      <p class="MsoNormal">  %2 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3<o:p></o:p></p>
                      <p class="MsoNormal">  br label %if.end<o:p></o:p></p>
                      <p class="MsoNormal"> <o:p></o:p></p>
                      <p class="MsoNormal">if.else:                                          ; preds = %entry<o:p></o:p></p>
                      <p class="MsoNormal">  %3 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3<o:p></o:p></p>
                      <p class="MsoNormal">  %4 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3<o:p></o:p></p>
                      <p class="MsoNormal">  %5 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3<o:p></o:p></p>
                      <p class="MsoNormal">  br label %if.end<o:p></o:p></p>
                      <p class="MsoNormal"> <o:p></o:p></p>
                      <p class="MsoNormal">if.end:                                           ; preds = %if.else, %if.then<o:p></o:p></p>
                      <p class="MsoNormal">  %a.sroa.1186.0 = phi <256 x i32> [ %3, %if.else ], [ %0, %if.then ]<o:p></o:p></p>
                      <p class="MsoNormal">  %b.sroa.1068.0 = phi <256 x i32> [ %4, %if.else ], [ %1, %if.then ]<o:p></o:p></p>
                      <p class="MsoNormal">  %c.sroa.1149.0 = phi <256 x i32> [ %5, %if.else ], [ %2, %if.then ]<o:p></o:p></p>
                      <p class="MsoNormal">  %6 = tail call <256 x i32> @llvm.x86.tdpbssd(i16 %row, i16 %conv.i31, i16 %conv.i31, <256 x i32> %c.sroa.1149.0, <256 x i32> %a.sroa.1186.0, <256 x i32> %b.sroa.1068.0) #3<o:p></o:p></p>
                      <p class="MsoNormal">  tail call void @llvm.x86.tilestored64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32, <256 x i32> %6) #3<o:p></o:p></p>
                      <p class="MsoNormal">  ret void<o:p></o:p></p>
                      <p class="MsoNormal">}<o:p></o:p></p>

                      <p class="MsoListParagraph"

                        style="margin-left:.5in;text-indent:-.25in;mso-list:l1

                        level1 lfo4">

                        <!--[if !supportLists]--><span

                          style="mso-list:Ignore">6.<span

                            style="font:7.0pt "Times New

                            Roman"">      

                          </span></span><!--[endif]-->Shape propagation<o:p></o:p></p>

                      <p class="MsoNormal">In an -O0 build, the front end
                        generates some general loads and stores for the
                        tile vector. We need to start from the AMX
                        intrinsics and propagate the shape information to
                        the virtual tile registers. If an AMX intrinsic
                        uses the result of a load instruction, the shape
                        is propagated to the load and the load is
                        transformed into a tile load intrinsic. If a
                        store instruction uses the result of an AMX
                        intrinsic, the shape is propagated to the store
                        and the store is transformed into a tile store
                        intrinsic.<o:p></o:p></p>
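                      <p class="MsoNormal">The two rules above can be
                        sketched on a toy instruction list. This is only
                        an illustration, not the LLVM pass: the types
                        kind and inst and the function propagate_shapes
                        are hypothetical, each instruction has a single
                        operand, and the operand is assumed to have the
                        same shape as the intrinsic that uses it.<o:p></o:p></p>

```c
#include <assert.h>
#include <stddef.h>

/* Toy instruction kinds: plain vector load/store, an AMX intrinsic,
   and the tile load/store forms they are rewritten into. */
typedef enum { VEC_LOAD, VEC_STORE, AMX_OP, TILE_LOAD, TILE_STORE } kind;

typedef struct {
  kind k;
  int src;        /* index of the instruction whose result is used, or -1 */
  int rows, cols; /* shape; 0 means "not known yet" */
} inst;

/* Start from the AMX intrinsics and push their shape to neighbouring
   plain loads and stores, rewriting them into tile load/store. */
static void propagate_shapes(inst *code, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    if (code[i].src < 0)
      continue;
    assert((size_t)code[i].src < n);
    inst *def = &code[code[i].src];
    if (code[i].k == AMX_OP && def->k == VEC_LOAD) {
      /* Rule 1: an AMX intrinsic uses the result of a load. */
      def->k = TILE_LOAD;
      def->rows = code[i].rows;
      def->cols = code[i].cols;
    } else if (code[i].k == VEC_STORE && def->k == AMX_OP) {
      /* Rule 2: a store uses the result of an AMX intrinsic. */
      code[i].k = TILE_STORE;
      code[i].rows = def->rows;
      code[i].cols = def->cols;
    }
  }
}
```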

                      <p class="MsoListParagraph"

                        style="margin-left:.5in;text-indent:-.25in;mso-list:l1

                        level1 lfo4">

                        <!--[if !supportLists]--><span

                          style="mso-list:Ignore">7.<span

                            style="font:7.0pt "Times New

                            Roman"">      

                          </span></span><!--[endif]-->Machine IR<o:p></o:p></p>

                      <p class="MsoNormal">Since the AMX intrinsics take
                        the row and column as input parameters, we can
                        create a pseudo instruction corresponding to
                        each of them. The AMX intrinsics are lowered to
                        pseudo AMX instructions, which carry extra row
                        and column operands corresponding to the
                        intrinsic parameters. The real AMX instructions
                        don’t need the row and column operands. The row
                        and column information should be configured by
                        ldtilecfg before executing any AMX instruction.<o:p></o:p></p>
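                      <p class="MsoNormal">A minimal sketch of the
                        pseudo-to-real step, on a toy machine instruction
                        rather than LLVM's MachineInstr. The struct
                        minst, the operand order (row, col, then the real
                        operands) and the function lower_pseudo are
                        assumptions made for the example.<o:p></o:p></p>

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_OPS = 8, SHAPE_OPS = 2 };

/* Toy machine instruction. In the pseudo form, ops[0] is the row,
   ops[1] is the column, and the real operands follow. */
typedef struct {
  int is_pseudo;
  size_t num_ops;
  int ops[MAX_OPS];
} minst;

/* Turn a pseudo AMX instruction into the real one by dropping the
   leading row and column operands. Real instructions are left alone. */
static void lower_pseudo(minst *mi) {
  if (!mi->is_pseudo)
    return;
  assert(mi->num_ops >= (size_t)SHAPE_OPS && mi->num_ops <= (size_t)MAX_OPS);
  for (size_t i = SHAPE_OPS; i < mi->num_ops; ++i)
    mi->ops[i - SHAPE_OPS] = mi->ops[i];
  mi->num_ops -= SHAPE_OPS;
  mi->is_pseudo = 0;
}
```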

                      <p class="MsoListParagraph"

                        style="margin-left:.5in;text-indent:-.25in;mso-list:l1

                        level1 lfo4">

                        <!--[if !supportLists]--><span

                          style="mso-list:Ignore">8.<span

                            style="font:7.0pt "Times New

                            Roman"">      

                          </span></span><!--[endif]-->Register

                        allocation<o:p></o:p></p>

                      <p class="MsoNormal">AMX registers are special.
                        They need to be configured before use, and the
                        config instruction is expensive. To avoid
                        unnecessary tile configuration, we collect as
                        much tile shape information as possible and
                        combine it into one ldtilecfg instruction. The
                        ldtilecfg instruction should dominate any AMX
                        instruction that accesses a tile register. On the
                        other hand, the ldtilecfg should post-dominate
                        the instructions that define the tile shape. A
                        tile register spill should avoid a re-config
                        caused by a different tile shape, so the spilled
                        register should be reloaded into a register that
                        shares the same tile shape. Since tile register
                        allocation is special and may need to allocate
                        general virtual registers to configure the tile
                        registers, we can add a separate pass to do it
                        before the general register allocation pass.
                        After register allocation, the tile shape
                        information is no longer needed, so we can
                        transform each pseudo AMX instruction into a real
                        AMX instruction by removing the row and column
                        operands.<o:p></o:p></p>
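                      <p class="MsoNormal">To show what "combine into one
                        ldtilecfg" amounts to, the sketch below packs the
                        shapes of all eight tiles into a single 64-byte
                        configuration image. The byte layout (palette id
                        at byte 0, 16-bit column widths in bytes starting
                        at byte 16, row counts starting at byte 48) is my
                        understanding of the instruction manual and
                        should be checked against it. The names
                        tile_shape and build_tilecfg are hypothetical.<o:p></o:p></p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout of the 64-byte ldtilecfg memory operand. */
enum { NUM_TILES = 8, CFG_BYTES = 64, COLSB_OFF = 16, ROWS_OFF = 48 };

/* Shape of one physical tile register (hypothetical type). */
typedef struct {
  uint8_t rows;   /* number of rows, at most 16 */
  uint16_t colsb; /* bytes per row, at most 64 */
} tile_shape;

/* Pack the shapes of all tiles into one configuration image, so that
   a single ldtilecfg can configure every tile used by the function. */
static void build_tilecfg(uint8_t cfg[CFG_BYTES],
                          const tile_shape shapes[NUM_TILES]) {
  memset(cfg, 0, CFG_BYTES); /* reserved bytes must stay zero */
  cfg[0] = 1;                /* palette 1; byte 1 (start row) stays 0 */
  for (int t = 0; t < NUM_TILES; ++t) {
    assert(shapes[t].rows <= 16 && shapes[t].colsb <= 64);
    /* column width is stored as a little-endian 16-bit value */
    cfg[COLSB_OFF + 2 * t] = (uint8_t)(shapes[t].colsb & 0xff);
    cfg[COLSB_OFF + 2 * t + 1] = (uint8_t)(shapes[t].colsb >> 8);
    cfg[ROWS_OFF + t] = shapes[t].rows;
  }
}
```

                      <p class="MsoNormal">A tile left with zero rows and
                        zero columns in this image is simply not
                        configured, which is why the pass wants to know
                        every shape before it emits the ldtilecfg.<o:p></o:p></p>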

                    </blockquote>

                    <p>This seems complicated.<o:p></o:p></p>

                    <p>Reading through the documentation, there appears

                      to be a single global tile config for all tile

                      registers at any time.<o:p></o:p></p>

                    <p>Why not simply model this tile config as a

                      designated special register and the tile

                      instructions as having an implicit use of this

                      register?  That would seem to ensure that the

                      register allocator has all the constraints

                      needed.  You'd need to teach it how to spill the

                      special registers with the appropriate

                      instructions, but that seems a lot more
                      straightforward?<o:p></o:p></p>

                    <blockquote

                      style="margin-top:5.0pt;margin-bottom:5.0pt">

                      <p class="MsoListParagraph"

                        style="margin-left:.5in;text-indent:-.25in;mso-list:l1

                        level1 lfo4">

                        <!--[if !supportLists]--><span

                          style="mso-list:Ignore">9.<span

                            style="font:7.0pt "Times New

                            Roman"">      

                          </span></span><!--[endif]-->Use recommendation

                        <o:p></o:p></p>

                      <p class="MsoNormal">Due to the shape
                        configuration issue, we recommend that users
                        define the tile shape at function entry and
                        inline functions as much as possible. The AMX
                        instructions focus on computation rather than
                        storage, so global variables for tile data are
                        not recommended.<o:p></o:p></p>

                      <p class="MsoNormal"><span

                          style="font-size:10.5pt;line-height:105%"> </span><o:p></o:p></p>

                      <p class="MsoNormal"><span

                          style="font-size:10.5pt;line-height:105%">Thanks</span><o:p></o:p></p>

                      <p class="MsoNormal"><span

                          style="font-size:10.5pt;line-height:105%">Yuanke</span><o:p></o:p></p>


                      <pre>_______________________________________________<o:p></o:p></pre>

                      <pre>LLVM Developers mailing list<o:p></o:p></pre>

                      <pre><a href="mailto:llvm-dev@lists.llvm.org" moz-do-not-send="true">llvm-dev@lists.llvm.org</a><o:p></o:p></pre>

                      <pre><a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" moz-do-not-send="true">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><o:p></o:p></pre>

                    </blockquote>

                  </blockquote>


                </blockquote>

                <pre>-- <o:p></o:p></pre>

                <pre>Hal Finkel<o:p></o:p></pre>

                <pre>Lead, Compiler Technology and Programming Languages<o:p></o:p></pre>

                <pre>Leadership Computing Facility<o:p></o:p></pre>

                <pre>Argonne National Laboratory<o:p></o:p></pre>

              </blockquote>


            </blockquote>


          </blockquote>


        </blockquote>


      </div>

    </blockquote>

    <pre class="moz-signature" cols="72">-- 

Hal Finkel

Lead, Compiler Technology and Programming Languages

Leadership Computing Facility

Argonne National Laboratory</pre>

  </body>

</html>