<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
  </head>
  <body>
    <p><br>
    </p>
    <div class="moz-cite-prefix">On 8/20/20 3:50 PM, Topper, Craig
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:MWHPR11MB004601229570D3828D7FBCC6935A0@MWHPR11MB0046.namprd11.prod.outlook.com">
      
      <meta name="Generator" content="Microsoft Word 15 (filtered
        medium)">
      <style><!--
/* Font Definitions */
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin-top:0in;
        margin-right:0in;
        margin-bottom:8.0pt;
        margin-left:0in;
        line-height:105%;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:#0563C1;
        text-decoration:underline;}
pre
        {mso-style-priority:99;
        mso-style-link:"HTML Preformatted Char";
        margin:0in;
        margin-bottom:.0001pt;
        font-size:10.0pt;
        font-family:"Courier New";}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
        {mso-style-priority:34;
        margin-top:0in;
        margin-right:0in;
        margin-bottom:8.0pt;
        margin-left:0in;
        text-indent:21.0pt;
        line-height:105%;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;}
span.HTMLPreformattedChar
        {mso-style-name:"HTML Preformatted Char";
        mso-style-priority:99;
        mso-style-link:"HTML Preformatted";
        font-family:"Courier New";}
span.EmailStyle23
        {mso-style-type:personal-reply;
        font-family:"Calibri",sans-serif;
        color:windowtext;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-size:10.0pt;}
@page WordSection1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
        {page:WordSection1;}
ol
        {margin-bottom:0in;}
ul
        {margin-bottom:0in;}</style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
      <div class="WordSection1">
        <p class="MsoNormal">Ignore my spill comment for now. That’s
          more of an optimization.<o:p></o:p></p>
        <p class="MsoNormal">Let’s say I have a 2x3 tile and a 3x2 tile, and I
          multiply them to make a 2x2 tile. I have 3 different sizes of
          tiles, so my instruction uses 3 different register classes for
          its virtual registers.<o:p></o:p></p>
        <p class="MsoNormal">The pass that inserts the ldtilecfg needs
          to configure the physical tiles, so let’s say it configures tmm0
          as 2x3, tmm1 as 3x2, and tmm2 as 2x2.<o:p></o:p></p>
        <p class="MsoNormal">Register classes, as I know them in LLVM,
          have a static list of physical registers in them. So do all 3 of
          the register classes for my virtual registers contain all 8
          physical tmm registers? If so, how does the register allocator know
          to use tmm0 for the 2x3 virtual register, tmm1 for the 3x2
          virtual register, and tmm2 for the 2x2 virtual register?<o:p></o:p></p>
        <p class="MsoNormal">~Craig</p>
      </div>
    </blockquote>
    <p><br>
    </p>
    <p>Ah, okay. I think I see why we're not on the same page. The
      architectural definition has 8 tile registers, tmm0-tmm7, but I
      was thinking that you would not model it that way. Instead, we
      could have registers:</p>
    <p>tmm0_1x1 ... tmm7_1x1</p>
    <p>...</p>
    <p>tmm0_16x16 ... tmm7_16x16</p>
    <p>where tmm0_1x1 aliases tmm0_1x2, ..., tmm0_16x16, and so on.<br>
    </p>
    <p>and corresponding register classes RegClassTmm1x1, ...,
      RegClassTmm16x16 (I don't mean to imply this exact naming
      convention). So, within each region, you assign the relevant
      virtual registers a register class of RegClassTmm1x1, or
      whatever, and then once register allocation is done, you adjust
      the ldtilecfg data for each region so that whatever registers
      were assigned actually have the right tile sizes.</p>
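<p>A minimal C++ sketch of the naming and aliasing scheme described above, assuming 8 architectural tiles and one class per shape from 1x1 to 16x16. The names ShapedTileReg, aliases, and regClassFor are hypothetical; in LLVM itself this would presumably be expressed as TableGen register and register-class definitions rather than hand-written C++.</p>

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model: each of the 8 architectural tiles is exposed under
// one shape-qualified name per shape (tmm0_1x1 ... tmm7_16x16).
struct ShapedTileReg {
  unsigned phys;  // architectural tile index, 0..7
  unsigned rows;  // 1..16
  unsigned cols;  // 1..16
  std::string name() const {
    return "tmm" + std::to_string(phys) + "_" + std::to_string(rows) + "x" +
           std::to_string(cols);
  }
};

// Two shape-qualified registers alias iff they name the same physical tile,
// regardless of the shape part of the name.
bool aliases(const ShapedTileReg &a, const ShapedTileReg &b) {
  return a.phys == b.phys;
}

// One register class per shape; every class contains all 8 physical tiles.
std::vector<ShapedTileReg> regClassFor(unsigned rows, unsigned cols) {
  std::vector<ShapedTileReg> rc;
  for (unsigned p = 0; p < 8; ++p) rc.push_back({p, rows, cols});
  return rc;
}
```

<p>With 256 shapes this gives 256 classes of 8 registers each (2048 shape-qualified names), while the aliasing relation collapses them back onto the 8 physical tiles.</p>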
    <p>You would not want to have N^2 versions of all of the instructions
      either, but I think you can just define the instructions to
      take some overall register class (containing all of the registers)
      and then call constrainRegClass in the
      configuration-placement pass.</p>
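<p>The class-narrowing step could look roughly like this. It is a sketch under stated assumptions: the 256 shape classes are numbered (rows-1)*16 + (cols-1), and a shape that is not a compile-time constant falls back to the 16x16 class. tileClassIndex and classForOperand are hypothetical names; in a real pass the chosen class would then be applied with MachineRegisterInfo::constrainRegClass.</p>

```cpp
#include <cassert>
#include <optional>

// Hypothetical numbering: class index (rows-1)*16 + (cols-1) stands for the
// RegClassTmm<rows>x<cols> class, giving 256 classes in total.
constexpr unsigned kNumTileClasses = 16 * 16;

unsigned tileClassIndex(unsigned rows, unsigned cols) {
  assert(rows >= 1 && rows <= 16 && cols >= 1 && cols <= 16);
  return (rows - 1) * 16 + (cols - 1);
}

// The configuration-placement pass narrows an operand from the generic
// all-tiles class to a shape-specific one. If either dimension is only
// known at runtime, fall back to the 16x16 class.
unsigned classForOperand(std::optional<unsigned> rows,
                         std::optional<unsigned> cols) {
  if (rows && cols) return tileClassIndex(*rows, *cols);
  return tileClassIndex(16, 16);
}
```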
    <p>Thinking about it more, however, maybe having the different physical
      registers isn't actually needed. If you know which tile configuration
      each register needs based on the instructions, maybe you can have
      only 8 of them and just update the ldtilecfg based on the usage
      information after allocation regardless.</p>
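<p>For the 8-register variant, the post-allocation step might be sketched as follows, assuming the allocator's result for one region is available as a list of (physical tile, rows, bytes per row) records. TileUse, Shape, and shapesFromAllocation are hypothetical. The sketch only detects the case where two differently shaped values landed in the same physical tile; it does not prevent it, which is exactly the concern Craig raises.</p>

```cpp
#include <array>
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical record of one allocated tile operand in a region.
struct TileUse {
  unsigned phys;      // physical tile chosen by the allocator, 0..7
  unsigned rows;      // shape required by the instruction
  unsigned colBytes;  // bytes per row required by the instruction
};

struct Shape {
  unsigned rows = 0, colBytes = 0;  // 0x0 means "tile unused"
};

// Derive the per-tile shapes for this region's ldtilecfg from the
// allocation. Returns nullopt if one physical tile would need two shapes.
std::optional<std::array<Shape, 8>> shapesFromAllocation(
    const std::vector<TileUse> &uses) {
  std::array<Shape, 8> cfg{};
  for (const TileUse &u : uses) {
    Shape &s = cfg.at(u.phys);  // throws std::out_of_range for phys > 7
    if (s.rows == 0 && s.colBytes == 0) {
      s = {u.rows, u.colBytes};  // first use fixes the tile's shape
    } else if (s.rows != u.rows || s.colBytes != u.colBytes) {
      return std::nullopt;  // conflicting shapes for one physical tile
    }
  }
  return cfg;
}
```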
    <p> -Hal<br>
    </p>
    <p><br>
    </p>
    <blockquote type="cite"
cite="mid:MWHPR11MB004601229570D3828D7FBCC6935A0@MWHPR11MB0046.namprd11.prod.outlook.com">
      <div class="WordSection1">
        <p class="MsoNormal"><o:p></o:p></p>
        <div>
          <div style="border:none;border-top:solid #E1E1E1
            1.0pt;padding:3.0pt 0in 0in 0in">
            <p class="MsoNormal"
              style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
              <b>From:</b> Hal Finkel <a class="moz-txt-link-rfc2396E" href="mailto:hfinkel@anl.gov">&lt;hfinkel@anl.gov&gt;</a> <br>
              <b>Sent:</b> Thursday, August 20, 2020 1:27 PM<br>
              <b>To:</b> Topper, Craig <a class="moz-txt-link-rfc2396E" href="mailto:craig.topper@intel.com">&lt;craig.topper@intel.com&gt;</a>;
              Kaylor, Andrew <a class="moz-txt-link-rfc2396E" href="mailto:andrew.kaylor@intel.com">&lt;andrew.kaylor@intel.com&gt;</a>; Luo,
              Yuanke <a class="moz-txt-link-rfc2396E" href="mailto:yuanke.luo@intel.com">&lt;yuanke.luo@intel.com&gt;</a>; Philip Reames
              <a class="moz-txt-link-rfc2396E" href="mailto:listmail@philipreames.com">&lt;listmail@philipreames.com&gt;</a>;
              <a class="moz-txt-link-abbreviated" href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>; <a class="moz-txt-link-abbreviated" href="mailto:florian_hahn@apple.com">florian_hahn@apple.com</a>; Lu,
              Hongjiu <a class="moz-txt-link-rfc2396E" href="mailto:hongjiu.lu@intel.com">&lt;hongjiu.lu@intel.com&gt;</a><br>
              <b>Subject:</b> Re: [llvm-dev] Intel AMX programming model
              discussion.<o:p></o:p></p>
          </div>
        </div>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p><o:p> </o:p></p>
        <div>
          <p class="MsoNormal">On 8/20/20 2:47 PM, Topper, Craig wrote:<o:p></o:p></p>
        </div>
        <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
          <p class="MsoNormal">I think I’m still missing something here.
            The configuration is per tile. The multiply instructions
            take a MxK tile and multiply it by a KxN tile and accumulate
            into an MxN tile. So the configuration needs to know how
            many of each size of tile it needs to avoid a spill.
            Wouldn’t the register allocator then need to know which
            physical tiles have been configured to which sizes so that
            it only chooses those tiles for an operand that needs that
            size?<o:p></o:p></p>
        </blockquote>
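<p>The shape rule Craig describes (MxK times KxN accumulated into MxN) can be written down as a small check. This sketch uses logical element counts and, as a simplifying assumption, ignores the VNNI re-packing AMX applies to the B operand; TileDims, checkTileMul, and distinctShapes are hypothetical names.</p>

```cpp
#include <cassert>
#include <optional>

// Logical tile dimensions in elements (assumption: VNNI packing of the
// B operand is ignored).
struct TileDims {
  unsigned rows, cols;
  bool operator==(const TileDims &o) const {
    return rows == o.rows && cols == o.cols;
  }
};

// C (MxN) += A (MxK) * B (KxN). Returns C's shape, or nullopt if the
// operand shapes do not chain.
std::optional<TileDims> checkTileMul(TileDims a, TileDims b, TileDims c) {
  if (a.cols != b.rows) return std::nullopt;                      // K must match
  if (c.rows != a.rows || c.cols != b.cols) return std::nullopt;  // C is MxN
  return c;
}

// How many distinct tile shapes a single multiply needs configured.
unsigned distinctShapes(TileDims a, TileDims b, TileDims c) {
  unsigned n = 1;
  if (!(b == a)) ++n;
  if (!(c == a) && !(c == b)) ++n;
  return n;
}
```

<p>For the 2x3 by 3x2 example, distinctShapes returns 3, so the configuration has to provide at least one tile of each of three shapes.</p>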
        <p><o:p> </o:p></p>
        <p>Yes, I think so. But it will know, because that information is
          essentially encoded in the virtual register classes. I
          certainly could be missing something. It seems like you first
          figure that out, and then you assign virtual tile registers
          corresponding to the correct tile sizes. Perhaps this comes
          down to what you mean by "avoid a spill." We still might
          spill, and I assume that the infrastructure always needs to
          deal with that. We should continue to do instruction
          scheduling in order to minimize register pressure. Once we
          assign the right virtual register classes to the AMX
          instructions, shouldn't this happen automatically? If we do
          spill, since none of the original live ranges cross the
          ldtilecfg, there shouldn't be any fundamental issue with
          using a regular load/store spill implementation.<o:p></o:p></p>
        <p>I'm definitely not an expert in this instruction set, so I
          may just not understand some aspect of this. If there's
          something I'm overlooking, a little example would be helpful.<o:p></o:p></p>
        <p>Thanks again,<o:p></o:p></p>
        <p>Hal<o:p></o:p></p>
        <p><o:p> </o:p></p>
        <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
          <p class="MsoNormal">~Craig<o:p></o:p></p>
          <div>
            <div style="border:none;border-top:solid #E1E1E1
              1.0pt;padding:3.0pt 0in 0in 0in">
              <p class="MsoNormal"
                style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
                <b>From:</b> Hal Finkel <a
                  href="mailto:hfinkel@anl.gov" moz-do-not-send="true">&lt;hfinkel@anl.gov&gt;</a>
                <br>
                <b>Sent:</b> Thursday, August 20, 2020 12:35 PM<br>
                <b>To:</b> Topper, Craig <a
                  href="mailto:craig.topper@intel.com"
                  moz-do-not-send="true">&lt;craig.topper@intel.com&gt;</a>;
                Kaylor, Andrew
                <a href="mailto:andrew.kaylor@intel.com"
                  moz-do-not-send="true">&lt;andrew.kaylor@intel.com&gt;</a>;
                Luo, Yuanke
                <a href="mailto:yuanke.luo@intel.com"
                  moz-do-not-send="true">&lt;yuanke.luo@intel.com&gt;</a>;
                Philip Reames <a
                  href="mailto:listmail@philipreames.com"
                  moz-do-not-send="true">
                  &lt;listmail@philipreames.com&gt;</a>; <a
                  href="mailto:llvm-dev@lists.llvm.org"
                  moz-do-not-send="true">llvm-dev@lists.llvm.org</a>;
                <a href="mailto:florian_hahn@apple.com"
                  moz-do-not-send="true">florian_hahn@apple.com</a>; Lu,
                Hongjiu <a href="mailto:hongjiu.lu@intel.com"
                  moz-do-not-send="true">
                  &lt;hongjiu.lu@intel.com&gt;</a><br>
                <b>Subject:</b> Re: [llvm-dev] Intel AMX programming
                model discussion.<o:p></o:p></p>
            </div>
          </div>
          <p class="MsoNormal"> <o:p></o:p></p>
          <p> <o:p></o:p></p>
          <div>
            <p class="MsoNormal">On 8/19/20 3:09 PM, Topper, Craig
              wrote:<o:p></o:p></p>
          </div>
          <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
            <p class="MsoNormal">The width and height can be runtime
              values that we would just copy into the 64-byte configuration
              block we pass to ldtilecfg, so the code doesn’t need to be
              multiversioned. The user code would also use those values
              to update pointers in the loops they write using the
              tiles. If we can’t determine that two tiles were defined
              with the same width and height, we need to assume the shapes
              are different and try to avoid ever assigning them the same tile.<o:p></o:p></p>
            <p class="MsoNormal">Hal, in your suggestion, would the set of
              physical registers in each register class be defined
              dynamically before register allocation?<o:p></o:p></p>
          </blockquote>
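<p>Copying runtime shapes into the 64-byte block might look like the sketch below. The byte layout (palette_id at byte 0, start_row at byte 1, reserved bytes 2-15, a 16-bit bytes-per-row field per tile at bytes 16-47, and a one-byte row count per tile at bytes 48-63) is our reading of the palette-1 TILECFG format in the Intel SDM and should be checked against the manual; TileShape and packTileConfig are hypothetical names, not an existing LLVM or runtime API.</p>

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical per-tile shape; both fields may be runtime values.
struct TileShape {
  std::uint8_t rows;       // rows in the tile, 0..16 (0 = tile unused)
  std::uint16_t colBytes;  // bytes per row, 0..64 (16 x i32 = 64 bytes)
};

// Pack 8 tile shapes into the 64-byte block passed to ldtilecfg
// (layout assumed from the Intel SDM, palette 1).
std::array<std::uint8_t, 64> packTileConfig(const TileShape (&tiles)[8]) {
  std::array<std::uint8_t, 64> cfg{};  // reserved bytes and tiles 8-15 stay zero
  cfg[0] = 1;                          // palette_id 1
  cfg[1] = 0;                          // start_row 0
  for (int t = 0; t < 8; ++t) {
    const TileShape &s = tiles[t];
    if (s.rows > 16 || s.colBytes > 64)
      throw std::invalid_argument("tile shape exceeds palette 1 limits");
    // colsb[t] is a little-endian uint16 at offset 16 + 2*t.
    cfg[16 + 2 * t] = static_cast<std::uint8_t>(s.colBytes & 0xFF);
    cfg[17 + 2 * t] = static_cast<std::uint8_t>(s.colBytes >> 8);
    // rows[t] is a single byte at offset 48 + t.
    cfg[48 + t] = s.rows;
  }
  return cfg;
}
```

<p>Assuming 4-byte elements, the 2x3, 3x2, and 2x2 tiles from Craig's example would be passed as {2, 12}, {3, 8}, and {2, 8}, with the remaining five tiles left at zero.</p>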
          <p> <o:p></o:p></p>
          <p>Here's my thought:<o:p></o:p></p>
          <p>First, you have a set of intrinsics that take tile values
            along with tile configuration parameters (which, presently,
            seem just to be the sizes). These get lowered into
            pseudo-instructions that do the same. Thus, you have some
            register class that represents these arbitrarily-sized tile
            registers that you'll assign to these pseudo-instruction
            operands (i.e., they take virtual tile registers right after
            instruction selection). You might use the 16x16 tile
            register class for this purpose, but it shouldn't really
            matter.<o:p></o:p></p>
          <p>Second, you run this configuration-placement pass. This
            pass looks at all of the AMX pseudo-instructions and
            identifies regions in which the pseudo-instructions use the
            same configuration parameters (i.e., the same SSA values
            and/or constants). This pass might reorder the
            pseudo-instructions when legal in order to form larger
            regions. Then it places the ldtilecfg at the start of each
            region (in some common dominating position). ldtilecfg
            implicitly defines all of the tile registers in every
            concrete class of tile registers (all 256 of them, or
            whatever). The pseudo-instructions are replaced by real MI
            instructions taking a tile register class appropriate for
            the configuration (which will default to the 16x16 class for
            cases where the configuration is not a compile-time-known
            constant). When the configuration is a known constant, the
            instructions take operands with a register class appropriate
            for that configuration (e.g., 1x1, 4x4).<o:p></o:p></p>
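<p>The region-forming part of this second step can be sketched on straight-line code. This is a deliberately simplified version: each pseudo-instruction carries a hypothetical configKey string standing for the identity of its configuration operands, no reordering is attempted, and a new region (and thus an ldtilecfg) starts wherever the key changes. AmxPseudo and ldtilecfgInsertionPoints are made-up names.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical AMX pseudo-instruction on a straight-line sequence.
struct AmxPseudo {
  std::string op;         // e.g. "tileload", "tdp", "tilestore"
  std::string configKey;  // same SSA values/constants => same key
};

// Indices before which an ldtilecfg would be placed: the start of every
// maximal run of pseudo-instructions sharing one configuration.
std::vector<std::size_t> ldtilecfgInsertionPoints(
    const std::vector<AmxPseudo> &seq) {
  std::vector<std::size_t> points;
  for (std::size_t i = 0; i < seq.size(); ++i)
    if (i == 0 || seq[i].configKey != seq[i - 1].configKey)
      points.push_back(i);
  return points;
}
```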
          <p>Third, the rest of the framework runs as usual. Tile
            registers from the appropriate class are allocated by the
            register allocator. No live range of any virtual tile
            register can pass through the ldtilecfg (because it defines
            them all), but that's okay: by construction, none of the live
            ranges will (the configuration-placement pass ensures
            this).<o:p></o:p></p>
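<p>The invariant in the third step can be stated as a check over a linearized instruction stream. A minimal sketch, assuming live ranges are given as [def, lastUse] instruction indices and ldtilecfg positions as indices in the same stream; LiveRange and allRangesRespectConfig are hypothetical names, not LLVM's LiveIntervals API.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical live range of a virtual tile register, as instruction indices.
struct LiveRange {
  std::size_t def, lastUse;
};

// ldtilecfg (implicitly) redefines every tile register, so a value defined
// before one and used after it would be clobbered.
bool crossesConfig(const LiveRange &lr,
                   const std::vector<std::size_t> &configPoints) {
  for (std::size_t p : configPoints)
    if (lr.def < p && p < lr.lastUse) return true;
  return false;
}

// True if no live range spans any ldtilecfg position.
bool allRangesRespectConfig(const std::vector<LiveRange> &ranges,
                            const std::vector<std::size_t> &configPoints) {
  for (const LiveRange &lr : ranges)
    if (crossesConfig(lr, configPoints)) return false;
  return true;
}
```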
          <p>Thanks again,<o:p></o:p></p>
          <p>Hal<o:p></o:p></p>
          <p> <o:p></o:p></p>
          <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
            <p class="MsoNormal"> <o:p></o:p></p>
            <div>
              <div style="border:none;border-top:solid #E1E1E1
                1.0pt;padding:3.0pt 0in 0in 0in">
                <p class="MsoNormal"
                  style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
                  <b>From:</b> Hal Finkel <a
                    href="mailto:hfinkel@anl.gov" moz-do-not-send="true">&lt;hfinkel@anl.gov&gt;</a>
                  <br>
                  <b>Sent:</b> Wednesday, August 19, 2020 12:52 PM<br>
                  <b>To:</b> Kaylor, Andrew <a
                    href="mailto:andrew.kaylor@intel.com"
                    moz-do-not-send="true">&lt;andrew.kaylor@intel.com&gt;</a>;
                  Luo, Yuanke
                  <a href="mailto:yuanke.luo@intel.com"
                    moz-do-not-send="true">&lt;yuanke.luo@intel.com&gt;</a>;
                  Philip Reames <a
                    href="mailto:listmail@philipreames.com"
                    moz-do-not-send="true">
                    &lt;listmail@philipreames.com&gt;</a>; <a
                    href="mailto:llvm-dev@lists.llvm.org"
                    moz-do-not-send="true">llvm-dev@lists.llvm.org</a>;
                  <a href="mailto:florian_hahn@apple.com"
                    moz-do-not-send="true">florian_hahn@apple.com</a>;
                  Topper, Craig
                  <a href="mailto:craig.topper@intel.com"
                    moz-do-not-send="true">&lt;craig.topper@intel.com&gt;</a>;
                  Lu, Hongjiu
                  <a href="mailto:hongjiu.lu@intel.com"
                    moz-do-not-send="true">&lt;hongjiu.lu@intel.com&gt;</a><br>
                  <b>Subject:</b> Re: [llvm-dev] Intel AMX programming
                  model discussion.<o:p></o:p></p>
              </div>
            </div>
            <p class="MsoNormal"> <o:p></o:p></p>
            <p> <o:p></o:p></p>
            <div>
              <p class="MsoNormal">On 8/19/20 10:24 AM, Kaylor, Andrew
                wrote:<o:p></o:p></p>
            </div>
            <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
              <p>> When the tile shape is unknown at compile time,
                how do you plan to do the register allocation of the
                tiles? My question is: do you do the allocation for this
                case in the same way as you would if you knew the size
                was 16x16 (i.e., conservatively assume the largest
                size)?<o:p></o:p></p>
              <p class="MsoNormal">I think what will happen is that the
                registers are allocated based on a number of runtime
                values that are assumed to be different from one another
                but less than or equal to 16. So, for example, we’ll
                allocate registers for MxN tiles, NxM tiles and MxM
                tiles without knowing what M and N are. Then at runtime
                the values of these variables will be used to create the
                actual tile configuration. The instructions that need to
                know the shape take these runtime values as operands.<o:p></o:p></p>
            </blockquote>
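<p>One way to model "assume the shapes are different unless proven equal" is to keep each dimension symbolic. A sketch, assuming a dimension is either a compile-time constant or an opaque SSA value name; Dim, SymShape, and provablySameShape are hypothetical names.</p>

```cpp
#include <cassert>
#include <string>
#include <variant>

// Hypothetical: a dimension is a constant or an opaque runtime SSA value.
using Dim = std::variant<unsigned, std::string>;

struct SymShape {
  Dim rows, cols;
};

// Conservative equality: both dimensions must be the same constant or the
// same SSA value. Anything else is treated as a different shape, so the
// two tiles must not share a physical register.
bool provablySameShape(const SymShape &a, const SymShape &b) {
  return a.rows == b.rows && a.cols == b.cols;
}
```

<p>Under this rule the MxN, NxM, and MxM tiles in the example above are three distinct shapes, even if M happens to equal N at runtime.</p>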
            <p> <o:p></o:p></p>
            <p>So you're going to multiversion the code?<o:p></o:p></p>
            <p>In any case, my point is that you probably don't need a
              custom register allocator. If you just define the tile
              registers and make sure that the ldtilecfgs implicitly
              define them all, then the regular infrastructure likely
              works. You'll have a bunch of register classes, but that's
              not necessarily a problem. I recommend trying this and
              letting us know what you discover before we go down the road
              of a new, dedicated allocator just for these registers.<o:p></o:p></p>
            <p> -Hal<o:p></o:p></p>
            <p> <o:p></o:p></p>
            <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
              <p class="MsoNormal">There may be some artifacts coming
                from the front end that conservatively assume a 16x16
                tile, but I think those generally go away in SROA or
                later specialized passes. Yuanke can confirm or correct
                my understanding of this.<o:p></o:p></p>
              <p class="MsoNormal"> <o:p></o:p></p>
              <div>
                <div style="border:none;border-top:solid #E1E1E1
                  1.0pt;padding:3.0pt 0in 0in 0in">
                  <p class="MsoNormal"
                    style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
                    <b>From:</b> Hal Finkel <a
                      href="mailto:hfinkel@anl.gov"
                      moz-do-not-send="true">&lt;hfinkel@anl.gov&gt;</a>
                    <br>
                    <b>Sent:</b> Wednesday, August 19, 2020 5:14 AM<br>
                    <b>To:</b> Luo, Yuanke <a
                      href="mailto:yuanke.luo@intel.com"
                      moz-do-not-send="true">&lt;yuanke.luo@intel.com&gt;</a>;
                    Kaylor, Andrew
                    <a href="mailto:andrew.kaylor@intel.com"
                      moz-do-not-send="true">&lt;andrew.kaylor@intel.com&gt;</a>;
                    Philip Reames
                    <a href="mailto:listmail@philipreames.com"
                      moz-do-not-send="true">&lt;listmail@philipreames.com&gt;</a>;
                    <a href="mailto:llvm-dev@lists.llvm.org"
                      moz-do-not-send="true">
                      llvm-dev@lists.llvm.org</a>; <a
                      href="mailto:florian_hahn@apple.com"
                      moz-do-not-send="true">florian_hahn@apple.com</a>;
                    Topper, Craig
                    <a href="mailto:craig.topper@intel.com"
                      moz-do-not-send="true">&lt;craig.topper@intel.com&gt;</a>;
                    Lu, Hongjiu
                    <a href="mailto:hongjiu.lu@intel.com"
                      moz-do-not-send="true">&lt;hongjiu.lu@intel.com&gt;</a><br>
                    <b>Subject:</b> Re: [llvm-dev] Intel AMX programming
                    model discussion.<o:p></o:p></p>
                </div>
              </div>
              <p class="MsoNormal"> <o:p></o:p></p>
              <p> <o:p></o:p></p>
              <div>
                <p class="MsoNormal">On 8/19/20 5:34 AM, Luo, Yuanke
                  wrote:<o:p></o:p></p>
              </div>
              <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
                <p class="MsoNormal">There is no problem with having 256
                  register classes. It’s just a lot of register classes to
                  me.<o:p></o:p></p>
                <p class="MsoNormal">We don’t assume the shape of each
                  physical register is 16x16; it is defined by the user. By
                  variable shape, I mean the shape is known at runtime
                  but unknown at compile time. Take the code below
                  as an example: %row and %col are variables
                  instead of constants. The compiler recognizes
                  llvm.x86.tileloadd64 and deduces that the shape of %0 is
                  %row x %col.<o:p></o:p></p>
                <p class="MsoNormal">%0 = tail call <256 x i32>
                  @llvm.x86.tileloadd64(i16 %row, i16 %col, i8*
                  getelementptr inbounds ([1024 x i8], [1024 x i8]*
                  @buf, i64 0, i64 0), i64 32)<o:p></o:p></p>
              </blockquote>
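<p>The shape deduction Yuanke mentions can be illustrated on a toy instruction list. This is not LLVM's API: ToyInst, DeducedShape, deduceShapes, and the "tdp" opcode are hypothetical, and the rule that an accumulate result takes its accumulator's shape is an assumption made for the example.</p>

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical toy IR, just enough to show shape deduction.
struct ToyInst {
  std::string result;            // e.g. "%0"
  std::string opcode;            // "tileloadd64" or "tdp"
  std::vector<std::string> ops;  // tileloadd64: {row, col, ptr, stride}; tdp: {acc, a, b}
};

struct DeducedShape {
  std::string rows, cols;  // SSA names or constants, kept symbolic
};

std::map<std::string, DeducedShape> deduceShapes(
    const std::vector<ToyInst> &insts) {
  std::map<std::string, DeducedShape> shapes;
  for (const ToyInst &I : insts) {
    if (I.opcode == "tileloadd64") {
      // The load's row/col operands give the shape directly.
      shapes[I.result] = {I.ops.at(0), I.ops.at(1)};
    } else if (I.opcode == "tdp") {
      // Assumption: the accumulate result has the accumulator's shape.
      shapes[I.result] = shapes.at(I.ops.at(0));
    } else {
      throw std::runtime_error("unknown opcode: " + I.opcode);
    }
  }
  return shapes;
}
```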
              <p> <o:p></o:p></p>
              <p>When the tile shape is unknown at compile time, how do
                you plan to do the register allocation of the tiles? My
                question is: do you do the allocation for this case in
                the same way as you would if you knew the size was 16x16
                (i.e., conservatively assume the largest size)?<o:p></o:p></p>
              <p>Thanks again,<o:p></o:p></p>
              <p>Hal<o:p></o:p></p>
              <p> <o:p></o:p></p>
              <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
                <p class="MsoNormal"> <o:p></o:p></p>
                <div>
                  <div style="border:none;border-top:solid #E1E1E1
                    1.0pt;padding:3.0pt 0in 0in 0in">
                    <p class="MsoNormal"
                      style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
                      <b>From:</b> Hal Finkel <a
                        href="mailto:hfinkel@anl.gov"
                        moz-do-not-send="true">&lt;hfinkel@anl.gov&gt;</a>
                      <br>
                      <b>Sent:</b> Wednesday, August 19, 2020 4:58 PM<br>
                      <b>To:</b> Luo, Yuanke <a
                        href="mailto:yuanke.luo@intel.com"
                        moz-do-not-send="true">&lt;yuanke.luo@intel.com&gt;</a>;
                      Kaylor, Andrew
                      <a href="mailto:andrew.kaylor@intel.com"
                        moz-do-not-send="true">&lt;andrew.kaylor@intel.com&gt;</a>;
                      Philip Reames
                      <a href="mailto:listmail@philipreames.com"
                        moz-do-not-send="true">&lt;listmail@philipreames.com&gt;</a>;
                      <a href="mailto:llvm-dev@lists.llvm.org"
                        moz-do-not-send="true">
                        llvm-dev@lists.llvm.org</a>; <a
                        href="mailto:florian_hahn@apple.com"
                        moz-do-not-send="true">florian_hahn@apple.com</a>;
                      Topper, Craig
                      <a href="mailto:craig.topper@intel.com"
                        moz-do-not-send="true">&lt;craig.topper@intel.com&gt;</a>;
                      Lu, Hongjiu
                      <a href="mailto:hongjiu.lu@intel.com"
                        moz-do-not-send="true">&lt;hongjiu.lu@intel.com&gt;</a><br>
                      <b>Subject:</b> Re: [llvm-dev] Intel AMX
                      programming model discussion.<o:p></o:p></p>
                  </div>
                </div>
                <p class="MsoNormal"> <o:p></o:p></p>
                <p> <o:p></o:p></p>
                <div>
                  <p class="MsoNormal">On 8/19/20 2:21 AM, Luo, Yuanke
                    wrote:<o:p></o:p></p>
                </div>
                <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
                  <p class="MsoNormal"> <o:p></o:p></p>
                  <p class="MsoNormal">Hi Hal,<o:p></o:p></p>
                  <p class="MsoNormal">There are 3 aspects to be solved.<o:p></o:p></p>
                  <p class="MsoListParagraph"
                    style="margin-left:.5in;text-indent:-.25in;mso-list:l1
                    level1 lfo2">
                    <!--[if !supportLists]--><span
                      style="mso-list:Ignore">1.<span style="font:7.0pt
                        &quot;Times New Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                      </span></span><!--[endif]-->The HW supports a maximum
                    shape of 16x16, so there are many register classes, from
                    1x1 to 16x16. We need 256 register classes.
                    <o:p></o:p></p>
                  <p class="MsoListParagraph"
                    style="margin-left:.5in;text-indent:-.25in;mso-list:l1
                    level1 lfo2">
                    <!--[if !supportLists]--><span
                      style="mso-list:Ignore">2.<span style="font:7.0pt
                        &quot;Times New Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                      </span></span><!--[endif]-->We want to support
                    variable shapes, so the compiler doesn’t know which register
                    class fits a tile shape, as the shape is only known at
                    runtime.<o:p></o:p></p>
                  <p class="MsoListParagraph"
                    style="margin-left:.5in;text-indent:-.25in;mso-list:l1
                    level1 lfo2">
                    <!--[if !supportLists]--><span
                      style="mso-list:Ignore">3.<span style="font:7.0pt
                        &quot;Times New Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                      </span></span><!--[endif]-->The tile configuration
                    configures the physical tile registers, so we need to
                    allocate the registers first; only then do we know the
                    shape of each physical tile register and can configure
                    it.<o:p></o:p></p>
                  <p class="MsoNormal">I think your suggestion is
                    helpful in reducing the complexity if we only support
                    fixed (constant) tile shapes.<o:p></o:p></p>
                  <p class="MsoNormal">-Yuanke<o:p></o:p></p>
                </blockquote>
                <p> <o:p></o:p></p>
                <p>Thanks, Yuanke.<o:p></o:p></p>
                <p>It's not clear to me that having 256 register classes
                  is, in itself, a problem. Is it?<o:p></o:p></p>
                <p>What does it mean to support variable-shape tiles in
                  this context? Do you do something other than
                  conservatively assume that they are 16x16 for
                  register-allocation purposes?<o:p></o:p></p>
                <p> -Hal<o:p></o:p></p>
                <p> <o:p></o:p></p>
                <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
                  <p class="MsoNormal"> <o:p></o:p></p>
                  <div>
                    <div style="border:none;border-top:solid #E1E1E1
                      1.0pt;padding:3.0pt 0in 0in 0in">
                      <p class="MsoNormal"
                        style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
                        <b>From:</b> Hal Finkel <a
                          href="mailto:hfinkel@anl.gov"
                          moz-do-not-send="true">&lt;hfinkel@anl.gov&gt;</a>
                        <br>
                        <b>Sent:</b> Wednesday, August 19, 2020 8:20 AM<br>
                        <b>To:</b> Kaylor, Andrew <a
                          href="mailto:andrew.kaylor@intel.com"
                          moz-do-not-send="true">&lt;andrew.kaylor@intel.com&gt;</a>;
                        Philip Reames
                        <a href="mailto:listmail@philipreames.com"
                          moz-do-not-send="true">&lt;listmail@philipreames.com&gt;</a>;
                        Luo, Yuanke
                        <a href="mailto:yuanke.luo@intel.com"
                          moz-do-not-send="true">&lt;yuanke.luo@intel.com&gt;</a>;
                        <a href="mailto:llvm-dev@lists.llvm.org"
                          moz-do-not-send="true">
                          llvm-dev@lists.llvm.org</a>; <a
                          href="mailto:florian_hahn@apple.com"
                          moz-do-not-send="true">florian_hahn@apple.com</a>;
                        Topper, Craig
                        <a href="mailto:craig.topper@intel.com"
                          moz-do-not-send="true">&lt;craig.topper@intel.com&gt;</a>;
                        Lu, Hongjiu
                        <a href="mailto:hongjiu.lu@intel.com"
                          moz-do-not-send="true">&lt;hongjiu.lu@intel.com&gt;</a><br>
                        <b>Subject:</b> Re: [llvm-dev] Intel AMX
                        programming model discussion.<o:p></o:p></p>
                    </div>
                  </div>
                  <p class="MsoNormal"> <o:p></o:p></p>
                  <p>Hi, Andy,<o:p></o:p></p>
                  <p>I don't quite understand everything that's going on
                    here. Could we model this as:<o:p></o:p></p>
                  <p> 1. Define a collection of register classes, one
                    for 2x4 tiles, one for 4x2 tiles, etc., each
                    populated with a set of tile registers. Registers
                    can have aliasing relationships (instead of worrying
                    about any kind of subregister/superregister
                    relationships -- these won't be useful anyway).<o:p></o:p></p>
                  <p> 2. Define the tile-configuration instructions so
                    that they implicitly define all of the registers in
                    all of the classes.<o:p></o:p></p>
                  <p>Then you would still need to pre-schedule the tile
                    operations as you've described, and collect the
                    configuration information in order to add the
                    ldtilecfgs, but the regular register allocator can
                    handle the allocation itself in the usual way. What
                    do you think?<o:p></o:p></p>
                  <p> -Hal<o:p></o:p></p>
                  <div>
                    <p class="MsoNormal">On 8/18/20 6:58 PM, Kaylor,
                      Andrew via llvm-dev wrote:<o:p></o:p></p>
                  </div>
                  <blockquote
                    style="margin-top:5.0pt;margin-bottom:5.0pt">
                    <p class="MsoNormal"
                      style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
                      The AMX registers are complicated. The single
                      configuration register (which is mostly used
                      implicitly, similar to MXCSR for floating point)
                      controls the shape of all the tile registers, and
                      if you change the tile configuration, every single
                      tile register is cleared. In practice, if we have
                      to change the configuration while any of the
                      tile registers are live, performance is going to
                      be terrible. We need to handle this case for
                      correctness, but users of this programming
                      interface will need to have enough awareness of
                      the performance issues and the hardware details to
                      prevent this. We’ll also want a diagnostic that
                      lets the user know when this has happened.<o:p></o:p></p>
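                    <p>A small software model makes the cost concrete. It is not AMX code: struct tile_file, model_ldtilecfg and reconfig_preserving are hypothetical names, and the model only captures the behavior described above, namely that loading any configuration zeroes every tile, so a tile that is live across a reconfiguration has to be spilled before it and reloaded after it.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Palette 1: 8 tiles, each at most 16 rows x 64 bytes = 1 KB. */
#define NUM_TILES 8
#define TILE_BYTES 1024

/* Hypothetical model of the architectural tile state. */
struct tile_file {
  uint8_t data[NUM_TILES][TILE_BYTES];
  int loads; /* how many times a configuration was loaded */
};

/* Model of ldtilecfg: installing any configuration clears every tile. */
static void model_ldtilecfg(struct tile_file *tf) {
  memset(tf->data, 0, sizeof tf->data);
  tf->loads++;
}

/* Reconfigure while one tile is live: spill it to memory first and
   reload it afterwards. This is the slow path a diagnostic should
   report to the user. */
static void reconfig_preserving(struct tile_file *tf, int live_tile,
                                uint8_t spill_slot[TILE_BYTES]) {
  memcpy(spill_slot, tf->data[live_tile], TILE_BYTES); /* spill  */
  model_ldtilecfg(tf);                                 /* clears */
  memcpy(tf->data[live_tile], spill_slot, TILE_BYTES); /* reload */
}
```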
                    <p class="MsoNormal"
                      style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
                       <o:p></o:p></p>
                    <p class="MsoNormal"
                      style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
                      When the tile configuration is set, the shape of
                      each tile is locked in, so the individual tile
                      registers aren’t interchangeable at that point. If
                      a function needs 2x4 tiles, 4x2 tiles, and 4x4
                      tiles, the configuration needs to be set with this
                      in mind. The shape isn’t explicit in every
                      instruction and intrinsic. It must be deduced. And
                      again, we’ll need a way to tell the user when
                      efficient allocation can’t be done. In practice, I
                      don’t expect any function to be using more than
                      three tile shapes.<o:p></o:p></p>
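                    <p>As an illustration of setting one configuration that covers several shapes, the sketch below fills a single 64-byte configuration for the 2x4, 4x2 and 4x4 case, reading those as rows x 32-bit columns (so 16, 8 and 16 bytes per row) and giving each shape two tile registers. The struct follows the ldtilecfg memory layout described in the ISA reference linked in the original proposal (palette id in byte 0, start_row in byte 1, sixteen 2-byte column widths at bytes 16-47, sixteen 1-byte row counts at bytes 48-63); the helper names and the register assignment are assumptions.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 64-byte tile configuration as read by ldtilecfg. */
struct tilecfg {
  uint8_t palette_id;   /* 1 = 8 tiles of up to 16 rows x 64 bytes */
  uint8_t start_row;    /* restart point for interrupted loads/stores */
  uint8_t reserved[14]; /* must be zero */
  uint16_t colsb[16];   /* bytes per row of tile i */
  uint8_t rows[16];     /* number of rows of tile i */
};

/* Hypothetical helper: lock physical tile t to a rows x colsb shape. */
static void cfg_set_tile(struct tilecfg *cfg, int t, uint8_t rows,
                         uint16_t colsb) {
  assert(t >= 0 && t < 8 && rows <= 16 && colsb <= 64);
  cfg->rows[t] = rows;
  cfg->colsb[t] = colsb;
}

/* One configuration for three shapes (rows x int32 columns):
   tmm0-tmm1 are 2x4, tmm2-tmm3 are 4x2, tmm4-tmm5 are 4x4. */
static void build_cfg(struct tilecfg *cfg) {
  memset(cfg, 0, sizeof *cfg); /* unused tiles stay 0 x 0 */
  cfg->palette_id = 1;
  cfg_set_tile(cfg, 0, 2, 4 * 4);
  cfg_set_tile(cfg, 1, 2, 4 * 4);
  cfg_set_tile(cfg, 2, 4, 2 * 4);
  cfg_set_tile(cfg, 3, 4, 2 * 4);
  cfg_set_tile(cfg, 4, 4, 4 * 4);
  cfg_set_tile(cfg, 5, 4, 4 * 4);
}
```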
                    <p class="MsoNormal"
                      style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
                       <o:p></o:p></p>
                    <p class="MsoNormal"
                      style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
                      The implication of all this is that I don’t think
                      the greedy register allocator is well suited to
                      figure all of this out. We need a special pass to
                      pre-allocate these registers. If the function is
                      written in a way that makes good performance
                      possible, it should be a relatively simple task to
                      allocate everything with minimal spilling. If it
                      isn’t possible to get good performance, we don’t
                      need to do anything especially clever. We can just
                      do something straightforward that is correct and
                      let the user know that they aren’t going to be
                      happy with the results.<o:p></o:p></p>
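                    <p>A minimal sketch of the "straightforward and correct" fallback, assuming every virtual tile already has a known shape and a live range over a linear instruction numbering (the vtile type, prealloc, and the diagnostic text are hypothetical). Each physical tile is bound to the first shape it receives, because its shape cannot change without a reconfiguration; a virtual tile that finds no free register of its shape is reported instead of being handled cleverly.</p>

```c
#include <assert.h>
#include <stdio.h>

#define NUM_TILES 8

struct shape { int rows, colsb; };

/* A virtual tile register with its shape and live range [start, end). */
struct vtile {
  struct shape shape;
  int start, end;
  int phys; /* assigned physical tile, or -1 if none was available */
};

static int same_shape(struct shape a, struct shape b) {
  return a.rows == b.rows && a.colsb == b.colsb;
}

/* vts must be sorted by start. Returns how many virtual tiles could not
   get a register without a spill or a reconfiguration. */
static int prealloc(struct vtile *vts, int n) {
  struct shape bound[NUM_TILES];   /* shape each physical tile is locked to */
  int has_shape[NUM_TILES] = {0};
  int busy_until[NUM_TILES] = {0}; /* end of the current occupant */
  int failures = 0;

  for (int i = 0; i < n; i++) {
    struct vtile *v = &vts[i];
    v->phys = -1;
    /* Prefer a free register already bound to this shape... */
    for (int t = 0; t < NUM_TILES && v->phys < 0; t++)
      if (has_shape[t] && same_shape(bound[t], v->shape) &&
          busy_until[t] <= v->start)
        v->phys = t;
    /* ...otherwise bind a register that has no shape yet. */
    for (int t = 0; t < NUM_TILES && v->phys < 0; t++)
      if (!has_shape[t]) {
        bound[t] = v->shape;
        has_shape[t] = 1;
        v->phys = t;
      }
    if (v->phys < 0) {
      failures++;
      fprintf(stderr, "note: tile %d needs a spill or reconfiguration\n", i);
      continue;
    }
    busy_until[v->phys] = v->end;
  }
  return failures;
}
```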
                    <p class="MsoNormal"> <o:p></o:p></p>
                    <p class="MsoNormal">-Andy<o:p></o:p></p>
                    <p class="MsoNormal"> <o:p></o:p></p>
                    <div>
                      <div style="border:none;border-top:solid #E1E1E1
                        1.0pt;padding:3.0pt 0in 0in 0in">
                        <p class="MsoNormal"
                          style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
                          <b>From:</b> Philip Reames <a
                            href="mailto:listmail@philipreames.com"
                            moz-do-not-send="true"><listmail@philipreames.com></a>
                          <br>
                          <b>Sent:</b> Friday, August 14, 2020 8:29 PM<br>
                          <b>To:</b> Luo, Yuanke <a
                            href="mailto:yuanke.luo@intel.com"
                            moz-do-not-send="true"><yuanke.luo@intel.com></a>;
                          <a href="mailto:llvm-dev@lists.llvm.org"
                            moz-do-not-send="true">llvm-dev@lists.llvm.org</a>;
                          <a href="mailto:florian_hahn@apple.com"
                            moz-do-not-send="true">
                            florian_hahn@apple.com</a>; Kaylor, Andrew <a
                            href="mailto:andrew.kaylor@intel.com"
                            moz-do-not-send="true">
                            <andrew.kaylor@intel.com></a>; Topper,
                          Craig <a href="mailto:craig.topper@intel.com"
                            moz-do-not-send="true">
                            <craig.topper@intel.com></a>; Lu,
                          Hongjiu <a href="mailto:hongjiu.lu@intel.com"
                            moz-do-not-send="true"><hongjiu.lu@intel.com></a><br>
                          <b>Subject:</b> Re: [llvm-dev] Intel AMX
                          programming model discussion.<o:p></o:p></p>
                      </div>
                    </div>
                    <p class="MsoNormal"> <o:p></o:p></p>
                    <p>I find your answer unconvincing.  I'm not going
                      to debate it as I don't wish to take the time to
                      build the appropriate context, but my initial
                      response is skepticism.<o:p></o:p></p>
                    <p>Philip<o:p></o:p></p>
                    <div>
                      <p class="MsoNormal">On 8/14/20 4:49 PM, Luo,
                        Yuanke wrote:<o:p></o:p></p>
                    </div>
                    <blockquote
                      style="margin-top:5.0pt;margin-bottom:5.0pt">
                      <p class="MsoNormal">[Yuanke] AMX register is
                        special. It needs to be configured before use
                        and the config instruction is expensive. To
                        avoid unnecessary tile configure, we collect the
                        tile shape information as much as possible and
                        combine them into one ldtilecfg instruction. The
                        ldtilecfg instruction should dominate any AMX
                        instruction that access tile register. On the
                        other side, the ldtilecfg should post-dominated
                        the instruction that define the tile shape. For
                        tile register spill, it should avoid re-config
                        due to the different tile shape, the spilled
                        register should be reloaded to the register that
                        share the same tile shape. Since tile register
                        allocation is special and it may allocate
                        general virtual register to configure tile
                        register, we can add a sperate pass to do it
                        before general register allocation pass. After
                        register allocation, the tile shape information
                        is not needed anymore, so we can transform the
                        pseudo AMX instruction to real AMX instruction
                        by removing the row and column operands.<o:p></o:p></p>
                      <p>[Philip]<o:p></o:p></p>
                      <p>This seems complicated.<o:p></o:p></p>
                      <p>Reading through the documentation, there
                        appears to be a single global tile config for
                        all tile registers at any time.<o:p></o:p></p>
                      <p>Why not simply model this tile config as a
                        designated special register and the tile
                        instructions as having an implicit use of this
                        register?  That would seem to ensure that the
                        register allocator has all the constraints
                        needed.  You'd need to teach it how to spill the
                        special registers with the appropriate
                        instructions, but that seems a lot more
                        straightforward?<o:p></o:p></p>
                      <p class="MsoNormal"><span
                          style="font-size:10.5pt;line-height:105%">[Yuanke]
                          In that case user need to configure the tile
                          register by themselves. Spilling configure
                          register is very expensive, because it clears
                          all the tile data register to zero. In our
                          proposal, compiler is responsible to deduce
                          the shape for virtual of tile data register,
                          allocate physical registers for them and then
                          configure those physical register. We may
                          build the dependency as you proposed and it
                          can be used for machine IR check to ensure
                          tile data register is configured before use. </span><o:p></o:p></p>
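                      <p>Philip's implicit-use idea and the machine IR check mentioned above can be sketched together: every tile instruction is treated as implicitly using a modeled tile-configuration register, and a checker walks a straight-line instruction list to confirm that a configuration is in effect at each such use and that no tile read after an ldtilecfg was last written before it (its contents would have been zeroed). The opcodes, minst, and check_block are hypothetical; a real verifier would work on machine instructions and handle control flow.</p>

```c
#include <assert.h>

enum opcode { LDTILECFG, TILELOAD, TDPBSSD, TILESTORE, OTHER };

struct minst {
  enum opcode op;
  int def;     /* tile register defined, or -1 */
  int uses[3]; /* tile registers read, -1 for unused slots */
};

/* Tile instructions carry an implicit use of the configuration. */
static int uses_tilecfg(enum opcode op) {
  return op == TILELOAD || op == TDPBSSD || op == TILESTORE;
}

/* Returns the index of the first offending instruction, or -1. */
static int check_block(const struct minst *mi, int n) {
  int configured = 0;
  int written[8] = {0}; /* tile written since the last ldtilecfg */
  for (int i = 0; i < n; i++) {
    if (mi[i].op == LDTILECFG) {
      configured = 1;
      for (int t = 0; t < 8; t++) written[t] = 0; /* contents zeroed */
      continue;
    }
    if (!uses_tilecfg(mi[i].op)) continue;
    if (!configured) return i; /* implicit use with no reaching config */
    for (int u = 0; u < 3; u++) {
      int t = mi[i].uses[u];
      if (t >= 0 && !written[t]) return i; /* value lost to a reconfig */
    }
    if (mi[i].def >= 0) written[mi[i].def] = 1;
  }
  return -1;
}
```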
                      <p class="MsoNormal"><span
                          style="font-size:10.5pt;line-height:105%"> </span><o:p></o:p></p>
                      <div>
                        <div style="border:none;border-top:solid #E1E1E1
                          1.0pt;padding:3.0pt 0in 0in 0in">
                          <p class="MsoNormal"
                            style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
                            <b>From:</b> Philip Reames <a
                              href="mailto:listmail@philipreames.com"
                              moz-do-not-send="true"><listmail@philipreames.com></a>
                            <br>
                            <b>Sent:</b> Saturday, August 15, 2020 1:17
                            AM<br>
                            <b>To:</b> Luo, Yuanke <a
                              href="mailto:yuanke.luo@intel.com"
                              moz-do-not-send="true"><yuanke.luo@intel.com></a>;
                            <a href="mailto:llvm-dev@lists.llvm.org"
                              moz-do-not-send="true">llvm-dev@lists.llvm.org</a>;
                            <a href="mailto:florian_hahn@apple.com"
                              moz-do-not-send="true">
                              florian_hahn@apple.com</a>; Kaylor, Andrew
                            <a href="mailto:andrew.kaylor@intel.com"
                              moz-do-not-send="true">
                              <andrew.kaylor@intel.com></a>;
                            Topper, Craig <a
                              href="mailto:craig.topper@intel.com"
                              moz-do-not-send="true">
                              <craig.topper@intel.com></a>; Lu,
                            Hongjiu <a
                              href="mailto:hongjiu.lu@intel.com"
                              moz-do-not-send="true"><hongjiu.lu@intel.com></a><br>
                            <b>Subject:</b> Re: [llvm-dev] Intel AMX
                            programming model discussion.<o:p></o:p></p>
                        </div>
                      </div>
                      <p class="MsoNormal"> <o:p></o:p></p>
                      <p> <o:p></o:p></p>
                      <div>
                        <p class="MsoNormal">On 8/14/20 6:27 AM, Luo,
                          Yuanke via llvm-dev wrote:<o:p></o:p></p>
                      </div>
                      <blockquote
                        style="margin-top:5.0pt;margin-bottom:5.0pt">
                        <p class="MsoNormal">Hi,<o:p></o:p></p>
                        <p class="MsoNormal">Intel Advanced Matrix
                          Extensions (Intel AMX) is a new programming
                          paradigm consisting of two components: a set
                          of 2-dimensional registers (tiles)
                          representing sub-arrays from a larger
                          2-dimensional memory image, and accelerators
                          able to operate on tiles. The capabilities of an
                          Intel AMX implementation are enumerated by
                          palettes. Two palettes are supported: palette 0
                          represents the initialized state, and palette 1
                          consists of 8 tile registers of up to 1 KB each,
                          which are controlled by a tile control
                          register.<o:p></o:p></p>
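                        <p>For reference, the palette 1 numbers work out to 8 tile registers of at most 16 rows x 64 bytes, that is 1024 bytes each. A hypothetical helper a compiler could use to reject constant shapes that exceed these limits (treating a 0 x 0 shape as an unused tile is an assumption of this sketch):</p>

```c
#include <assert.h>
#include <stdbool.h>

/* Palette 1 limits. */
enum {
  PAL1_TILES = 8,
  PAL1_MAX_ROWS = 16,
  PAL1_BYTES_PER_ROW = 64,
  PAL1_BYTES_PER_TILE = PAL1_MAX_ROWS * PAL1_BYTES_PER_ROW /* 1024 */
};

/* Hypothetical check: can a rows x colsb (bytes) tile be configured? */
static bool pal1_shape_ok(int rows, int colsb) {
  if (rows == 0 && colsb == 0) return true; /* unused tile */
  return rows > 0 && rows <= PAL1_MAX_ROWS &&
         colsb > 0 && colsb <= PAL1_BYTES_PER_ROW;
}
```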
                        <p class="MsoNormal">The instruction manual is
                          posted at <a
href="https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html"
                            moz-do-not-send="true">
https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html</a>.<o:p></o:p></p>
                        <p class="MsoNormal">The AMX abi proposal is
                          posted at <a
                            href="https://groups.google.com/g/x86-64-abi/c/NRejFm7pwb4"
                            moz-do-not-send="true">
https://groups.google.com/g/x86-64-abi/c/NRejFm7pwb4</a>.<o:p></o:p></p>
                        <p class="MsoNormal">This email is to discuss
                          the programming model for AMX. Florian has
                          introduced the matrix type and intrinsics in the
                          LLVM community. We’d like to adopt some ideas
                          from that work.<o:p></o:p></p>
                        <p class="MsoNormal">Here is what we propose for
                          the AMX programming model.<o:p></o:p></p>
                        <p class="MsoListParagraph"
                          style="margin-left:.5in;text-indent:-.25in;mso-list:l0
                          level1 lfo4">
                          <!--[if !supportLists]--><span
                            style="mso-list:Ignore">1.<span
                              style="font:7.0pt "Times New
                              Roman"">      
                            </span></span><!--[endif]--> Data type. <o:p></o:p></p>
                        <p class="MsoNormal">We’d like to have fixed
                          vector type for AMX. Since the shape to AMX
                          register can be configurable, the vector size
                          is the maximum size of AMX register. That
                          means the vector size is 1024 bytes.<o:p></o:p></p>
                        <p class="MsoNormal">The C code may look like
                          this.<o:p></o:p></p>
                        <p class="MsoNormal">typedef int _tile_data
                          __attribute__((__vector_size__(1024),
                          __aligned__(64)));<o:p></o:p></p>
                        <p class="MsoNormal">_tile_data tile;<o:p></o:p></p>
                        <p class="MsoNormal">And the LLVM IR may look
                          like this.<o:p></o:p></p>
                        <p class="MsoNormal">@tile = dso_local
                          local_unnamed_addr global <256 x i32>
                          zeroinitializer, align 64<o:p></o:p></p>
                        <p class="MsoNormal">For llvm IR, it is nice to
                          have a new type x86_amxtile that can be mapped
                          to AMX registers.<o:p></o:p></p>
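                        <p>A quick check, using only the GCC/Clang vector extension and no AMX hardware, that the proposed _tile_data typedef is 1024 bytes, i.e. 256 lanes of int, matching the 256 x i32 vector in the IR above (tile_sum is a hypothetical helper):</p>

```c
#include <assert.h>

typedef int _tile_data __attribute__((__vector_size__(1024), __aligned__(64)));

/* 1024 bytes / 4-byte int = 256 lanes, matching <256 x i32>. */
enum { TILE_LANES = sizeof(_tile_data) / sizeof(int) };

_tile_data tile; /* zero-initialized global, like @tile in the IR */

/* Sum all lanes; vector subscripting is a GCC/Clang extension. */
static long tile_sum(const _tile_data *t) {
  long s = 0;
  for (int i = 0; i < TILE_LANES; i++) s += (*t)[i];
  return s;
}
```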
                        <p class="MsoListParagraph"
                          style="margin-left:.5in;text-indent:-.25in;mso-list:l0
                          level1 lfo4">
                          <!--[if !supportLists]--><span
                            style="mso-list:Ignore">2.<span
                              style="font:7.0pt "Times New
                              Roman"">      
                            </span></span><!--[endif]-->AMX Intrinsics.
                          <o:p></o:p></p>
                        <p class="MsoNormal">The internal intrinsics are
                          1:1 mapped to AMX instructions. The parameter
                          m, n, k identifies the shape of the tile. The
                          shape can be variable, but it cannot exceed
                          the size that AMX HW can support. Compiler can
                          deduce shape of the tile from the AMX
                          intrinsics.<o:p></o:p></p>
                        <p class="MsoNormal" style="text-indent:5.5pt">_tile_data
                          _tile_loadd_internal(char m, short n, const
                          void *base, int stride);<o:p></o:p></p>
                        <p class="MsoNormal">_tile_data
                          _tile_dpbssd_internal(char m, short n, short
                          k, _tile_data dst, _tile_data src1, _tile_data
                          src2);<o:p></o:p></p>
                        <p class="MsoNormal">_tile_data
                          _tile_dpbf16ps_internal(char m, short n, short
                          k, _tile_data dst, _tile_data src1, _tile_data
                          src2);<o:p></o:p></p>
                        <p class="MsoNormal">void
                          _tile_stored_internal(char m, short n, void
                          *base, int stride, _tile_data tile);<o:p></o:p></p>
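                        <p>To make the roles of m, n and k concrete, here is a scalar reference model of what _tile_dpbssd_internal is meant to compute, written from the TDPBSSD description in the ISA reference: m is the number of rows, n and k are bytes per row (matching the src2.col and src1.col arguments the user interface below passes), and src2 packs four consecutive int8 values of the K dimension into each 32-bit column. ref_dpbssd and its flat-array arguments are illustrative assumptions.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of tdpbssd (signed int8 x signed int8 -> int32).
   dst:  m rows x n bytes, i.e. n/4 int32 accumulators per row
   src1: m rows x k bytes of int8
   src2: k/4 rows x n bytes of int8; each 4-byte column holds four
         consecutive elements of the K dimension. */
static void ref_dpbssd(int m, int n, int k, int32_t *dst,
                       const int8_t *src1, const int8_t *src2) {
  for (int i = 0; i < m; i++)
    for (int kk = 0; kk < k / 4; kk++)
      for (int j = 0; j < n / 4; j++) {
        int32_t acc = 0;
        for (int b = 0; b < 4; b++)
          acc += (int32_t)src1[i * k + kk * 4 + b] *
                 (int32_t)src2[kk * n + j * 4 + b];
        dst[i * (n / 4) + j] += acc;
      }
}
```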
                        <p class="MsoListParagraph"
                          style="margin-left:.5in;text-indent:-.25in;mso-list:l0
                          level1 lfo4">
                          <!--[if !supportLists]--><span
                            style="mso-list:Ignore">3.<span
                              style="font:7.0pt "Times New
                              Roman"">      
                            </span></span><!--[endif]-->User interfaces.<o:p></o:p></p>
                        <p class="MsoNormal">The tile shape and tile
                          data are combined into a struct in C language.
                          The shape of the tile is only allowed to be
                          initialized once. The user interface looks as
                          this.<o:p></o:p></p>
                        <p class="MsoNormal">   3  #define
                          __DEFAULT_FN_AMX    \<o:p></o:p></p>
                        <p class="MsoNormal">   4 
                          __attribute__((__always_inline__, __nodebug__,
                          __target__("amx-int8")))<o:p></o:p></p>
                        <p class="MsoNormal">   9 typedef struct
                          __tile_str {<o:p></o:p></p>
                        <p class="MsoNormal">10   const char row;<o:p></o:p></p>
                        <p class="MsoNormal">11   const short col;<o:p></o:p></p>
                        <p class="MsoNormal">12   _tile_data tile;<o:p></o:p></p>
                        <p class="MsoNormal">13 }__tile;<o:p></o:p></p>
                        <p class="MsoNormal">14<o:p></o:p></p>
                        <p class="MsoNormal">15 __DEFAULT_FN_AMX<o:p></o:p></p>
                        <p class="MsoNormal">16 void __tile_loadd(__tile
                          *dst, const void *base, long stride) {<o:p></o:p></p>
                        <p class="MsoNormal">17   dst->tile =
                          _tile_loadd_internal(dst->row, dst->col,
                          base, stride);<o:p></o:p></p>
                        <p class="MsoNormal">18 }<o:p></o:p></p>
                        <p class="MsoNormal">19<o:p></o:p></p>
                        <p class="MsoNormal">20 __DEFAULT_FN_AMX<o:p></o:p></p>
                        <p class="MsoNormal">21 void
                          __tile_dpbsud(__tile *dst, __tile src1, __tile
                          src2) {<o:p></o:p></p>
                        <p class="MsoNormal">22   dst->tile =
                          _tile_dpbssd_internal(src1.row, src2.col,
                          src1.col, dst->tile, src1.tile, src2.tile);<o:p></o:p></p>
                        <p class="MsoNormal">23 }<o:p></o:p></p>
                        <p class="MsoNormal">24<o:p></o:p></p>
                        <p class="MsoNormal">25 __DEFAULT_FN_AMX<o:p></o:p></p>
                        <p class="MsoNormal">26 void __tile_stored(void
                          *base, long stride, __tile src) {<o:p></o:p></p>
                        <p class="MsoNormal">27  
                          _tile_stored_internal(src.row, src.col, base,
                          stride, src.tile);<o:p></o:p></p>
                        <p class="MsoNormal">28 }<o:p></o:p></p>
                        <p class="MsoNormal"> <o:p></o:p></p>
                        <p class="MsoListParagraph"
                          style="margin-left:.5in;text-indent:-.25in;mso-list:l0
                          level1 lfo4">
                          <!--[if !supportLists]--><span
                            style="mso-list:Ignore">4.<span
                              style="font:7.0pt "Times New
                              Roman"">      
                            </span></span><!--[endif]-->Example code<o:p></o:p></p>
                        <p class="MsoNormal">The example shows how to
                          use the user interface in a function.
                          <o:p></o:p></p>
                        <p class="MsoNormal"> 51 void api(int cond,
                          short row, short col) {<o:p></o:p></p>
                        <p class="MsoNormal">52   __tile a = {row, col};<o:p></o:p></p>
                        <p class="MsoNormal">53   __tile b = {row, col};<o:p></o:p></p>
                        <p class="MsoNormal">54   __tile c = {row, col};<o:p></o:p></p>
                        <p class="MsoNormal">55<o:p></o:p></p>
                        <p class="MsoNormal">56   if(cond) {<o:p></o:p></p>
                        <p class="MsoNormal">57     __tile_loadd(&a,
                          buf, STRIDE);<o:p></o:p></p>
                        <p class="MsoNormal">58     __tile_loadd(&b,
                          buf, STRIDE);<o:p></o:p></p>
                        <p class="MsoNormal">59     __tile_loadd(&c,
                          buf, STRIDE);<o:p></o:p></p>
                        <p class="MsoNormal">60   } else {<o:p></o:p></p>
                        <p class="MsoNormal">61     __tile_loadd(&a,
                          buf2, STRIDE);<o:p></o:p></p>
                        <p class="MsoNormal">62     __tile_loadd(&b,
                          buf2, STRIDE);<o:p></o:p></p>
                        <p class="MsoNormal">63     __tile_loadd(&c,
                          buf2, STRIDE);<o:p></o:p></p>
                        <p class="MsoNormal">64   }<o:p></o:p></p>
                        <p class="MsoNormal"><span lang="IT">65  
                            __tile_dpbsud(&c, a, b);</span><o:p></o:p></p>
                        <p class="MsoNormal">66   __tile_stored(buf,
                          STRIDE, c);<o:p></o:p></p>
                        <p class="MsoNormal">67 }<o:p></o:p></p>
                        <p class="MsoListParagraph"
                          style="margin-left:.5in;text-indent:-.25in;mso-list:l0
                          level1 lfo4">
                          <!--[if !supportLists]--><span
                            style="mso-list:Ignore">5.<span
                              style="font:7.0pt "Times New
                              Roman"">      
                            </span></span><!--[endif]-->LLVM IR<o:p></o:p></p>
                        <p class="MsoNormal">The LLVM intrinsics IR take
                          the row and column information as the input
                          parameter, so that compiler can deduce the
                          shape of tile data. The remaining parameters
                          are what AMX instructions require. This is the
                          LLVM IR corresponding to the example code.<o:p></o:p></p>
                        <p class="MsoNormal">12 define dso_local void
                          @api(i32 %cond, i16 signext %row, i16 signext
                          %col) local_unnamed_addr #2 {<o:p></o:p></p>
                        <p class="MsoNormal">13 entry:<o:p></o:p></p>
                        <p class="MsoNormal">14   %tobool = icmp eq i32
                          %cond, 0<o:p></o:p></p>
                        <p class="MsoNormal">15   %sext = shl i16 %col,
                          8<o:p></o:p></p>
                        <p class="MsoNormal">16   %conv.i31 = ashr exact
                          i16 %sext, 8<o:p></o:p></p>
                        <p class="MsoNormal">17   br i1 %tobool, label
                          %if.else, label %if.then<o:p></o:p></p>
                        <p class="MsoNormal">18<o:p></o:p></p>
                        <p class="MsoNormal">19
                          if.then:                                         
                          ; preds = %entry<o:p></o:p></p>
                        <p class="MsoNormal">20   %0 = tail call <256
                          x i32> @llvm.x86.tileloadd64(i16 %row, i16
                          %conv.i31, i8* getelementptr inbounds ([1024 x
                          i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32)
                          #3<o:p></o:p></p>
                        <p class="MsoNormal">21   %1 = tail call <256
                          x i32> @llvm.x86.tileloadd64(i16 %row, i16
                          %conv.i31, i8* getelementptr inbounds ([1024 x
                          i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32)
                          #3<o:p></o:p></p>
                        <p class="MsoNormal">22   %2 = tail call <256
                          x i32> @llvm.x86.tileloadd64(i16 %row, i16
                          %conv.i31, i8* getelementptr inbounds ([1024 x
                          i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32)
                          #3<o:p></o:p></p>
                        <p class="MsoNormal">23   br label %if.end<o:p></o:p></p>
                        <p class="MsoNormal">24<o:p></o:p></p>
                        <p class="MsoNormal">25
                          if.else:                     
                                              ; preds = %entry<o:p></o:p></p>
                        <p class="MsoNormal">26   %3 = tail call <256
                          x i32> @llvm.x86.tileloadd64(i16 %row, i16
                          %conv.i31, i8* getelementptr inbounds ([1024 x
                          i8], [1024 x i8]* @buf2, i64 0, i64 0), i64
                          32) #3<o:p></o:p></p>
                        <p class="MsoNormal">27   %4 = tail call <256
                          x i32> @llvm.x86.tileloadd64(i16 %row, i16
                          %conv.i31, i8* getelementptr inbounds ([1024 x
                          i8], [1024 x i8]* @buf2, i64 0, i64 0), i64
                          32) #3<o:p></o:p></p>
                        <p class="MsoNormal">28   %5 = tail call <256
                          x i32> @llvm.x86.tileloadd64(i16 %row, i16
                          %conv.i31, i8* getelementptr inbounds ([1024 x
                          i8], [1024 x i8]* @buf2, i64 0, i64 0), i64
                          32) #3<o:p></o:p></p>
                        <p class="MsoNormal">29   br label %if.end<o:p></o:p></p>
                        <p class="MsoNormal">30<o:p></o:p></p>
                        <p class="MsoNormal">31
                          if.end:                                          
                          ; preds = %if.else, %if.then<o:p></o:p></p>
                        <p class="MsoNormal">32   %a.sroa.1186.0 = phi
                          <256 x i32> [ %3, %if.else ], [ %0,
                          %if.then ]<o:p></o:p></p>
                        <p class="MsoNormal">33   %b.sroa.1068.0 = phi
                          <256 x i32> [ %4, %if.else ], [ %1,
                          %if.then ]<o:p></o:p></p>
                        <p class="MsoNormal">34   %c.sroa.1149.0 = phi
                          <256 x i32> [ %5, %if.else ], [ %2,
                          %if.then ]<o:p></o:p></p>
                        <p class="MsoNormal">35   %6 = tail call <256
                          x i32> @llvm.x86.tdpbssd(i16 %row, i16
                          %conv.i31, i16 %conv.i31, <256 x i32>
                          %c.sroa.1149.0, <256 x i32>
                          %a.sroa.1186.0, <256 x i32>
                          %b.sroa.1068.0) #3<o:p></o:p></p>
                        <p class="MsoNormal">36   tail call void
                          @llvm.x86.tilestored64(i16 %row, i16
                          %conv.i31, i8* getelementptr inbounds ([1024 x
                          i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32,
                          <256 x i32> %6) #3<o:p></o:p></p>
                        <p class="MsoNormal">37   ret void<o:p></o:p></p>
                        <p class="MsoNormal">38 }<o:p></o:p></p>
                        <p class="MsoListParagraph"
                          style="margin-left:.5in;text-indent:-.25in;mso-list:l0
                          level1 lfo4">
                          <!--[if !supportLists]--><span
                            style="mso-list:Ignore">6.<span
                              style="font:7.0pt "Times New
                              Roman"">      
                            </span></span><!--[endif]-->Shape
                          propagation<o:p></o:p></p>
                        <p class="MsoNormal">When in -O0 build, some
                          general load/store for tile vector is
                          generated by front-end. We need to root from
                          AMX intrinsics to propagate the shape
                          information to the virtual tile register. If
                          the an AMX intrinsic use the result of load
                          instruction, the shape is propagated to the
                          load and the load is transformed to tile load
                          intrinsic. If the store instruction uses any
                          result of AMX intrinsic, the shape is
                          propagated to store instruction and the store
                          is transformed to tile store intrinsic<o:p></o:p></p>
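                        <p>A toy version of this propagation, over a hypothetical single-block SSA instruction list in which code[i] defines value i: shapes flow backward from AMX operands into the plain loads that feed them and forward from AMX results into the plain stores that consume them, and both are rewritten into their tile forms, iterating until nothing changes. The opcode names and propagate_shapes are assumptions; a real pass would walk LLVM IR use-def chains.</p>

```c
#include <assert.h>

enum op { LOAD, STORE, TILELOAD, TDPBSSD, TILESTORE };

struct shape { int rows, cols; };

/* Toy SSA instruction: value i is defined by code[i]. */
struct inst {
  enum op op;
  struct shape sh;      /* shape defined or stored; {0, 0} if unknown */
  int ops[3];           /* operand value numbers, -1 if unused */
  struct shape opnd[3]; /* shape each AMX operand must have */
};

static int is_amx(enum op op) {
  return op == TILELOAD || op == TDPBSSD || op == TILESTORE;
}

/* Iterate to a fixed point; returns how many loads/stores were rewritten. */
static int propagate_shapes(struct inst *code, int n) {
  int total = 0, changed;
  do {
    changed = 0;
    for (int i = 0; i < n; i++) {
      struct inst *I = &code[i];
      if (is_amx(I->op)) {
        /* Backward: a plain load feeding an AMX operand gets its shape. */
        for (int o = 0; o < 3; o++) {
          int d = I->ops[o];
          if (d >= 0 && code[d].op == LOAD) {
            code[d].op = TILELOAD;
            code[d].sh = I->opnd[o];
            changed++;
          }
        }
      } else if (I->op == STORE) {
        /* Forward: storing an AMX result makes this a tile store. */
        int d = I->ops[0];
        if (d >= 0 && is_amx(code[d].op)) {
          I->op = TILESTORE;
          I->sh = code[d].sh;
          I->opnd[0] = code[d].sh;
          changed++;
        }
      }
    }
    total += changed;
  } while (changed);
  return total;
}
```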
                        <p class="MsoListParagraph"
                          style="margin-left:.5in;text-indent:-.25in;mso-list:l0
                          level1 lfo4">
                          <!--[if !supportLists]--><span
                            style="mso-list:Ignore">7.<span
                              style="font:7.0pt "Times New
                              Roman"">      
                            </span></span><!--[endif]-->Machine IR<o:p></o:p></p>
                        <p class="MsoNormal">Since the AMX intrinsics
                          take the row and column as the input
                          parameters, we can create a pseudo instruction
                          corresponding to it. The AMX intrinsics are
                          lowered to the pseudo AMX instruction which
                          has extra row and column operands
                          corresponding to AMX intrinsic. The real AMX
                          instructions don’t need the row and column
                          operands. The row and column information
                          should be configured by ldtilecfg before
                          executing any AMX instruction.<o:p></o:p></p>
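                        <p>Once registers are assigned, the pseudo-to-real step is mechanical. A sketch assuming a hypothetical operand-list encoding in which each pseudo carries its shape operands first (row and column, plus k for the dot product, mirroring the intrinsics) and a table maps each pseudo to its real opcode; PTDPBSSD, lower_pseudo and the table are illustrative names, not LLVM's.</p>

```c
#include <assert.h>
#include <string.h>

enum opc { PTILELOADD, PTDPBSSD, PTILESTORED, TILELOADD, TDPBSSD, TILESTORED };

struct mi {
  enum opc opc;
  int nops;
  int ops[8]; /* shape values first, then register numbers */
};

/* Pseudo -> real opcode, and how many leading shape operands to drop. */
static const struct { enum opc pseudo, real; int nshape; } lower_table[] = {
  {PTILELOADD, TILELOADD, 2},   /* row, col */
  {PTDPBSSD, TDPBSSD, 3},       /* m, n, k */
  {PTILESTORED, TILESTORED, 2}, /* row, col */
};

/* Rewrite a pseudo in place; returns 1 if it was lowered. The shape is
   assumed to be recorded in the ldtilecfg data by this point. */
static int lower_pseudo(struct mi *m) {
  for (size_t i = 0; i < sizeof lower_table / sizeof lower_table[0]; i++) {
    if (lower_table[i].pseudo != m->opc) continue;
    int drop = lower_table[i].nshape;
    memmove(m->ops, m->ops + drop,
            (size_t)(m->nops - drop) * sizeof m->ops[0]);
    m->nops -= drop;
    m->opc = lower_table[i].real;
    return 1;
  }
  return 0; /* already a real instruction */
}
```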
                        <p class="MsoListParagraph"
                          style="margin-left:.5in;text-indent:-.25in;mso-list:l0
                          level1 lfo4">
                          <!--[if !supportLists]--><span
                            style="mso-list:Ignore">8.<span
                              style="font:7.0pt "Times New
                              Roman"">      
                            </span></span><!--[endif]-->Register
                          allocation<o:p></o:p></p>
                        <p class="MsoNormal">AMX registers are special: they
                          must be configured before use, and the
                          configuration instruction is expensive. To avoid
                          unnecessary tile configuration, we collect as
                          much tile shape information as possible and
                          combine it into a single ldtilecfg instruction.
                          The ldtilecfg instruction should dominate every
                          AMX instruction that accesses a tile register.
                          On the other hand, the ldtilecfg should
                          post-dominate the instructions that define the
                          tile shapes. When spilling tile registers, we
                          should avoid reconfiguration caused by a
                          differing tile shape: a spilled register should
                          be reloaded into a register that has the same
                          tile shape. Since tile register allocation is
                          special and may need general-purpose virtual
                          registers to configure the tile registers, we
                          can add a separate pass that performs it before
                          the general register allocation pass. After
                          register allocation the tile shape information
                          is no longer needed, so we can transform the
                          pseudo AMX instructions into real AMX
                          instructions by removing the row and column
                          operands.<o:p></o:p></p>
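                        <p class="MsoNormal">The allocation constraints above can be sketched on straight-line code. The Python model below is hypothetical (the class and function names are ours, control flow is ignored, and spilling is not implemented): a greedy linear scan assigns virtual tiles to tmm0-tmm7 and never gives a physical tile a second shape, so a single ldtilecfg at function entry could configure every tile that is used; it then rebuilds each instruction without its row and column operands. The row and column operands of a pseudo instruction are assumed here to describe the tile that instruction defines.</p>
<pre>
```python
from dataclasses import dataclass, field
from typing import Optional

NUM_PHYS_TILES = 8  # tmm0-tmm7


@dataclass
class PseudoTileInstr:
    """Hypothetical pseudo AMX instruction; row/col give the shape of dst."""
    opcode: str
    dst: Optional[str]                 # virtual tile defined, None for stores
    srcs: list = field(default_factory=list)
    row: int = 0
    col: int = 0


@dataclass
class RealTileInstr:
    """Real AMX instruction: physical tiles only, no shape operands."""
    opcode: str
    dst: Optional[str]
    srcs: list


def live_intervals(instrs):
    """Map each virtual tile to [shape, def index, last use index]."""
    info = {}
    for i, ins in enumerate(instrs):
        for v in ins.srcs:
            info[v][2] = i
        if ins.dst is not None:
            info[ins.dst] = [(ins.row, ins.col), i, i]
    return info


def allocate_tiles(instrs):
    """Greedy linear scan that keeps one fixed shape per physical tile.

    Returns (real instructions, {physical tile: (rows, cols)}); the mapping
    is what a single ldtilecfg at function entry would have to load.
    """
    info = live_intervals(instrs)
    phys_shape = {}   # physical index -> shape it is configured with
    busy_until = {}   # physical index -> last use of the value it holds
    assignment = {}   # virtual tile -> physical tile name
    for v, (shape, start, end) in sorted(info.items(), key=lambda kv: kv[1][1]):
        free = [p for p in range(NUM_PHYS_TILES) if start > busy_until.get(p, -1)]
        # Prefer a free tile already configured with this shape; otherwise
        # claim an unconfigured one. A tile is never reshaped, so no re-config.
        same = [p for p in free if phys_shape.get(p) == shape]
        fresh = [p for p in free if p not in phys_shape]
        if not same and not fresh:
            # A real allocator would spill here and reload into a tile
            # with the same shape; this sketch just gives up.
            raise RuntimeError(f"no tile with shape {shape} for {v}")
        p = (same or fresh)[0]
        phys_shape[p] = shape
        busy_until[p] = end
        assignment[v] = f"tmm{p}"
    # After allocation the shape is no longer needed: drop row/col.
    real = [RealTileInstr(ins.opcode, assignment.get(ins.dst),
                          [assignment[s] for s in ins.srcs])
            for ins in instrs]
    return real, {f"tmm{p}": s for p, s in sorted(phys_shape.items())}


example = [
    PseudoTileInstr("tileloadd", "%x", row=8, col=32),
    PseudoTileInstr("tilestored", None, ["%x"]),
    PseudoTileInstr("tileloadd", "%y", row=16, col=64),
    PseudoTileInstr("tileloadd", "%z", row=8, col=32),
    PseudoTileInstr("tilestored", None, ["%y"]),
    PseudoTileInstr("tilestored", None, ["%z"]),
]
real, tile_shapes = allocate_tiles(example)
```
</pre>
                        <p class="MsoNormal">In this example %y is not placed in the freed tmm0 because its shape differs, while %z, which has the same shape as %x, reuses tmm0, so the whole sequence needs only two configured tiles and one ldtilecfg.</p>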
                      </blockquote>
                      <p>This seems complicated.<o:p></o:p></p>
                      <p>Reading through the documentation, there
                        appears to be a single global tile config for
                        all tile registers at any time.<o:p></o:p></p>
                      <p>Why not simply model this tile config as a
                        designated special register and the tile
                        instructions as having an implicit use of this
                        register?  That would seem to ensure that the
                        register allocator has all the constraints
                        needed.  You'd need to teach it how to spill the
                        special registers with the appropriate
                        instructions, but that seems a lot more
                        straightforward?<o:p></o:p></p>
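                      <p>A minimal Python model of this alternative, for illustration only: ldtilecfg defines a virtual config value, each tile instruction carries an implicit use of the config value it needs, and because there is only one physical config register, any two config values whose live ranges overlap interfere, so one of them would have to be spilled (for example with sttilecfg) and reloaded with ldtilecfg. The tuple encoding and the function names are hypothetical, not LLVM APIs, and control flow is ignored.</p>
<pre>
```python
# Hypothetical model: each instruction is (opcode, cfg_def, cfg_use).
# cfg_def is the virtual config value an ldtilecfg defines (else None);
# cfg_use is the config value a tile instruction implicitly reads (else None).


def config_live_ranges(instrs):
    """Return {config value: (def index, last implicit-use index)}."""
    ranges = {}
    for i, (_, cfg_def, cfg_use) in enumerate(instrs):
        if cfg_use is not None:
            start, _ = ranges[cfg_use]
            ranges[cfg_use] = (start, i)
        if cfg_def is not None:
            ranges[cfg_def] = (i, i)
    return ranges


def interfering_pairs(ranges):
    """Pairs of config values whose live ranges overlap. With a single
    physical config register they cannot both stay resident, so the
    allocator would have to spill and reload one of them."""
    by_start = sorted(ranges.items(), key=lambda kv: kv[1][0])
    pairs = []
    for i, (a, (_, a_end)) in enumerate(by_start):
        for b, (b_start, _) in by_start[i + 1:]:
            if a_end > b_start:   # b is defined while a is still live
                pairs.append((a, b))
    return pairs


program = [
    ("ldtilecfg", "cfgA", None),
    ("tileloadd", None, "cfgA"),
    ("ldtilecfg", "cfgB", None),
    ("tileloadd", None, "cfgB"),
    ("tdpbssd", None, "cfgA"),   # still needs cfgA after cfgB was loaded
]
ranges = config_live_ranges(program)
conflicts = interfering_pairs(ranges)
```
</pre>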
                      <blockquote
                        style="margin-top:5.0pt;margin-bottom:5.0pt">
                        <p class="MsoListParagraph"
                          style="margin-left:.5in;text-indent:-.25in;mso-list:l0
                          level1 lfo4">
                          <!--[if !supportLists]--><span
                            style="mso-list:Ignore">9.<span
                              style="font:7.0pt &quot;Times New
                              Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                            </span></span><!--[endif]-->Usage
                          recommendations<o:p></o:p></p>
                        <p class="MsoNormal">Because of the shape
                          configuration issue, we recommend that users
                          define tile shapes at the function entry and
                          inline functions as much as possible. The AMX
                          instructions focus on computation rather than
                          storage, so using global variables for tile data
                          is not recommended.<o:p></o:p></p>
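                        <p class="MsoNormal">A toy check of why defining shapes early helps, for straight-line code only and ignoring calls: one ldtilecfg is enough when some point comes after every shape definition and before every AMX instruction. The function name and the "shape"/"tile" markers are hypothetical.</p>
<pre>
```python
def single_config_point(ops):
    """ops is a list of markers: "shape" for an instruction that defines a
    row/column value, "tile" for an AMX instruction, anything else ignored.
    Returns the index where one ldtilecfg could go (after all shape
    definitions, before all tile instructions), or None if none exists."""
    last_shape = max((i for i, op in enumerate(ops) if op == "shape"), default=-1)
    first_tile = min((i for i, op in enumerate(ops) if op == "tile"), default=len(ops))
    if first_tile > last_shape:
        return last_shape + 1
    return None


# Shapes defined at function entry: one ldtilecfg before the first tile op.
shapes_at_entry = ["shape", "shape", "add", "tile", "tile"]
# A shape computed after tile code has started: no single point works.
shapes_late = ["shape", "tile", "shape", "tile"]
```
</pre>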
                        <p class="MsoNormal"><span
                            style="font-size:10.5pt;line-height:105%"> </span><o:p></o:p></p>
                        <p class="MsoNormal"><span
                            style="font-size:10.5pt;line-height:105%">Thanks</span><o:p></o:p></p>
                        <p class="MsoNormal"><span
                            style="font-size:10.5pt;line-height:105%">Yuanke</span><o:p></o:p></p>
                        <pre>_______________________________________________<o:p></o:p></pre>
                        <pre>LLVM Developers mailing list<o:p></o:p></pre>
                        <pre><a href="mailto:llvm-dev@lists.llvm.org" moz-do-not-send="true">llvm-dev@lists.llvm.org</a><o:p></o:p></pre>
                        <pre><a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" moz-do-not-send="true">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><o:p></o:p></pre>
                      </blockquote>
                    </blockquote>
                  </blockquote>
                  <pre>-- <o:p></o:p></pre>
                  <pre>Hal Finkel<o:p></o:p></pre>
                  <pre>Lead, Compiler Technology and Programming Languages<o:p></o:p></pre>
                  <pre>Leadership Computing Facility<o:p></o:p></pre>
                  <pre>Argonne National Laboratory<o:p></o:p></pre>
                </blockquote>
              </blockquote>
            </blockquote>
          </blockquote>
        </blockquote>
      </div>
    </blockquote>
  </body>
</html>