<html><head>

<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">

  </head>

  <body>

    <p><br>

    </p>

    <div class="moz-cite-prefix">On 9/4/20 8:50 AM, Luo, Yuanke wrote:<br>

    </div>

    <blockquote type="cite" cite="mid:SN6PR11MB3135FD8ECCEAE494295CE9759A2D0@SN6PR11MB3135.namprd11.prod.outlook.com">


      <div class="WordSection1">

        <p class="MsoNormal">Fix typo<o:p></o:p></p>

        <p class="MsoNormal"><o:p> </o:p></p>

        <div>

          <div style="border:none;border-top:solid #E1E1E1

            1.0pt;padding:3.0pt 0in 0in 0in">

            <p class="MsoNormal"

              style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">

              <b>From:</b> Luo, Yuanke <br>

              <b>Sent:</b> Friday, September 4, 2020 9:47 PM<br>

              <b>To:</b> 'Hal Finkel' <a class="moz-txt-link-rfc2396E" href="mailto:hfinkel@anl.gov"><hfinkel@anl.gov></a>; Topper,

              Craig <a class="moz-txt-link-rfc2396E" href="mailto:craig.topper@intel.com"><craig.topper@intel.com></a>; Kaylor, Andrew

              <a class="moz-txt-link-rfc2396E" href="mailto:andrew.kaylor@intel.com"><andrew.kaylor@intel.com></a>; Philip Reames

              <a class="moz-txt-link-rfc2396E" href="mailto:listmail@philipreames.com"><listmail@philipreames.com></a>;

              <a class="moz-txt-link-abbreviated" href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>; <a class="moz-txt-link-abbreviated" href="mailto:florian_hahn@apple.com">florian_hahn@apple.com</a>; Lu,

              Hongjiu <a class="moz-txt-link-rfc2396E" href="mailto:hongjiu.lu@intel.com"><hongjiu.lu@intel.com></a><br>

              <b>Subject:</b> RE: [llvm-dev] Intel AMX programming model

              discussion.<o:p></o:p></p>

          </div>

        </div>

        <p class="MsoNormal"><o:p> </o:p></p>

        <p class="MsoNormal">Hi Hal,<o:p></o:p></p>

        <p class="MsoNormal">Generally, your proposal to adapt tile RA
          to Greedy RA looks good to me. Thank you! I plan to prototype
          the proposal. Since there are three register allocators in the
          LLVM infrastructure, we need three schemes, one to adapt tile RA
          to each existing RA. Would you like to finalize the three schemes
          first, or would you like to review the rest of the AMX programming
          model? We have some limitations in supporting dynamic shapes, and
          I’d like to hear your advice. A dynamic shape requires that the
          ldtilecfg post-dominate the point that defines the shape, so we
          encourage users to define their shapes at the entry of the
          function. Take the code below as an example. Ideally, we hope to
          insert ldtilecfg at line 57 to configure a, b, and c, but in this
          function c’s shape {row, col} is defined in each if/else
          clause, so at line 57 the shape of c is unknown. Do you have
          any advice for this problem?</p>

      </div>

    </blockquote>

    <p>In the example below, I'm going to assume that the function calls

      are actually to get_row1() and get_row2(), neither of which can be

      hoisted.</p>

    <p>Just to think about this: First, we're starting the MIR with

      intrinsics that take the shape parameters directly. Now you need

      to:</p>

    <p> 1. Identify "configuration regions". Because reconfiguring must

      be done for all registers at once, and because reconfiguring zeros

      all of the tile registers, each configuration region is a

      connected component in the union of the live ranges of all virtual

      tile registers. Thus, first collect the configuration regions via

      trivial clustering (two instructions are part of the same

      configuration region if they share any live range of a tile

      register).</p>

    <p> 2. If the region will require more than eight distinct shapes,
      then you'll need to calculate a min cut of the region and split the
      region by inserting spills/restores, so that each resulting region
      requires at most eight shapes.<br>

    </p>

    <p> 3. If you do it this way, all of the instructions in your code

      below will be part of one, big configuration region. Generally,

      you want to put the ldtilecfg at the common dominating point of

      all of the tile instructions in the region. Now, as you point out

      in your example below, we can't simply put the ldtilecfg at the

      common dominating point: that point might not actually be

      dominated by the definitions of all of the shape inputs needed.<br>

    </p>

    <p> 4. One thing that you might do is iterative splitting. If not

      all of the definitions of the shape inputs dominate the desired

      insertion point, first you might try iteratively hoisting the

      defining instructions to make it so the definitions do dominate.

      If they still don't, then split the ldtilecfg into each successor

      of the desired insertion point. Do this recursively until, for

      each ldtilecfg, the inputs for each dynamic-shape tile register

      size dominate the insertion point.</p>

    <p> 5. This procedure, alone, might fail in the case where the

      ldtilecfg is sunk past the point of definition of one of the tile

      registers. Imagine, in your example below, that there was some use

      of the tile registers a and b before the if. In that case, you'll

      need to split those live ranges by spilling into memory around the

      desired ldtilecfg insertion point. That creates a new

      configuration region that you'll insert into the queue of

      configuration regions to process.<br>

    </p>
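    <p>Steps 1 and 2 above can be sketched outside of LLVM as a small
      union-find clustering. This is only an illustrative model, not LLVM
      code: instructions are plain indices, each virtual tile register
      carries the list of instruction indices its live range covers, and
      the names TileVReg, UnionFind, collectRegions and needsSplit are
      hypothetical.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <numeric>
#include <set>
#include <utility>
#include <vector>

// Hypothetical model: a tile shape is a (rows, columns) pair.
using Shape = std::pair<int, int>;

// One virtual tile register: its shape and the indices of the
// instructions covered by its live range.
struct TileVReg {
  Shape shape;
  std::vector<std::size_t> liveInstrs;
};

// Minimal union-find over instruction indices.
struct UnionFind {
  std::vector<std::size_t> parent;
  explicit UnionFind(std::size_t n) : parent(n) {
    std::iota(parent.begin(), parent.end(), std::size_t{0});
  }
  std::size_t find(std::size_t x) {
    while (parent[x] != x) {
      parent[x] = parent[parent[x]]; // path halving
      x = parent[x];
    }
    return x;
  }
  void unite(std::size_t a, std::size_t b) { parent[find(a)] = find(b); }
};

// Step 1: instructions that share a tile live range end up in the same
// configuration region. The result maps each region (keyed by its
// representative instruction) to the set of distinct shapes it uses.
std::map<std::size_t, std::set<Shape>>
collectRegions(std::size_t numInstrs, const std::vector<TileVReg> &vregs) {
  UnionFind uf(numInstrs);
  for (const TileVReg &v : vregs)
    for (std::size_t idx : v.liveInstrs) {
      assert(idx < numInstrs && "live range refers to an unknown instruction");
      uf.unite(v.liveInstrs.front(), idx);
    }

  std::map<std::size_t, std::set<Shape>> regions;
  for (const TileVReg &v : vregs)
    if (!v.liveInstrs.empty())
      regions[uf.find(v.liveInstrs.front())].insert(v.shape);
  return regions;
}

// Step 2: a region must be cut when it uses more than eight shapes.
bool needsSplit(const std::set<Shape> &shapes) { return shapes.size() > 8; }
```

    <p>Two registers whose live ranges overlap on at least one instruction
      land in the same region; computing the actual min cut for an
      oversized region is left out of this sketch.</p>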

    <p>I'm sure that this is not the only possible heuristic. This would

      be easier, I think, if the hardware did not zero all of the

      registers when you reconfigured any of them, but I suppose that it

      is what it is at this point.</p>
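    <p>The iterative splitting in step 4 can be illustrated with a toy
      CFG. This is a sketch under simplifying assumptions, not the actual
      pass: dominance is tracked per basic block only, the dominator tree
      is given as an immediate-dominator array, each shape input lists the
      blocks that contain a definition of it, and an ldtilecfg placed in a
      block is assumed to go after any definitions in that same block.
      Hoisting and the spilling of step 5 are not modeled. The names Cfg,
      ShapeInput and placeConfig are hypothetical.</p>

```cpp
#include <set>
#include <vector>

// Toy CFG: block ids are indices into both vectors.
struct Cfg {
  std::vector<int> idom;               // immediate dominator, -1 for entry
  std::vector<std::vector<int>> succs; // CFG successors of each block
};

// True if block a dominates block b (every block dominates itself).
bool dominates(const Cfg &cfg, int a, int b) {
  for (int n = b; n != -1; n = cfg.idom[n])
    if (n == a)
      return true;
  return false;
}

// Blocks that hold a definition of one shape input (row or column).
using ShapeInput = std::vector<int>;

// An input is available at a block if one of its definitions dominates it.
bool inputsAvailable(const Cfg &cfg, int block,
                     const std::vector<ShapeInput> &inputs) {
  for (const ShapeInput &defs : inputs) {
    bool available = false;
    for (int d : defs)
      available = available || dominates(cfg, d, block);
    if (!available)
      return false;
  }
  return true;
}

// Try to place the config in `block`; if some shape input is not yet
// available there, push the placement into each successor instead.
void placeConfig(const Cfg &cfg, int block,
                 const std::vector<ShapeInput> &inputs,
                 std::set<int> &visited, std::set<int> &placed) {
  if (!visited.insert(block).second)
    return; // already handled; also guards against loops
  if (inputsAvailable(cfg, block, inputs)) {
    placed.insert(block);
    return;
  }
  for (int s : cfg.succs[block])
    placeConfig(cfg, s, inputs, visited, placed);
}
```

    <p>For a diamond-shaped CFG like the kernel example (entry, then,
      else, join) where one shape is defined separately in the then and
      else blocks, starting at the entry block yields one placement in
      each branch.</p>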

    <p> -Hal<br>

    </p>

    <p><br>

    </p>

    <blockquote type="cite"

cite="mid:SN6PR11MB3135FD8ECCEAE494295CE9759A2D0@SN6PR11MB3135.namprd11.prod.outlook.com">

      <div class="WordSection1">

        <p class="MsoNormal"><o:p></o:p></p>

        <p class="MsoNormal">52 void kernel(int cond) {<o:p></o:p></p>

        <p class="MsoNormal">53   _tile a = {row, 8};<o:p></o:p></p>

        <p class="MsoNormal">54   _tile b = {8, col};<o:p></o:p></p>

        <p class="MsoNormal">55<o:p></o:p></p>

        <p class="MsoNormal">56   // copy shape to stack slot<o:p></o:p></p>

        <p class="MsoNormal">57   // ldtilecfg a, b, c<o:p></o:p></p>

        <p class="MsoNormal">58   if(cond) {<o:p></o:p></p>

        <p class="MsoNormal">59     short row = get_row();<o:p></o:p></p>

        <p class="MsoNormal">60     short col = get_row();<o:p></o:p></p>

        <p class="MsoNormal">61     _tile c = {row, col};<o:p></o:p></p>

        <p class="MsoNormal">62     __tile_loadd(&a, buf, STRIDE);<o:p></o:p></p>

        <p class="MsoNormal">63     __tile_loadd(&b, buf, STRIDE);<o:p></o:p></p>

        <p class="MsoNormal">64     __tile_loadd(&c, buf, STRIDE);<o:p></o:p></p>

        <p class="MsoNormal">65   } else {<o:p></o:p></p>

        <p class="MsoNormal">66     short row = get_row();<o:p></o:p></p>

        <p class="MsoNormal">67     short col = get_row();<o:p></o:p></p>

        <p class="MsoNormal">68     _tile c = {row, col};<o:p></o:p></p>

        <p class="MsoNormal">69     __tile_loadd(&a, buf2, STRIDE);<o:p></o:p></p>

        <p class="MsoNormal">70     __tile_loadd(&b, buf2, STRIDE);<o:p></o:p></p>

        <p class="MsoNormal">71     __tile_loadd(&c, buf2, STRIDE);<o:p></o:p></p>

        <p class="MsoNormal">72   }<o:p></o:p></p>

        <p class="MsoNormal">73   __tile_dpbsud(&c, a, b);<o:p></o:p></p>

        <p class="MsoNormal">74   __tile_stored(buf, STRIDE, c);<o:p></o:p></p>

        <p class="MsoNormal">75 }<o:p></o:p></p>

        <p class="MsoNormal"><o:p> </o:p></p>

        <p class="MsoNormal">Thanks<o:p></o:p></p>

        <p class="MsoNormal">Yuanke<o:p></o:p></p>

        <div>

          <div style="border:none;border-top:solid #E1E1E1

            1.0pt;padding:3.0pt 0in 0in 0in">

            <p class="MsoNormal"

              style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">

              <b>From:</b> Hal Finkel <<a

                href="mailto:hfinkel@anl.gov" moz-do-not-send="true">hfinkel@anl.gov</a>>

              <br>

              <b>Sent:</b> Friday, September 4, 2020 5:59 PM<br>

              <b>To:</b> Luo, Yuanke <<a

                href="mailto:yuanke.luo@intel.com"

                moz-do-not-send="true">yuanke.luo@intel.com</a>>;

              Topper, Craig <<a href="mailto:craig.topper@intel.com"

                moz-do-not-send="true">craig.topper@intel.com</a>>;

              Kaylor, Andrew <<a

                href="mailto:andrew.kaylor@intel.com"

                moz-do-not-send="true">andrew.kaylor@intel.com</a>>;

              Philip Reames <<a

                href="mailto:listmail@philipreames.com"

                moz-do-not-send="true">listmail@philipreames.com</a>>;

              <a href="mailto:llvm-dev@lists.llvm.org"

                moz-do-not-send="true">llvm-dev@lists.llvm.org</a>; <a

                href="mailto:florian_hahn@apple.com"

                moz-do-not-send="true">

                florian_hahn@apple.com</a>; Lu, Hongjiu <<a

                href="mailto:hongjiu.lu@intel.com"

                moz-do-not-send="true">hongjiu.lu@intel.com</a>><br>

              <b>Subject:</b> Re: [llvm-dev] Intel AMX programming model

              discussion.<o:p></o:p></p>

          </div>

        </div>

        <p class="MsoNormal"><o:p> </o:p></p>

        <p><o:p> </o:p></p>

        <div>

          <p class="MsoNormal">On 9/4/20 3:37 AM, Luo, Yuanke wrote:<o:p></o:p></p>

        </div>

        <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

          <p class="MsoNormal">Hi Hal,<o:p></o:p></p>

          <p class="MsoNormal">Thank you for the ideas that help us to
            improve the design, and sorry for replying late. There are
            some things I am not able to figure out, and tile RA has some
            special traits.<o:p></o:p></p>

        </blockquote>

        <p><o:p> </o:p></p>

        <p>You're quite welcome.<o:p></o:p></p>

        <p><o:p> </o:p></p>

        <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

          <p class="MsoListParagraph"

            style="margin-left:.5in;text-indent:-.25in;mso-list:l0

            level1 lfo2">

            1. X86RegisterInfo::getRegAllocationHints
            can tell RA which physical register is preferred, but it
            can’t force RA to allocate only the hinted register. If the
            hint cannot be satisfied, RA will allocate another
            register.<o:p></o:p></p>

        </blockquote>

        <p><o:p> </o:p></p>

        <p>I addressed this below, but I could have been clearer. Like

          SystemZRegisterInfo::getRegAllocationHints does sometimes,

          when hinting the tile registers, the function will return

          true. This turns the preference into a hard constraint, and

          the allocator will not allocate any other register. That's my

          understanding from reading the code.<o:p></o:p></p>

        <p><o:p> </o:p></p>

        <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

          <p class="MsoListParagraph"

            style="margin-left:.5in;text-indent:-.25in;mso-list:l0

            level1 lfo2">

            2. The shape information should
            be attached to each virtual register and to each physical
            register that is allocated. How can we store and retrieve the
            shape information with limited changes to the existing RA?<o:p></o:p></p>

        </blockquote>

        <p><o:p> </o:p></p>

        <p>For each virtual register, getRegAllocationHints could just

          recompute the shape information. If this isn't a constant-time

          operation, however, you'll probably want to cache the computed

          shape requirements in X86MachineFunctionInfo. You can add a

          map from registers to shape information in that class, and

          access it from getRegAllocationHints. You can store

          information about the physical registers there too.<o:p></o:p></p>

        <p>Regarding the physical registers, you can grab this

          information in the pre-rewrite phase. Override addPreRewrite

          in X86TargetMachine.cpp. You'll need a small pass that records

          relevant information about the assignments (which, I imagine,

          is the same small pass that updates the LDTILECFG

          instructions). For an example of such a pass, see

          AMDGPU/GCNNSAReassign.cpp<o:p></o:p></p>
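        <p>As a rough illustration of the caching and hinting idea, here
          is a standalone model rather than the real LLVM hook. The struct
          TileShapeInfo stands in for whatever would be stored in
          X86MachineFunctionInfo, and getTileAllocationHints only mirrors
          the boolean convention described above (returning true means the
          hints are a hard constraint). Both names, and the use of plain
          integers for registers, are assumptions made for this sketch.</p>

```cpp
#include <map>
#include <utility>
#include <vector>

using Shape = std::pair<int, int>; // (rows, columns), hypothetical encoding

constexpr int NumTileRegs = 8; // tmm0..tmm7

// Stand-in for the per-function cache of shape information.
struct TileShapeInfo {
  std::map<unsigned, Shape> vregShape; // shape required by each virtual reg
  std::map<int, Shape> physShape;      // shape each tmm register is bound to
};

// Fills `hints` with every tile register that is either still unbound or
// already bound to the shape this virtual register needs. Returns true to
// signal that the list is a hard constraint, false for non-tile registers.
bool getTileAllocationHints(unsigned vreg, const TileShapeInfo &info,
                            std::vector<int> &hints) {
  auto wanted = info.vregShape.find(vreg);
  if (wanted == info.vregShape.end())
    return false; // not a tile register: leave the default order alone

  for (int reg = 0; reg < NumTileRegs; ++reg) {
    auto bound = info.physShape.find(reg);
    if (bound == info.physShape.end() || bound->second == wanted->second)
      hints.push_back(reg);
  }
  return true;
}
```

        <p>With tmm0 already bound to a different shape, the hint list for
          a virtual register contains only the remaining seven registers,
          so the allocator never has to reconfigure tmm0 inside the
          region.</p>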

        <p><o:p> </o:p></p>

        <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

          <p class="MsoListParagraph"

            style="margin-left:.5in;text-indent:-.25in;mso-list:l0

            level1 lfo2">

            3. When a tile register is
            spilled, the shape should also be bound to the corresponding
            spill stack slot, so that the value can be reloaded into a
            physical tile register with the same shape.<o:p></o:p></p>

        </blockquote>

        <p><o:p> </o:p></p>

        <p>I'm not sure what you mean. If you don't want to just be

          conservative about the spill size allocation, you do need to

          know the shape in order to compute the spill-location size. I

          assume that you can grab that out of X86MachineFunctionInfo

          from storeRegToStackSlot/loadRegFromStackSlot or

          eliminateFrameIndex (or copyPhysReg) as needed.<o:p></o:p></p>
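        <p>To make the size trade-off concrete, here is a small sketch. It
          assumes the AMX palette in which a tile holds at most 16 rows of
          64 bytes each, and that the cached shape stores the column width
          in bytes; the function names are hypothetical.</p>

```cpp
#include <cstddef>

constexpr std::size_t MaxTileRows = 16;     // assumed palette limit
constexpr std::size_t MaxTileColBytes = 64; // assumed palette limit

// Conservative choice: reserve room for the largest possible tile.
constexpr std::size_t conservativeSpillSize() {
  return MaxTileRows * MaxTileColBytes;
}

// Exact choice: needs the shape of the spilled register.
constexpr std::size_t spillSize(std::size_t rows, std::size_t colBytes) {
  return rows * colBytes;
}
```

        <p>The conservative slot is 1024 bytes per tile regardless of
          shape, while a 3-row, 16-byte-wide tile would need only 48
          bytes.</p>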

        <p><o:p> </o:p></p>

        <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

          <p class="MsoListParagraph"

            style="margin-left:.5in;text-indent:-.25in;mso-list:l0

            level1 lfo2">

            4. There is no mov/copy
            instruction for tile registers. To copy a tile register, we
            need to store it to memory and load the data
            from memory into another register. So a lot of the live
            interval splitting code in Greedy RA is unnecessary for tile
            register allocation.<o:p></o:p></p>

        </blockquote>

        <p><o:p> </o:p></p>

        <p>Yes, but this just means that you need to support copying

          through memory. Setting CopyCost = -1 in X86RegisterInfo.td

          might help as well.<o:p></o:p></p>

        <p><o:p> </o:p></p>

        <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

          <p class="MsoListParagraph"

            style="margin-left:.5in;text-indent:-.25in;mso-list:l0

            level1 lfo2">

            5. The compiler can support register
            spills, but spills should be avoided for performance.
            We prefer reporting a warning on register spill, so that users
            notice it and can adjust their code to avoid the
            spill.<o:p></o:p></p>

        </blockquote>

        <p><o:p> </o:p></p>

        <p>If you want to emit a diagnostic, you may be able to do that

          from storeRegToStackSlot. In any case, please make use of the

          optimization-remark infrastructure. For an example of how to

          do this, see RAGreedy::reportNumberOfSplillsReloads in

          RegAllocGreedy.cpp.<o:p></o:p></p>

        <p><o:p> </o:p></p>

        <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

          <p class="MsoNormal"> <o:p></o:p></p>

          <p class="MsoNormal">If there is no easy way to take
            advantage of the current RA infrastructure, there are some pros
            to having a separate RA for tile registers.<o:p></o:p></p>

          <p class="MsoListParagraph"

            style="margin-left:.5in;text-indent:-.25in;mso-list:l1

            level1 lfo4">

            1. We can limit the risk of breaking
            RA for general registers on each arch. If there are bugs
            in tile RA, only applications that use AMX are affected.<o:p></o:p></p>

        </blockquote>

        <p><o:p> </o:p></p>

        <p>That's true. But I also worry about that. Any time you need

          to write non-trivial code that will be used relatively rarely,

          it's likely to have bugs that take a long time to show up. If

          you can plug into the generic infrastructure, you benefit from

          the fact that it's highly-covered, often-used code. Not that

          you might not run into bugs, of course, especially if you're

          using it in a new way, but the base logic is likely to already

          be robust.<o:p></o:p></p>

        <p><o:p> </o:p></p>

        <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

          <p class="MsoListParagraph"

            style="margin-left:.5in;text-indent:-.25in;mso-list:l1

            level1 lfo4">

            2. We can customize the special
            traits (config, split, spill) of tile registers more freely in
            the separate RA.<o:p></o:p></p>

        </blockquote>

        <p><o:p> </o:p></p>

        <p>True.<o:p></o:p></p>

        <p> -Hal<o:p></o:p></p>

        <p><o:p> </o:p></p>

        <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

          <p class="MsoNormal"> <o:p></o:p></p>

          <p class="MsoNormal">For RegAllocFast, I agree with you. Each
            configuration region is small, and since performance is
            not the first priority, we can insert multiple configs, one for
            each small region.<o:p></o:p></p>
          <p class="MsoNormal">As you recommend looking at the PBQP
            solver, I’ll take some time to investigate it and get back to
            you.<o:p></o:p></p>

          <p class="MsoNormal"> <o:p></o:p></p>

          <p class="MsoNormal">Thanks<o:p></o:p></p>

          <p class="MsoNormal">-Yuanke<o:p></o:p></p>

          <p class="MsoNormal"> <o:p></o:p></p>

          <p class="MsoNormal"> <o:p></o:p></p>

          <div>

            <div style="border:none;border-top:solid #E1E1E1

              1.0pt;padding:3.0pt 0in 0in 0in">

              <p class="MsoNormal"

                style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">

                <b>From:</b> Hal Finkel <a

                  href="mailto:hfinkel@anl.gov" moz-do-not-send="true"><hfinkel@anl.gov></a>

                <br>

                <b>Sent:</b> Monday, August 24, 2020 5:03 PM<br>

                <b>To:</b> Luo, Yuanke <a

                  href="mailto:yuanke.luo@intel.com"

                  moz-do-not-send="true"><yuanke.luo@intel.com></a>;

                Topper, Craig

                <a href="mailto:craig.topper@intel.com"

                  moz-do-not-send="true"><craig.topper@intel.com></a>;

                Kaylor, Andrew

                <a href="mailto:andrew.kaylor@intel.com"

                  moz-do-not-send="true"><andrew.kaylor@intel.com></a>;

                Philip Reames

                <a href="mailto:listmail@philipreames.com"

                  moz-do-not-send="true"><listmail@philipreames.com></a>;

                <a href="mailto:llvm-dev@lists.llvm.org"

                  moz-do-not-send="true">

                  llvm-dev@lists.llvm.org</a>; <a

                  href="mailto:florian_hahn@apple.com"

                  moz-do-not-send="true">florian_hahn@apple.com</a>; Lu,

                Hongjiu

                <a href="mailto:hongjiu.lu@intel.com"

                  moz-do-not-send="true"><hongjiu.lu@intel.com></a><br>

                <b>Subject:</b> Re: [llvm-dev] Intel AMX programming

                model discussion.<o:p></o:p></p>

            </div>

          </div>

          <p class="MsoNormal"> <o:p></o:p></p>

          <p>Hi, Yuanke,<o:p></o:p></p>

          <p>Thanks for writing this up. Let me back up a bit because

            the scheme I proposed last week doesn't work without further

            modification: within a particular "configuration region"

            (i.e., the code in between the LDTILECFG and the TILERELEASE

            (or next LDTILECFG)), each tile register can only be used

            with one shape, and in addition, no register can have its

            shape changed without zeroing out all of the tile registers.

            Thus, just using different register classes for the

            different shapes, as I had suggested, isn't sufficient to

            model the allocation requirements. That would not prevent

            the same register from essentially being assigned to

            differently-shaped virtual registers with non-overlapping

            live ranges within one configuration region.<o:p></o:p></p>

          <p>Also, as you point out, when multiple non-static tile

            shapes are in use, if you use one register class for each

            shape, you would need different register classes for these

            too. Luckily, I don't think that using the separate register

            classes actually buys us anything, so please disregard that

            suggestion of mine. Use only one register class.<o:p></o:p></p>

          <p>Once the configuration regions are identified, you'll know

            how many tile register shapes are required. If this number

            is greater than eight, then you'll need to cut the region

            (requiring all live tiles to be spilled and restored around

            each re-configuration point). After that, we'll assume that

            we have eight or fewer distinct shapes.<o:p></o:p></p>

          <p>Now the problem is that you need to allocate registers,

            satisfying all of the usual constraints (non-overlapping

            live ranges, etc.), but with an additional constraint: once

            a physical register has been used with some particular tile

            shape, it cannot be assigned to any other tile shape.<o:p></o:p></p>

          <p>I think that the current infrastructure can support this as

            follows:<o:p></o:p></p>

          <p> 1. Add an override X86RegisterInfo::getRegAllocationHints.

            Like SystemZRegisterInfo::getRegAllocationHints does

            sometimes, when hinting the tile registers, the function

            will return true (to indicate a hard constraint). As

            registers are assigned in RegAllocGreedy,

            getRegAllocationHints is called for each virtual register.

            For virtual tile registers, look at the passed VirtRegMap,

            etc. for already-assigned tile virtual registers with

            different shape requirements than the current virtual register

            (you'll need to cache the shape requirements in

            X86MachineFunctionInfo for this to be efficient), and return

            a hints list consisting of all other non-reserved tile

            registers.<o:p></o:p></p>

          <p> 2. To support RegAllocFast, which doesn't use

            getRegAllocationHints, you would need to make the

            configuration regions small enough that it doesn't matter

            (and if you're doing this around every tile instruction,

            this is automatically true).<o:p></o:p></p>

          <p> 3. To support RegAllocPBQP (which is likely a good thing

            to do, but probably not required), I believe you can support

            this by adding custom constraints to the solver (kind of

            like what AArch64PBQPRegAlloc.cpp does).<o:p></o:p></p>

          <p>Once the allocation process is complete, you'll need to go

            back and update the LDTILECFG data to reflect the chosen

            shape -> register mapping.<o:p></o:p></p>
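          <p>For illustration, the final update step might look like the
            following standalone sketch. It assumes the 64-byte LDTILECFG
            memory layout from Intel's ISA documentation (palette id in
            byte 0, a 16-bit bytes-per-row field per tile starting at byte
            16, and an 8-bit row count per tile starting at byte 48) and a
            hypothetical map from physical tile register number to its
            (rows, bytes-per-row) shape. The layout should be checked
            against the ISA reference before relying on it.</p>

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// Builds the 64-byte configuration block from the final
// register -> (rows, bytesPerRow) assignment chosen by the allocator.
std::array<std::uint8_t, 64>
buildTileConfig(const std::map<int, std::pair<int, int>> &physShape) {
  std::array<std::uint8_t, 64> cfg{}; // unused tiles stay zero
  cfg[0] = 1;                         // palette 1 (assumed)
  for (const auto &entry : physShape) {
    const int reg = entry.first;
    const int rows = entry.second.first;
    const int colsb = entry.second.second;
    assert(reg >= 0 && reg < 8 && "only tmm0..tmm7 exist");
    assert(rows >= 0 && rows <= 16 && colsb >= 0 && colsb <= 64);
    // 16-bit little-endian bytes-per-row field for this tile.
    cfg[16 + 2 * reg] = static_cast<std::uint8_t>(colsb & 0xff);
    cfg[16 + 2 * reg + 1] = static_cast<std::uint8_t>(colsb >> 8);
    // 8-bit row count for this tile.
    cfg[48 + reg] = static_cast<std::uint8_t>(rows);
  }
  return cfg;
}
```

          <p>The resulting block is what would be copied to the stack slot
            that the LDTILECFG instruction for the region reads.</p>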

          <p>What I don't know, however, is how well the

            getRegAllocationHints method will work. The benefit is that

            you don't need to write a custom pre-allocator allocator. On

            the other hand, it might visit the virtual registers to

            assign in a suboptimal order because it doesn't really

            understand the constraint being imposed (generally, we just

            assign larger live ranges first). Then again, it is a

            greedy algorithm and if you want something systematically

            closer to optimal, maybe you should be using PBQP anyway. If

            you do end up needing a custom allocator for these, I

            recommend looking at the PBQP solver (which, as I recall, is

            independently reusable).<o:p></o:p></p>

          <p>Hopefully, this is more helpful advice.<o:p></o:p></p>

          <p> -Hal<o:p></o:p></p>

          <div>

            <p class="MsoNormal">On 8/21/20 9:54 PM, Luo, Yuanke wrote:<o:p></o:p></p>

          </div>

          <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

            <div>

              <p class="MsoNormal">It seems I made a mistake about sharing
                register units. Can we share a register unit among tile
                registers that are in different tile register classes
                (each register class has a different tile shape)?
                Consider two virtual tile registers,
                <i>%2:vtile1x1</i> and <i>%3:vtile1x2</i>. First %2 is
                allocated to $tmm0; after %2 is killed, %3 is
                allocated to $tmm0. This is not allowed, because when
                $tmm0 is allocated to %2, its shape is configured as
                1x1. If we reallocate $tmm0 to %3, we need to
                reconfigure $tmm0 to 1x2, which causes $tmm0~$tmm7 to be
                clobbered.<o:p></o:p></p>

              <p class="MsoNormal"> <o:p></o:p></p>

              <p class="MsoNormal">Yuanke<o:p></o:p></p>

              <p class="MsoNormal"> <o:p></o:p></p>

              <div>

                <div style="border:none;border-top:solid #E1E1E1

                  1.0pt;padding:3.0pt 0in 0in 0in">

                  <p class="MsoNormal"

                    style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">

                    <b>From:</b> Luo, Yuanke <br>

                    <b>Sent:</b> Friday, August 21, 2020 2:12 PM<br>

                    <b>To:</b> Hal Finkel <a

                      href="mailto:hfinkel@anl.gov"

                      moz-do-not-send="true"><hfinkel@anl.gov></a>;

                    Topper, Craig

                    <a href="mailto:craig.topper@intel.com"

                      moz-do-not-send="true"><craig.topper@intel.com></a>;

                    Kaylor, Andrew

                    <a href="mailto:andrew.kaylor@intel.com"

                      moz-do-not-send="true"><andrew.kaylor@intel.com></a>;

                    Philip Reames

                    <a href="mailto:listmail@philipreames.com"

                      moz-do-not-send="true"><listmail@philipreames.com></a>;

                    <a href="mailto:llvm-dev@lists.llvm.org"

                      moz-do-not-send="true">

                      llvm-dev@lists.llvm.org</a>; <a

                      href="mailto:florian_hahn@apple.com"

                      moz-do-not-send="true">florian_hahn@apple.com</a>;

                    Lu, Hongjiu

                    <a href="mailto:hongjiu.lu@intel.com"

                      moz-do-not-send="true"><hongjiu.lu@intel.com></a><br>

                    <b>Subject:</b> RE: [llvm-dev] Intel AMX programming

                    model discussion.<o:p></o:p></p>

                </div>

              </div>

              <p class="MsoNormal"> <o:p></o:p></p>

              <p class="MsoNormal">Hi Hal,<o:p></o:p></p>

              <p class="MsoNormal">The proposal is attractive to me, but
                there is something I still can’t figure out. Let’s take
                the MIR below as an example. We assume we have 256 register
                classes (vtile1x1, vtile1x2, …, vtile16x16).<o:p></o:p></p>

              <p class="MsoListParagraph"

                style="margin-left:.5in;text-indent:-.25in;mso-list:l3

                level1 lfo6">

                1. After instruction
                selection, the pseudo AMX instructions are generated. The
                names of the pseudo instructions have a ‘P’ prefix. At this
                point all the AMX pseudo instructions take vtile as their
                register class. Let’s assume %13 is the constant 3, %10 is
                the constant 4, and %14 is a variable.<o:p></o:p></p>

              <p class="MsoNormal"><i>  %1:vtile = <b><span

                      style="color:red">P</span></b>TILELOADDV %13:gr16,

                  %10:gr16, %17:gr64, 1, %18:gr64_nosp, 0, $noreg</i><o:p></o:p></p>

              <p class="MsoNormal"><i>  %2:vtile = <b>P</b>TILELOADDV

                  %10:gr16, %14:gr16, %17:gr64, 1, %18:gr64_nosp, 0,

                  $noreg</i><o:p></o:p></p>

              <p class="MsoNormal"><i>  %3:vtile = <b>P</b>TILELOADDV

                  %13:gr16, %14:gr16, %17:gr64, 1, %18:gr64_nosp, 0,

                  $noreg</i><o:p></o:p></p>

              <p class="MsoNormal"><i>%21:vtile = <b>P</b>TDPBSSDV

                  %13:gr16, %10:gr16, %14:gr16, %3:vtile(tied-def 0),

                  %1:vtile, %2:vtile

                </i><o:p></o:p></p>

              <p class="MsoListParagraph"

                style="margin-left:.5in;text-indent:-.25in;mso-list:l3

                level1 lfo6">

                <!--[if !supportLists]--><span style="mso-list:Ignore">2.<span

                    style="font:7.0pt "Times New Roman"">      

                  </span></span><!--[endif]-->The

                configuration-placement pass looks at all of the AMX

                pseudo-instructions and identifies regions in which the

                pseudo-instructions use the same configuration

                parameters. It first replaces the register class of every

                tile register whose shape is known at compile time.

                Since the shape of %1 is constant, it replaces

                %1:vtile with %1:vtile3x4, which changes the register

                class and morphs the pseudo-instruction into a real AMX

                instruction. The shapes of %2 and %3 are unknown at

                compile time, so the pass arbitrarily picks tile register

                classes that have not been assigned yet and assigns

                them to %2 and %3. After register class

                allocation, the code is transformed as shown below, with

                %2 assigned vtile1x1 and %3 assigned vtile1x2.

                <o:p></o:p></p>

              <p class="MsoNormal"><i>   <b>P</b>LDTILECFG</i><o:p></o:p></p>

              <p class="MsoNormal"><i>  %1:vtile3x4  = TILELOADDV

                  %17:gr64, 1, %18:gr64_nosp, 0, $noreg</i><o:p></o:p></p>

              <p class="MsoNormal"><i>  %2:vtile1x1 = TILELOADDV

                  %17:gr64, 1, %18:gr64_nosp, 0, $noreg</i><o:p></o:p></p>

              <p class="MsoNormal"><i>  %3:vtile1x2 = TILELOADDV

                  %17:gr64, 1, %18:gr64_nosp, 0, $noreg</i><o:p></o:p></p>

              <p class="MsoNormal"><i>%21:vtile1x2 = TDPBSSDV

                  %3:vtile1x2(tied-def 0), %1:vtile3x4, %2:vtile1x1

                </i><o:p></o:p></p>

              <p class="MsoNormal">There are some things I have not figured out:<o:p></o:p></p>

              <p class="MsoListParagraph"

                style="margin-left:.5in;text-indent:-.25in;mso-list:l2

                level1 lfo8">

                <!--[if !supportLists]--><span style="mso-list:Ignore">1.<span

                    style="font:7.0pt "Times New Roman"">      

                  </span></span><!--[endif]-->I am not sure whether an

                AMX instruction’s inputs and outputs can accept multiple

                register classes (vtile1x1, …, vtile16x16); otherwise we

                would need 256 pseudo-instructions.<o:p></o:p></p>

              <p class="MsoListParagraph"

                style="margin-left:.5in;text-indent:-.25in;mso-list:l2

                level1 lfo8">

                <!--[if !supportLists]--><span style="mso-list:Ignore">2.<span

                    style="font:7.0pt "Times New Roman"">      

                  </span></span><!--[endif]-->I am not sure whether 256

                register classes are enough. There may be more than 256

                tile registers of unknown shape.<o:p></o:p></p>

              <p class="MsoListParagraph"

                style="margin-left:.5in;text-indent:-.25in;mso-list:l2

                level1 lfo8">

                <!--[if !supportLists]--><span style="mso-list:Ignore">3.<span

                    style="font:7.0pt "Times New Roman"">      

                  </span></span><!--[endif]-->In this pass we also find

                the proper point (the common dominator) at which to insert

                ldtilecfg, but at this time the registers have not been

                allocated yet, so we don’t know the shape of each physical

                tile register. So we just insert a pseudo tile-config

                instruction.<o:p></o:p></p>
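              <p class="MsoNormal">The register-class assignment in step 2 can be sketched roughly as follows. This is plain Python, not an LLVM pass; the function name assign_tile_classes and the dictionary representation are hypothetical and only illustrate the idea. Registers whose shape is known at compile time get the matching class, the others take the first classes not already in use, and the pool is exhausted after 256 classes, which is the second concern above.</p>

<pre>
```python
from itertools import product


def assign_tile_classes(shapes):
    """Assign a register class name to each virtual tile register.

    `shapes` maps a virtual register name to (rows, cols), or to None
    when the shape is not known at compile time. Hypothetical helper,
    for illustration only.
    """
    # The 256 classes assumed in the example: vtile1x1 ... vtile16x16.
    all_classes = [f"vtile{r}x{c}" for r, c in product(range(1, 17), repeat=2)]

    assigned = {}
    used = set()

    # Registers with a compile-time shape get the class named after it.
    for reg, shape in shapes.items():
        if shape is not None:
            cls = f"vtile{shape[0]}x{shape[1]}"
            assigned[reg] = cls
            used.add(cls)

    # Registers with an unknown shape arbitrarily take the next class
    # that no other register has been given so far.
    free = (cls for cls in all_classes if cls not in used)
    for reg, shape in shapes.items():
        if shape is None:
            cls = next(free, None)
            if cls is None:
                raise RuntimeError("ran out of tile register classes")
            assigned[reg] = cls

    return assigned


example = assign_tile_classes({"%1": (3, 4), "%2": None, "%3": None})
```
</pre>

              <p class="MsoNormal">With the shapes from the example, %1 maps to vtile3x4, %2 to vtile1x1 and %3 to vtile1x2. Passing 257 registers of unknown shape raises the error.</p>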

              <p class="MsoListParagraph"

                style="margin-left:.5in;text-indent:-.25in;mso-list:l3

                level1 lfo6">

                <!--[if !supportLists]--><span style="mso-list:Ignore">3.<span

                    style="font:7.0pt "Times New Roman"">      

                  </span></span><!--[endif]-->All tile register classes

                share the same register units. Register allocation is done

                by the framework, and the code is transformed as follows.<o:p></o:p></p>

              <p class="MsoNormal"><i>  $tmm0  = TILELOADDV %17:gr64, 1,

                  %18:gr64_nosp, 0, $noreg</i><o:p></o:p></p>

              <p class="MsoNormal"><i>  $tmm1 = TILELOADDV %17:gr64, 1,

                  %18:gr64_nosp, 0, $noreg</i><o:p></o:p></p>

              <p class="MsoNormal"><i>  $tmm2 = TILELOADDV %17:gr64, 1,

                  %18:gr64_nosp, 0, $noreg</i><o:p></o:p></p>

              <p class="MsoNormal"><i>$tmm2 = TDPBSSDV $tmm2(tied-def

                  0), $tmm0, $tmm1</i><o:p></o:p></p>

              <p class="MsoListParagraph"

                style="margin-left:.5in;text-indent:-.25in;mso-list:l3

                level1 lfo6">

                <!--[if !supportLists]--><span style="mso-list:Ignore">4.<span

                    style="font:7.0pt "Times New Roman"">      

                  </span></span><!--[endif]-->Run the config pass to collect

                the shape of each physical tile register and configure

                them. The code could be generated as below. Here is the

                problem: how can we know the shape of each physical tile

                register?<o:p></o:p></p>

              <p class="MsoNormal"><b><i>   MOV row, col info to

                    %stack.0 for each physical tile register   ??????</i></b><o:p></o:p></p>

              <p class="MsoNormal"><b><i>  LDTILECFG %stack.0, 1,

                    $noreg, 0, $noreg, implicit-def $tmm0, implicit-def

                    $tmm1, implicit-def $tmm2, implicit-def $tmm3,

                    implicit-def $tmm4, implicit-def $tmm5, implicit-def

                    $tmm6, implicit-def $tmm7</i></b><o:p></o:p></p>

              <p class="MsoNormal"><i>  $tmm0  = TILELOADDV %17:gr64, 1,

                  %18:gr64_nosp, 0, $noreg</i><o:p></o:p></p>

              <p class="MsoNormal"><i>  $tmm1 = TILELOADDV %17:gr64, 1,

                  %18:gr64_nosp, 0, $noreg</i><o:p></o:p></p>

              <p class="MsoNormal"><i>  $tmm2 = TILELOADDV %17:gr64, 1,

                  %18:gr64_nosp, 0, $noreg</i><o:p></o:p></p>

              <p class="MsoNormal"><i>$tmm2 = TDPBSSDV $tmm2(tied-def

                  0), $tmm0, $tmm1</i><o:p></o:p></p>
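              <p class="MsoNormal">For reference, here is a minimal sketch of what the "MOV row, col info" step would have to write, assuming the shape of each physical tile register were somehow known. It does not answer the question above. The helper name build_tile_config and the concrete shapes are hypothetical. The 64-byte layout (palette in byte 0, 16-bit bytes-per-row entries starting at byte 16, one-byte row counts starting at byte 48) is my understanding of the LDTILECFG memory operand and should be checked against the ISA reference. The column value is treated as bytes per row, which is also an assumption.</p>

<pre>
```python
def build_tile_config(shapes, palette=1):
    """Build the 64-byte memory image that LDTILECFG would load.

    `shapes` maps a physical tile index (0..7 for tmm0..tmm7) to
    (rows, colsb), where colsb is the number of bytes per row.
    Hypothetical helper, for illustration only.
    """
    cfg = bytearray(64)          # all reserved bytes stay zero
    cfg[0] = palette             # byte 0: palette id
    for tmm, (rows, colsb) in shapes.items():
        if tmm not in range(8):
            raise ValueError(f"no such tile register: tmm{tmm}")
        off = 16 + 2 * tmm       # bytes 16..31: colsb, 2 bytes per tile
        cfg[off:off + 2] = colsb.to_bytes(2, "little")
        cfg[48 + tmm] = rows     # bytes 48..55: rows, 1 byte per tile
    return bytes(cfg)


# Illustrative shapes only: tmm0 is 3x4 as in the example; 8 stands in
# for the runtime value of %14, which the compiler does not know.
cfg = build_tile_config({0: (3, 4), 1: (4, 8), 2: (3, 8)})
```
</pre>

              <p class="MsoNormal">The dictionary passed in is exactly the information that is missing after register allocation: a map from each physical tile register to the row and column values that were operands of the original pseudo-instructions.</p>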

              <p class="MsoNormal"> <o:p></o:p></p>

              <p class="MsoNormal">Thanks<o:p></o:p></p>

              <p class="MsoNormal">Yuanke<o:p></o:p></p>

              <p class="MsoNormal"> <o:p></o:p></p>

              <p class="MsoNormal"

                style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">

                ... <o:p></o:p></p>

            </div>

          </blockquote>

          <pre>-- <o:p></o:p></pre>

          <pre>Hal Finkel<o:p></o:p></pre>

          <pre>Lead, Compiler Technology and Programming Languages<o:p></o:p></pre>

          <pre>Leadership Computing Facility<o:p></o:p></pre>

          <pre>Argonne National Laboratory<o:p></o:p></pre>

        </blockquote>


      </div>

    </blockquote>

    <pre class="moz-signature" cols="72">-- 

Hal Finkel

Lead, Compiler Technology and Programming Languages

Leadership Computing Facility

Argonne National Laboratory</pre>

  </body>

</html>