[llvm-dev] Intel AMX programming model discussion.

Hal Finkel via llvm-dev llvm-dev at lists.llvm.org
Thu Aug 20 14:37:47 PDT 2020


On 8/20/20 3:50 PM, Topper, Craig wrote:
>
> Ignore my spill comment for now. That’s more of an optimization.
>
> Let's say I have a 2x3 tile and a 3x2 tile, and I multiply them to make
> a 2x2 tile. I have 3 different sizes of tiles. So my instruction uses 3 
> different register classes for its virtual registers.
>
> The pass that inserts the ldtilecfg needs to configure the physical 
> tiles, so let's say it configures tmm0 to 2x3, tmm1 to 3x2 and tmm2 to 2x2.
>
> Register classes as I know them in llvm have a static list of physical 
> registers in them. So all 3 of the register classes for my virtual 
> registers contain all 8 physical tmm registers? How does the register 
> allocator know to use tmm0 for the 2x3 virtual register, and tmm1 for 
> the 3x2 virtual register, and tmm2 for the 2x2 virtual register?
>
> ~Craig
>

Ah, okay. I think I see why we're not on the same page. The 
architectural definition has 8 tile registers, tmm0-tmm7, but I was 
thinking that you would not model it that way. Instead, we could have 
registers:

tmm0_1x1 ... tmm7_1x1

...

tmm0_16x16 ... tmm7_16x16

where tmm0_1x1 aliases tmm0_1x2, ..., tmm0_16x16, and so on,

and corresponding register classes RegClassTmm1x1, ..., RegClassTmm16x16 
(I don't mean to imply this exact naming convention). So, within each 
region, you assign the relevant virtual registers to have a register 
class of RegClassTmm1x1, or whatever, and then once register allocation 
is done, you adjust the ldtilecfg data for each region so that it 
actually gives whatever registers were assigned the right tile sizes.
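
To make the naming scheme concrete, here is a minimal C sketch of how such
shaped registers could be numbered, and how the aliasing rule falls out of
the numbering. The ShapedReg struct, the flattening formula, and all of the
function names are made up for illustration; they are not LLVM APIs, and a
real target would presumably express this in TableGen instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the shaped registers tmmN_RxC:
   N in 0..7 is the architectural tile, R and C in 1..16 are the shape. */
enum { NUM_TILES = 8, MAX_DIM = 16, SHAPES_PER_TILE = MAX_DIM * MAX_DIM };

typedef struct {
    unsigned tile; /* architectural register index, 0..7 */
    unsigned rows; /* 1..16 */
    unsigned cols; /* 1..16 */
} ShapedReg;

/* Flatten (tile, rows, cols) into a unique id in [0, 8 * 256). */
static unsigned shaped_reg_id(ShapedReg r) {
    assert(r.tile < NUM_TILES);
    assert(r.rows >= 1 && r.rows <= MAX_DIM);
    assert(r.cols >= 1 && r.cols <= MAX_DIM);
    return r.tile * SHAPES_PER_TILE + (r.rows - 1) * MAX_DIM + (r.cols - 1);
}

/* Recover the architectural tile behind a flattened id. */
static unsigned arch_tile_of(unsigned id) {
    assert(id < NUM_TILES * SHAPES_PER_TILE);
    return id / SHAPES_PER_TILE;
}

/* Two shaped registers alias exactly when they share an architectural tile. */
static bool shaped_regs_alias(unsigned a, unsigned b) {
    return arch_tile_of(a) == arch_tile_of(b);
}
```

Under this numbering, tmm0_1x1 and tmm0_16x16 alias (both map to tile 0),
while tmm0_1x1 and tmm1_1x1 do not.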

You would not want to have N^2 versions of all of the instructions 
either, but I think you can just have the instructions defined to take 
some overall register class (containing all of the registers) and then 
you can call constrainRegClass in the configuration-placement pass.

Thinking about it, however, maybe having the different physical registers 
isn't actually needed. If you know which tile config each register 
needs, based on the instructions, maybe you can have only 8 of them and 
just update the ldtilecfg based on the usage information after 
allocation regardless.
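
A rough C sketch of that last idea: after allocation, take the shape that
the instructions require of each physical register and write it into the
64-byte block that ldtilecfg loads. The byte offsets below (palette at
byte 0, 16-bit bytes-per-row entries starting at byte 16, 8-bit row counts
starting at byte 48) follow my reading of the palette 1 layout in Intel's
published ISA reference and should be checked against it. TileShape and
build_tilecfg are hypothetical names, and shapes are expressed as rows by
bytes-per-row rather than rows by elements.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { NUM_TILES = 8, TILECFG_BYTES = 64 };

/* Shape required of one physical tile by the allocated code.
   rows == 0 marks a tile that nothing in the region uses. */
typedef struct {
    uint8_t rows;           /* 0..16 */
    uint16_t bytes_per_row; /* 0..64 */
} TileShape;

/* Fill the 64-byte configuration block from per-tile usage information.
   Offsets are an assumption based on the palette 1 layout (see above). */
static void build_tilecfg(const TileShape shapes[NUM_TILES],
                          uint8_t cfg[TILECFG_BYTES]) {
    memset(cfg, 0, TILECFG_BYTES);
    cfg[0] = 1; /* palette 1 */
    for (int t = 0; t < NUM_TILES; ++t) {
        assert(shapes[t].rows <= 16);
        assert(shapes[t].bytes_per_row <= 64);
        /* 16-bit little-endian bytes-per-row for tile t */
        cfg[16 + 2 * t] = (uint8_t)(shapes[t].bytes_per_row & 0xff);
        cfg[16 + 2 * t + 1] = (uint8_t)(shapes[t].bytes_per_row >> 8);
        /* 8-bit row count for tile t */
        cfg[48 + t] = shapes[t].rows;
    }
}
```

With Craig's example, the allocator would be free to pick any of the 8
registers, and this step would then record tmm0 as 2x3, tmm1 as 3x2 and
tmm2 as 2x2 (or whichever registers actually ended up holding those
values), leaving the unused tiles zeroed.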

  -Hal


> *From:* Hal Finkel <hfinkel at anl.gov>
> *Sent:* Thursday, August 20, 2020 1:27 PM
> *To:* Topper, Craig <craig.topper at intel.com>; Kaylor, Andrew 
> <andrew.kaylor at intel.com>; Luo, Yuanke <yuanke.luo at intel.com>; Philip 
> Reames <listmail at philipreames.com>; llvm-dev at lists.llvm.org; 
> florian_hahn at apple.com; Lu, Hongjiu <hongjiu.lu at intel.com>
> *Subject:* Re: [llvm-dev] Intel AMX programming model discussion.
>
> On 8/20/20 2:47 PM, Topper, Craig wrote:
>
>     I think I’m still missing something here. The configuration is per
>     tile. The multiply instructions take a MxK tile and multiply it by
>     a KxN tile and accumulate into an MxN tile. So the configuration
>     needs to know how many of each size of tile it needs to avoid a
>     spill. Wouldn’t the register allocator then need to know which
>     physical tiles have been configured to which sizes so that it only
>     chooses those tiles for an operand that needs that size?
>
> Yes, I think so. But it will because that information is essentially 
> encoded in the virtual register classes. I certainly could be missing 
> something. It seems like you first figure that out, and then you 
> assign virtual tile registers corresponding to the correct tile sizes. 
> Perhaps this comes down to what you mean by "avoid a spill." We still 
> might spill, and I assume that the infrastructure always needs to deal 
> with that. We should continue to do instruction scheduling in order to 
> minimize register pressure. Once we assign the right virtual register 
> classes to the AMX instructions, shouldn't this automatically happen? 
> If we do spill, since none of the original live ranges cross the 
> ldtilecfg, then there shouldn't be any fundamental issue with using a 
> regular load/store spill implementation.
>
> I'm definitely not an expert in this instruction set, so I may just 
> not understand some aspect of this. If there's something I'm 
> overlooking, a little example would be helpful.
>
> Thanks again,
>
> Hal
>
>     ~Craig
>
>     *From:* Hal Finkel <hfinkel at anl.gov> <mailto:hfinkel at anl.gov>
>     *Sent:* Thursday, August 20, 2020 12:35 PM
>     *To:* Topper, Craig <craig.topper at intel.com>
>     <mailto:craig.topper at intel.com>; Kaylor, Andrew
>     <andrew.kaylor at intel.com> <mailto:andrew.kaylor at intel.com>; Luo,
>     Yuanke <yuanke.luo at intel.com> <mailto:yuanke.luo at intel.com>;
>     Philip Reames <listmail at philipreames.com>
>     <mailto:listmail at philipreames.com>; llvm-dev at lists.llvm.org
>     <mailto:llvm-dev at lists.llvm.org>; florian_hahn at apple.com
>     <mailto:florian_hahn at apple.com>; Lu, Hongjiu
>     <hongjiu.lu at intel.com> <mailto:hongjiu.lu at intel.com>
>     *Subject:* Re: [llvm-dev] Intel AMX programming model discussion.
>
>     On 8/19/20 3:09 PM, Topper, Craig wrote:
>
>         The width and height can be runtime values that we would just
>         copy into 64 byte configuration block we pass to ldtilecfg. So
>         the code doesn’t need to be multiversioned. The user code
>         would also use those values to update pointers in the loops
>         they write using the tiles. If we can’t determine that two
>         tiles were defined with the same width and height we need to
>         assume the shape is different and try to avoid ever giving the
>         same tile.
>
>         Hal, with your suggestion, would the assignment of physical
>         registers to register classes be defined dynamically before
>         register allocation?
>
>     Here's my thought:
>
>     First, you have a set of intrinsics that take tile values along
>     with tile configuration parameters (which, presently, seem just to
>     be the sizes). These get lowered into pseudo-instructions that do
>     the same. Thus, you have some register class that represents these
>     arbitrarily-sized tile registers that you'll assign to these
>     pseudo-instruction operands (i.e., they take virtual tile
>     registers right after instruction selection). You might use the
>     16x16 tile register class for this purpose, but it shouldn't
>     really matter.
>
>     Second, you run this configuration-placement pass. This pass looks
>     at all of the AMX pseudo-instructions and identifies regions in
>     which the pseudo-instructions use the same configuration
>     parameters (i.e., the same SSA values and/or constants). This pass
>     might reorder the pseudo-instructions when legal in order to form
>     larger regions. Then it places the ldtilecfg at the start of each
>     region (in some common dominating position). ldtilecfg implicitly
>     defines all of the tile registers in every concrete class of tile
>     registers (all 256 of them, or whatever). The pseudo-instructions
>     are replaced by real MI instructions taking a tile register class
>     appropriate for the configuration (which will default to the 16x16
>     class for cases where the configuration is not a
>     compile-time-known constant). When the configuration is a known
>     constant, the instructions take operands with a register class
>     appropriate for that configuration (e.g., 1x1, 4x4).
>
>     Third, the rest of the framework runs as usual. Tile registers
>     from the appropriate class are allocated by the register
>     allocator. No live range of any virtual tile register can pass
>     through the ldtilecfg (because it defines them all), but that's
>     okay: none of the live ranges will, by construction (the
>     configuration-placement pass ensures this).
>
>     Thanks again,
>
>     Hal
>
>         *From:* Hal Finkel <hfinkel at anl.gov> <mailto:hfinkel at anl.gov>
>         *Sent:* Wednesday, August 19, 2020 12:52 PM
>         *To:* Kaylor, Andrew <andrew.kaylor at intel.com>
>         <mailto:andrew.kaylor at intel.com>; Luo, Yuanke
>         <yuanke.luo at intel.com> <mailto:yuanke.luo at intel.com>; Philip
>         Reames <listmail at philipreames.com>
>         <mailto:listmail at philipreames.com>; llvm-dev at lists.llvm.org
>         <mailto:llvm-dev at lists.llvm.org>; florian_hahn at apple.com
>         <mailto:florian_hahn at apple.com>; Topper, Craig
>         <craig.topper at intel.com> <mailto:craig.topper at intel.com>; Lu,
>         Hongjiu <hongjiu.lu at intel.com> <mailto:hongjiu.lu at intel.com>
>         *Subject:* Re: [llvm-dev] Intel AMX programming model discussion.
>
>         On 8/19/20 10:24 AM, Kaylor, Andrew wrote:
>
>             > When the tile shape is unknown at compile time, how do
>             you plan to do the register allocation of the tiles? My
>             question is: do you do the allocation for this case in the
>             same way as you would if you knew the size was 16x16
>             (i.e., conservatively assume the largest size)?
>
>             I think what will happen is that the registers are
>             allocated based on a number of runtime values that are
>             assumed to be different from one another but less than or
>             equal to 16. So, for example, we’ll allocate registers for
>             MxN tiles, NxM tiles and MxM tiles without knowing what M
>             and N are. Then at runtime the values of these variables
>             will be used to create the actual tile configuration. The
>             instructions that need to know the shape take these
>             runtime values as operands.
>
>         So you're going to multiversion the code?
>
>         In any case, my point is that you probably don't need a custom
>         register allocator. If you just define the tile registers and
>         make sure that ldtilecfg implicitly defines them all,
>         then the regular infrastructure likely works. You'll have a
>         bunch of register classes, but that's not necessarily a
>         problem. I recommend trying this, and let us know what you
>         discover, before we go down the road of a new, dedicated
>         allocator just for these registers.
>
>          -Hal
>
>             There may be some artifacts coming from the front end that
>             conservatively assume a 16x16 tile, but I think those
>             generally go away in SROA or later specialized passes.
>             Yuanke can confirm or correct my understanding of this.
>
>             *From:* Hal Finkel <hfinkel at anl.gov> <mailto:hfinkel at anl.gov>
>             *Sent:* Wednesday, August 19, 2020 5:14 AM
>             *To:* Luo, Yuanke <yuanke.luo at intel.com>
>             <mailto:yuanke.luo at intel.com>; Kaylor, Andrew
>             <andrew.kaylor at intel.com>
>             <mailto:andrew.kaylor at intel.com>; Philip Reames
>             <listmail at philipreames.com>
>             <mailto:listmail at philipreames.com>;
>             llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>;
>             florian_hahn at apple.com <mailto:florian_hahn at apple.com>;
>             Topper, Craig <craig.topper at intel.com>
>             <mailto:craig.topper at intel.com>; Lu, Hongjiu
>             <hongjiu.lu at intel.com> <mailto:hongjiu.lu at intel.com>
>             *Subject:* Re: [llvm-dev] Intel AMX programming model
>             discussion.
>
>             On 8/19/20 5:34 AM, Luo, Yuanke wrote:
>
>                 There is no problem with having 256 register classes;
>                 it just seems like a lot of register classes to me.
>
>                 We don’t assume the shape of each physical register is
>                 16x16; it is defined by the user. By variable shape, I
>                 mean that the shape is known at runtime and unknown at
>                 compile time. Take the code below as an example:
>                 %row and %col are variables instead of constants.
>                 The compiler recognizes llvm.x86.tileloadd64 and deduces
>                 that the shape of %0 is %row x %col.
>
>                 %0 = tail call <256 x i32> @llvm.x86.tileloadd64(i16
>                 %row, i16 %col, i8* getelementptr inbounds ([1024 x
>                 i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32)
>
>             When the tile shape is unknown at compile time, how do you
>             plan to do the register allocation of the tiles? My
>             question is: do you do the allocation for this case in the
>             same way as you would if you knew the size was 16x16
>             (i.e., conservatively assume the largest size)?
>
>             Thanks again,
>
>             Hal
>
>                 *From:* Hal Finkel <hfinkel at anl.gov>
>                 <mailto:hfinkel at anl.gov>
>                 *Sent:* Wednesday, August 19, 2020 4:58 PM
>                 *To:* Luo, Yuanke <yuanke.luo at intel.com>
>                 <mailto:yuanke.luo at intel.com>; Kaylor, Andrew
>                 <andrew.kaylor at intel.com>
>                 <mailto:andrew.kaylor at intel.com>; Philip Reames
>                 <listmail at philipreames.com>
>                 <mailto:listmail at philipreames.com>;
>                 llvm-dev at lists.llvm.org
>                 <mailto:llvm-dev at lists.llvm.org>;
>                 florian_hahn at apple.com
>                 <mailto:florian_hahn at apple.com>; Topper, Craig
>                 <craig.topper at intel.com>
>                 <mailto:craig.topper at intel.com>; Lu, Hongjiu
>                 <hongjiu.lu at intel.com> <mailto:hongjiu.lu at intel.com>
>                 *Subject:* Re: [llvm-dev] Intel AMX programming model
>                 discussion.
>
>                 On 8/19/20 2:21 AM, Luo, Yuanke wrote:
>
>                     Hi Hal,
>
>                     There are 3 aspects to be solved.
>
>                     1. The HW supports a maximum shape of 16x16, so
>                     there are many register classes, from 1x1 to 16x16.
>                     We need 256 register classes.
>
>                     2. We want to support variable shapes, so the
>                     compiler doesn’t know which register class fits a
>                     tile shape, as it is only known at runtime.
>
>                     3. The tile configuration configures the physical
>                     tile registers, so we need to allocate registers
>                     first; only then do we know the shape of each
>                     physical tile register and can configure it.
>
>                     I think your suggestion is helpful to reduce the
>                     complexity if we only support fixed (constant)
>                     tile shapes.
>
>                     -Yuanke
>
>                 Thanks, Yuanke.
>
>                 It's not clear to me that having 256 register classes
>                 is, in itself, a problem. Is it?
>
>                 What does it mean to support variable-shape tiles in
>                 this context? Do you do something other than
>                 conservatively assume that they are 16x16 for
>                 register-allocation purposes?
>
>                  -Hal
>
>                     *From:* Hal Finkel <hfinkel at anl.gov>
>                     <mailto:hfinkel at anl.gov>
>                     *Sent:* Wednesday, August 19, 2020 8:20 AM
>                     *To:* Kaylor, Andrew <andrew.kaylor at intel.com>
>                     <mailto:andrew.kaylor at intel.com>; Philip Reames
>                     <listmail at philipreames.com>
>                     <mailto:listmail at philipreames.com>; Luo, Yuanke
>                     <yuanke.luo at intel.com>
>                     <mailto:yuanke.luo at intel.com>;
>                     llvm-dev at lists.llvm.org
>                     <mailto:llvm-dev at lists.llvm.org>;
>                     florian_hahn at apple.com
>                     <mailto:florian_hahn at apple.com>; Topper, Craig
>                     <craig.topper at intel.com>
>                     <mailto:craig.topper at intel.com>; Lu, Hongjiu
>                     <hongjiu.lu at intel.com> <mailto:hongjiu.lu at intel.com>
>                     *Subject:* Re: [llvm-dev] Intel AMX programming
>                     model discussion.
>
>                     Hi, Andy,
>
>                     I don't quite understand everything that's going
>                     on here. Could we model this as:
>
>                      1. Define a collection of register classes, one
>                     for 2x4 tiles, one for 4x2 tiles, etc. each
>                     populated with a set of tile registers. Registers
>                     can have aliasing relationships (instead of
>                     worrying about any kind of subregister/superregister
>                     relationships -- these won't be useful anyway).
>
>                      2. Define the tile-configuration instructions so
>                     that they implicitly define all of the registers
>                     in all of the classes.
>
>                     Then you would still need to pre-schedule the tile
>                     operations as you've described, and collect the
>                     configuration information in order to add the
>                     ldtilecfgs, but the regular register allocator can
>                     handle the allocation itself in the usual way.
>                     What do you think?
>
>                      -Hal
>
>                     On 8/18/20 6:58 PM, Kaylor, Andrew via llvm-dev wrote:
>
>                         The AMX registers are complicated. The single
>                         configuration register (which is mostly used
>                         implicitly, similar to MXCSR for floating
>                         point) controls the shape of all the tile
>                         registers, and if you change the tile
>                         configuration every single tile register is
>                         cleared. In practice, if we have to change the
>                         configuration while any of the tile
>                         registers are live, performance is going to be
>                         terrible. We need to handle this case for
>                         correctness, but users of this programming
>                         interface will need to have enough awareness
>                         of the performance issues and the hardware
>                         details to prevent this. We’ll also want a
>                         diagnostic that lets the user know when this
>                         has happened.
>
>                         When the tile configuration is set, the shape
>                         of each tile is locked in, so the individual
>                         tile registers aren’t interchangeable at that
>                         point. If a function needs 2x4 tiles, 4x2
>                         tiles, and 4x4 tiles, the configuration needs
>                         to be set with this in mind. The shape isn’t
>                         explicit in every instruction and intrinsic.
>                         It must be deduced. And again, we’ll need a
>                         way to tell the user when efficient allocation
>                         can’t be done. In practice, I don’t expect any
>                         function to be using more than three tile shapes.
>
>                         The implication of all this is that I don’t
>                         think the greedy register allocator is well
>                         suited to figure all of this out. We need a
>                         special pass to pre-allocate these registers.
>                         If the function is written in a way that makes
>                         good performance possible, it should be a
>                         relatively simple task to allocate everything
>                         with minimal spilling. If it isn’t possible to
>                         get good performance, we don’t need to do
>                         anything especially clever. We can just do
>                         something straightforward that is correct and
>                         let the user know that they aren’t going to be
>                         happy with the results.
>
>                         -Andy
>
>                         *From:* Philip Reames
>                         <listmail at philipreames.com>
>                         <mailto:listmail at philipreames.com>
>                         *Sent:* Friday, August 14, 2020 8:29 PM
>                         *To:* Luo, Yuanke <yuanke.luo at intel.com>
>                         <mailto:yuanke.luo at intel.com>;
>                         llvm-dev at lists.llvm.org
>                         <mailto:llvm-dev at lists.llvm.org>;
>                         florian_hahn at apple.com
>                         <mailto:florian_hahn at apple.com>; Kaylor,
>                         Andrew <andrew.kaylor at intel.com>
>                         <mailto:andrew.kaylor at intel.com>; Topper,
>                         Craig <craig.topper at intel.com>
>                         <mailto:craig.topper at intel.com>; Lu, Hongjiu
>                         <hongjiu.lu at intel.com>
>                         <mailto:hongjiu.lu at intel.com>
>                         *Subject:* Re: [llvm-dev] Intel AMX
>                         programming model discussion.
>
>                         I find your answer unconvincing.  I'm not
>                         going to debate it as I don't wish to take the
>                         time to build the appropriate context, but my
>                         initial response is skepticism.
>
>                         Philip
>
>                         On 8/14/20 4:49 PM, Luo, Yuanke wrote:
>
>                             [Yuanke] The AMX register is special. It
>                             needs to be configured before use, and the
>                             config instruction is expensive. To avoid
>                             unnecessary tile configuration, we collect
>                             as much tile shape information as possible
>                             and combine it into one ldtilecfg
>                             instruction. The ldtilecfg instruction
>                             should dominate any AMX instruction that
>                             accesses a tile register. On the other side,
>                             the ldtilecfg should post-dominate the
>                             instructions that define the tile shape.
>                             For tile register spills, to avoid a
>                             re-config due to a different tile shape,
>                             the spilled register should be reloaded into
>                             a register that shares the same tile
>                             shape. Since tile register allocation is
>                             special and it may allocate general
>                             virtual registers to configure the tile
>                             registers, we can add a separate pass to do
>                             it before the general register allocation
>                             pass. After register allocation, the tile
>                             shape information is not needed anymore,
>                             so we can transform the pseudo AMX
>                             instructions into real AMX instructions by
>                             removing the row and column operands.
>
>                             [Philip]
>
>                             This seems complicated.
>
>                             Reading through the documentation, there
>                             appears to be a single global tile config
>                             for all tile registers at any time.
>
>                             Why not simply model this tile config as a
>                             designated special register and the tile
>                             instructions as having an implicit use of
>                             this register?  That would seem to ensure
>                             that the register allocator has all the
>                             constraints needed.  You'd need to teach
>                             it how to spill the special registers with
>                             the appropriate instructions, but that
>                             seems a lot more straight forward?
>
>                             [Yuanke] In that case the user needs to
>                             configure the tile registers themselves.
>                             Spilling the configuration register is very
>                             expensive, because it clears all the tile
>                             data registers to zero. In our proposal, the
>                             compiler is responsible for deducing the
>                             shape of each virtual tile data register,
>                             allocating physical registers for them, and
>                             then configuring those physical registers.
>                             We may build the dependency as you proposed,
>                             and it can be used for machine IR checks to
>                             ensure a tile data register is configured
>                             before use.
>
>                             *From:* Philip Reames
>                             <listmail at philipreames.com>
>                             <mailto:listmail at philipreames.com>
>                             *Sent:* Saturday, August 15, 2020 1:17 AM
>                             *To:* Luo, Yuanke <yuanke.luo at intel.com>
>                             <mailto:yuanke.luo at intel.com>;
>                             llvm-dev at lists.llvm.org
>                             <mailto:llvm-dev at lists.llvm.org>;
>                             florian_hahn at apple.com
>                             <mailto:florian_hahn at apple.com>; Kaylor,
>                             Andrew <andrew.kaylor at intel.com>
>                             <mailto:andrew.kaylor at intel.com>; Topper,
>                             Craig <craig.topper at intel.com>
>                             <mailto:craig.topper at intel.com>; Lu,
>                             Hongjiu <hongjiu.lu at intel.com>
>                             <mailto:hongjiu.lu at intel.com>
>                             *Subject:* Re: [llvm-dev] Intel AMX
>                             programming model discussion.
>
>                             On 8/14/20 6:27 AM, Luo, Yuanke via
>                             llvm-dev wrote:
>
>                                 Hi,
>
>                                 Intel Advanced Matrix Extensions
>                                 (Intel AMX) is a new programming
>                                 paradigm consisting of two components:
>                                 a set of 2-dimensional registers
>                                 (tiles) representing sub-arrays from a
>                                 larger 2-dimensional memory image, and
>                                 accelerators able to operate on tiles.
>                                 Capability of Intel AMX implementation
>                                 is enumerated by palettes. Two
>                                 palettes are supported: palette 0
>                                 represents the initialized state and
>                                 palette 1 consists of 8 tile registers
>                                 of up to 1 KB size, which is
>                                 controlled by a tile control register.
>
>                                 The instruction manual is posted at
>                                 https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
>                                 <https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html>.
>
>                                 The AMX abi proposal is posted at
>                                 https://groups.google.com/g/x86-64-abi/c/NRejFm7pwb4
>                                 <https://groups.google.com/g/x86-64-abi/c/NRejFm7pwb4>.
>
>                                 This email is to discuss the
>                                 programming model for AMX. Florian has
>                                 introduced the matrix type and
>                                 intrinsics in LLVM community. We’d
>                                 like to adopt some ideas from it.
>
>                                 Here is what we propose for the AMX
>                                 programming model.
>
>                                 1. Data type.
>
>                                 We’d like to have a fixed vector type
>                                 for AMX. Since the shape of an AMX
>                                 register is configurable, the
>                                 vector size is the maximum size of an AMX
>                                 register. That means the vector size
>                                 is 1024 bytes.
>
>                                 The C code may look like this.
>
>                                 typedef int _tile_data
>                                 __attribute__((__vector_size__(1024),
>                                 __aligned__(64)));
>
>                                 _tile_data tile;
>
>                                 And the LLVM IR may look like this.
>
>                                 @tile = dso_local local_unnamed_addr
>                                 global <256 x i32> zeroinitializer,
>                                 align 64
>
>                                 For llvm IR, it is nice to have a new
>                                 type x86_amxtile that can be mapped to
>                                 AMX registers.
>
>                                 2. AMX Intrinsics.
>
>                                 The internal intrinsics are 1:1 mapped
>                                 to AMX instructions. The parameters m,
>                                 n, k identify the shape of the tile.
>                                 The shape can be variable, but it
>                                 cannot exceed the size that the AMX HW
>                                 can support. The compiler can deduce the
>                                 shape of the tile from the AMX intrinsics.
>
>                                 _tile_data _tile_loadd_internal(char
>                                 m, short n, const void *base, int stride);
>
>                                 _tile_data _tile_dpbssd_internal(char
>                                 m, short n, short k, _tile_data dst,
>                                 _tile_data src1, _tile_data src2);
>
>                                 _tile_data
>                                 _tile_dpbf16ps_internal(char m, short
>                                 n, short k, _tile_data dst, _tile_data
>                                 src1, _tile_data src2);
>
>                                 void _tile_stored_internal(char m,
>                                 short n, void *base, int stride,
>                                 _tile_data tile);
>
>                                 3. User interfaces.
>
>                                 The tile shape and tile data are
>                                 combined into a struct in the C language.
>                                 The shape of the tile is only allowed
>                                 to be initialized once. The user
>                                 interface looks like this.
>
>                                 #define __DEFAULT_FN_AMX    \
>                                   __attribute__((__always_inline__, __nodebug__, __target__("amx-int8")))
>
>                                 typedef struct __tile_str {
>                                   const char row;
>                                   const short col;
>                                   _tile_data tile;
>                                 }__tile;
>
>                                 __DEFAULT_FN_AMX
>                                 void __tile_loadd(__tile *dst, const void *base, long stride) {
>                                   dst->tile = _tile_loadd_internal(dst->row, dst->col, base, stride);
>                                 }
>
>                                 __DEFAULT_FN_AMX
>                                 void __tile_dpbsud(__tile *dst, __tile src1, __tile src2) {
>                                   dst->tile = _tile_dpbssd_internal(src1.row, src2.col, src1.col,
>                                                                     dst->tile, src1.tile, src2.tile);
>                                 }
>
>                                 __DEFAULT_FN_AMX
>                                 void __tile_stored(void *base, long stride, __tile src) {
>                                   _tile_stored_internal(src.row, src.col, base, stride, src.tile);
>                                 }
>
>                                 4.Example code
>
>                                 The example shows how to use the user
>                                 interface in a function.
>
>                                 void api(int cond, short row, short col) {
>                                   __tile a = {row, col};
>                                   __tile b = {row, col};
>                                   __tile c = {row, col};
>
>                                   if (cond) {
>                                     __tile_loadd(&a, buf, STRIDE);
>                                     __tile_loadd(&b, buf, STRIDE);
>                                     __tile_loadd(&c, buf, STRIDE);
>                                   } else {
>                                     __tile_loadd(&a, buf2, STRIDE);
>                                     __tile_loadd(&b, buf2, STRIDE);
>                                     __tile_loadd(&c, buf2, STRIDE);
>                                   }
>                                   __tile_dpbsud(&c, a, b);
>                                   __tile_stored(buf, STRIDE, c);
>                                 }
>
>                                 5.LLVM IR
>
>                                 The LLVM intrinsics take the row and
>                                 column information as input
>                                 parameters, so that the compiler can
>                                 deduce the shape of the tile data. The
>                                 remaining parameters are what the AMX
>                                 instructions require. This is the LLVM
>                                 IR corresponding to the example code.
>
>                                 define dso_local void @api(i32 %cond, i16 signext %row,
>                                     i16 signext %col) local_unnamed_addr #2 {
>                                 entry:
>                                   %tobool = icmp eq i32 %cond, 0
>                                   %sext = shl i16 %col, 8
>                                   %conv.i31 = ashr exact i16 %sext, 8
>                                   br i1 %tobool, label %if.else, label %if.then
>
>                                 if.then:                     ; preds = %entry
>                                   %0 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>                                       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0),
>                                       i64 32) #3
>                                   %1 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>                                       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0),
>                                       i64 32) #3
>                                   %2 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>                                       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0),
>                                       i64 32) #3
>                                   br label %if.end
>
>                                 if.else:                     ; preds = %entry
>                                   %3 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>                                       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0),
>                                       i64 32) #3
>                                   %4 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>                                       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0),
>                                       i64 32) #3
>                                   %5 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>                                       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0),
>                                       i64 32) #3
>                                   br label %if.end
>
>                                 if.end:                      ; preds = %if.else, %if.then
>                                   %a.sroa.1186.0 = phi <256 x i32> [ %3, %if.else ], [ %0, %if.then ]
>                                   %b.sroa.1068.0 = phi <256 x i32> [ %4, %if.else ], [ %1, %if.then ]
>                                   %c.sroa.1149.0 = phi <256 x i32> [ %5, %if.else ], [ %2, %if.then ]
>                                   %6 = tail call <256 x i32> @llvm.x86.tdpbssd(i16 %row, i16 %conv.i31,
>                                       i16 %conv.i31, <256 x i32> %c.sroa.1149.0,
>                                       <256 x i32> %a.sroa.1186.0, <256 x i32> %b.sroa.1068.0) #3
>                                   tail call void @llvm.x86.tilestored64(i16 %row, i16 %conv.i31,
>                                       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0),
>                                       i64 32, <256 x i32> %6) #3
>                                   ret void
>                                 }
>
>                                 6.Shape propagation
>
>                                 In an -O0 build, the front end
>                                 generates some general loads and
>                                 stores of the tile vector. We need to
>                                 start from the AMX intrinsics and
>                                 propagate the shape information to the
>                                 virtual tile registers. If an AMX
>                                 intrinsic uses the result of a load
>                                 instruction, the shape is propagated
>                                 to the load and the load is
>                                 transformed into a tile load
>                                 intrinsic. If a store instruction uses
>                                 the result of an AMX intrinsic, the
>                                 shape is propagated to the store
>                                 instruction and the store is
>                                 transformed into a tile store
>                                 intrinsic.
>
>                                 7.Machine IR
>
>                                 Since the AMX intrinsics take the row
>                                 and column as input parameters, we can
>                                 create a pseudo instruction
>                                 corresponding to each of them. The AMX
>                                 intrinsics are lowered to pseudo AMX
>                                 instructions, which have extra row and
>                                 column operands corresponding to those
>                                 of the AMX intrinsic. The real AMX
>                                 instructions don't need the row and
>                                 column operands. The row and column
>                                 information must be configured by
>                                 ldtilecfg before any AMX instruction
>                                 executes.
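
To make the ldtilecfg step concrete, here is a hedged C sketch of the
64-byte configuration block that ldtilecfg reads, following the layout
described in Intel's AMX architecture documentation: a palette byte, a
start_row byte, 14 reserved bytes, then 16-bit bytes-per-row entries and
8-bit row-count entries, of which tmm0-tmm7 use the first eight. The type
and function names are hypothetical; the sketch only fills the block in
memory and does not execute ldtilecfg.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* In-memory operand of ldtilecfg (64 bytes). Names are illustrative. */
typedef struct {
  uint8_t palette_id;   /* byte 0: 1 selects the 8-tile palette */
  uint8_t start_row;    /* byte 1: 0 for a fresh configuration */
  uint8_t reserved[14]; /* bytes 2..15: must be zero */
  uint16_t colsb[16];   /* bytes 16..47: bytes per row; entries 0..7 used */
  uint8_t rows[16];     /* bytes 48..63: row counts; entries 0..7 used */
} tile_config;

/* Zero the whole block and select palette 1. */
static void tile_config_init(tile_config *cfg) {
  memset(cfg, 0, sizeof *cfg);
  cfg->palette_id = 1;
}

/* Record the shape of one physical tile register (tmm0..tmm7). */
static void tile_config_set(tile_config *cfg, unsigned tmm,
                            uint8_t rows, uint16_t colsb) {
  assert(tmm < 8 && rows <= 16 && colsb <= 64);
  cfg->rows[tmm] = rows;
  cfg->colsb[tmm] = colsb;
}
```

A configuration-placement pass could fill one such block per region from
the row and column operands of the pseudo instructions before emitting
ldtilecfg; tiles left at zero rows and columns stay unconfigured.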
>
>                                 8.Register allocation
>
>                                 AMX registers are special. They must
>                                 be configured before use, and the
>                                 config instruction is expensive. To
>                                 avoid unnecessary tile configuration,
>                                 we collect as much tile shape
>                                 information as possible and combine it
>                                 into one ldtilecfg instruction. The
>                                 ldtilecfg instruction should dominate
>                                 every AMX instruction that accesses a
>                                 tile register. On the other side, the
>                                 ldtilecfg should post-dominate the
>                                 instructions that define the tile
>                                 shape. Tile register spills should
>                                 avoid a re-config caused by a
>                                 different tile shape: a spilled
>                                 register should be reloaded into a
>                                 register that shares the same tile
>                                 shape. Since tile register allocation
>                                 is special and may need to allocate
>                                 general virtual registers to configure
>                                 the tile registers, we can add a
>                                 separate pass to do it before the
>                                 general register allocation pass.
>                                 After register allocation, the tile
>                                 shape information is no longer needed,
>                                 so we can transform each pseudo AMX
>                                 instruction into a real AMX
>                                 instruction by removing the row and
>                                 column operands.
>
>                             This seems complicated.
>
>                             Reading through the documentation, there
>                             appears to be a single global tile config
>                             for all tile registers at any time.
>
>                             Why not simply model this tile config as a
>                             designated special register and the tile
>                             instructions as having an implicit use of
>                             this register?  That would seem to ensure
>                             that the register allocator has all the
>                             constraints needed.  You'd need to teach
>                             it how to spill the special registers with
>                             the appropriate instructions, but that
>                             seems a lot more straightforward?
>
>                                 9.Use recommendation
>
>                                 Because of the shape configuration
>                                 issue, we recommend that users define
>                                 the tile shape at function entry and
>                                 inline functions as much as possible.
>                                 The AMX instructions focus on
>                                 computation rather than storage, so
>                                 global variables for tile data are not
>                                 recommended.
>
>                                 Thanks
>
>                                 Yuanke
>
>                                 _______________________________________________
>
>                                 LLVM Developers mailing list
>
>                                 llvm-dev at lists.llvm.org  <mailto:llvm-dev at lists.llvm.org>
>
>                                 https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev  <https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
>                     -- 
>
>                     Hal Finkel
>
>                     Lead, Compiler Technology and Programming Languages
>
>                     Leadership Computing Facility
>
>                     Argonne National Laboratory

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
