[llvm-dev] Intel AMX programming model discussion.

Wed Aug 19 05:14:17 PDT 2020

On 8/19/20 5:34 AM, Luo, Yuanke wrote:
>
> Having 256 register classes is not a problem in itself. It just seems
> like a lot of register classes to me.
>
> We don’t assume that each physical register has shape 16x16; the shape
> is defined by the user. By variable shape, I mean that the shape is
> known only at run time and is unknown at compile time. Take the code
> below as an example: %row and %col are variables rather than constants.
> The compiler recognizes llvm.x86.tileloadd64 and deduces that the shape
> of %0 is %row x %col.
>
> %0 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %col, 
> i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 
> 0), i64 32)
>

When the tile shape is unknown at compile time, how do you plan to do 
the register allocation of the tiles? My question is: do you do the 
allocation for this case in the same way as you would if you knew the 
size was 16x16 (i.e., conservatively assume the largest size)?

Thanks again,

Hal

> *From:* Hal Finkel <hfinkel at anl.gov>
> *Sent:* Wednesday, August 19, 2020 4:58 PM
> *To:* Luo, Yuanke <yuanke.luo at intel.com>; Kaylor, Andrew 
> <andrew.kaylor at intel.com>; Philip Reames <listmail at philipreames.com>; 
> llvm-dev at lists.llvm.org; florian_hahn at apple.com; Topper, Craig 
> <craig.topper at intel.com>; Lu, Hongjiu <hongjiu.lu at intel.com>
> *Subject:* Re: [llvm-dev] Intel AMX programming model discussion.
>
> On 8/19/20 2:21 AM, Luo, Yuanke wrote:
>
>     Hi Hal,
>
>     There are 3 aspects to be solved.
>
>     1. The HW supports a maximum shape of 16x16, so there are many
>     register classes, from 1x1 to 16x16. We need 256 register classes.
>
>     2. We want to support variable shapes, so the compiler doesn’t know
>     which register class fits a tile shape, as the shape is known only
>     at run time.
>
>     3. Tile configuration configures the physical tile registers, so we
>     need to allocate registers first; only then do we know the shape of
>     each physical tile register and can configure it.
>
>     I think your suggestion would help reduce the complexity if we
>     only supported fixed (constant) tile shapes.
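To make the count in item 1 concrete, here is a minimal C sketch, assuming one register class per (rows, cols) pair. The names MAX_DIM and shape_class_index are hypothetical and exist only for illustration; they are not part of LLVM or of the proposal.

```c
#include <assert.h>

/* Assumption: rows and columns each range over 1..16, per the 16x16 limit. */
enum { MAX_DIM = 16 };

/*
 * Hypothetical numbering of shape-specific register classes:
 * one class per (rows, cols) pair, numbered 0..255.
 * Returns -1 when the shape cannot be represented.
 */
static int shape_class_index(int rows, int cols) {
    if (rows < 1 || rows > MAX_DIM || cols < 1 || cols > MAX_DIM)
        return -1;
    return (rows - 1) * MAX_DIM + (cols - 1);
}
```

With this numbering a 2x4 tile and a 4x2 tile fall into different classes (19 and 49). A shape known only at run time cannot be mapped to a class at compile time, which is the difficulty raised in item 2.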
>
>     -Yuanke
>
> Thanks, Yuanke.
>
> It's not clear to me that having 256 register classes is, in itself, a 
> problem. Is it?
>
> What does it mean to support variable-shape tiles in this context? Do 
> you do something other than conservatively assume that they are 16x16 
> for register-allocation purposes?
>
>  -Hal
>
>     *From:* Hal Finkel <hfinkel at anl.gov>
>     *Sent:* Wednesday, August 19, 2020 8:20 AM
>     *To:* Kaylor, Andrew <andrew.kaylor at intel.com>; Philip Reames
>     <listmail at philipreames.com>; Luo, Yuanke <yuanke.luo at intel.com>;
>     llvm-dev at lists.llvm.org; florian_hahn at apple.com; Topper, Craig
>     <craig.topper at intel.com>; Lu, Hongjiu <hongjiu.lu at intel.com>
>     *Subject:* Re: [llvm-dev] Intel AMX programming model discussion.
>
>     Hi, Andy,
>
>     I don't quite understand everything that's going on here. Could we
>     model this as:
>
>      1. Define a collection of register classes, one for 2x4 tiles,
>     one for 4x2 tiles, etc., each populated with a set of tile
>     registers. Registers can have aliasing relationships (instead of
>     worrying about any kind of subregister/superregister relationships --
>     these won't be useful anyway).
>
>      2. Define the tile-configuration instructions so that they
>     implicitly define all of the registers in all of the classes.
>
>     Then you would still need to pre-schedule the tile operations as
>     you've described, and collect the configuration information in
>     order to add the ldtilecfgs, but the regular register allocator
>     can handle the allocation itself in the usual way. What do you think?
>
>      -Hal
>
>     On 8/18/20 6:58 PM, Kaylor, Andrew via llvm-dev wrote:
>
>         The AMX registers are complicated. The single configuration
>         register (which is mostly used implicitly, similar to MXCSR
>         for floating point) controls the shape of all the tile
>         registers, and if you change the tile configuration every
>         single tile register is cleared. In practice, if we have to
>         change the configuration while any of the tile registers
>         are live, performance is going to be terrible. We need to
>         handle this case for correctness, but users of this
>         programming interface will need to have enough awareness of
>         the performance issues and the hardware details to prevent
>         this. We’ll also want a diagnostic that lets the user know
>         when this has happened.
>
>         When the tile configuration is set, the shape of each tile is
>         locked in, so the individual tile registers aren’t
>         interchangeable at that point. If a function needs 2x4 tiles,
>         4x2 tiles, and 4x4 tiles, the configuration needs to be set
>         with this in mind. The shape isn’t explicit in every
>         instruction and intrinsic. It must be deduced. And again,
>         we’ll need a way to tell the user when efficient allocation
>         can’t be done. In practice, I don’t expect any function to be
>         using more than three tile shapes.
>
>         The implication of all this is that I don’t think the greedy
>         register allocator is well suited to figure all of this out.
>         We need a special pass to pre-allocate these registers. If the
>         function is written in a way that makes good performance
>         possible, it should be a relatively simple task to allocate
>         everything with minimal spilling. If it isn’t possible to get
>         good performance, we don’t need to do anything especially
>         clever. We can just do something straightforward that is
>         correct and let the user know that they aren’t going to be
>         happy with the results.
>
>         -Andy
>
>         *From:* Philip Reames <listmail at philipreames.com>
>         *Sent:* Friday, August 14, 2020 8:29 PM
>         *To:* Luo, Yuanke <yuanke.luo at intel.com>; llvm-dev at lists.llvm.org;
>         florian_hahn at apple.com; Kaylor, Andrew <andrew.kaylor at intel.com>;
>         Topper, Craig <craig.topper at intel.com>; Lu, Hongjiu
>         <hongjiu.lu at intel.com>
>         *Subject:* Re: [llvm-dev] Intel AMX programming model discussion.
>
>         I find your answer unconvincing.  I'm not going to debate it
>         as I don't wish to take the time to build the appropriate
>         context, but my initial response is skepticism.
>
>         Philip
>
>         On 8/14/20 4:49 PM, Luo, Yuanke wrote:
>
>             [Yuanke] AMX registers are special. They need to be
>             configured before use, and the config instruction is
>             expensive. To avoid unnecessary tile configuration, we
>             collect as much tile shape information as possible and
>             combine it into one ldtilecfg instruction. The ldtilecfg
>             instruction should dominate any AMX instruction that
>             accesses a tile register. On the other hand, the ldtilecfg
>             should post-dominate the instructions that define the tile
>             shape. A tile register spill should avoid a re-config
>             caused by a different tile shape: the spilled register
>             should be reloaded into a register that has the same
>             tile shape. Since tile register allocation is special and
>             may allocate general virtual registers to configure the
>             tile registers, we can add a separate pass to do it before
>             the general register allocation pass. After register
>             allocation, the tile shape information is no longer
>             needed, so we can transform the pseudo AMX instructions into
>             real AMX instructions by removing the row and column operands.
>
>             [Philip]
>
>             This seems complicated.
>
>             Reading through the documentation, there appears to be a
>             single global tile config for all tile registers at any time.
>
>             Why not simply model this tile config as a designated
>             special register and the tile instructions as having an
>             implicit use of this register?  That would seem to ensure
>             that the register allocator has all the constraints
>             needed.  You'd need to teach it how to spill the special
>             registers with the appropriate instructions, but that
>             seems a lot more straightforward?
>
>             [Yuanke] In that case users would need to configure the tile
>             registers themselves. Spilling the configuration register is
>             very expensive, because it clears all the tile data
>             registers to zero. In our proposal, the compiler is
>             responsible for deducing the shape of each virtual tile data
>             register, allocating physical registers for them, and then
>             configuring those physical registers. We may build the
>             dependency as you proposed, and it can be used in a machine
>             IR check to ensure that tile data registers are configured
>             before use.
>
>             *From:* Philip Reames <listmail at philipreames.com>
>             *Sent:* Saturday, August 15, 2020 1:17 AM
>             *To:* Luo, Yuanke <yuanke.luo at intel.com>;
>             llvm-dev at lists.llvm.org; florian_hahn at apple.com; Kaylor,
>             Andrew <andrew.kaylor at intel.com>; Topper, Craig
>             <craig.topper at intel.com>; Lu, Hongjiu
>             <hongjiu.lu at intel.com>
>             *Subject:* Re: [llvm-dev] Intel AMX programming model
>             discussion.
>
>             On 8/14/20 6:27 AM, Luo, Yuanke via llvm-dev wrote:
>
>                 Hi,
>
>                 Intel Advanced Matrix Extensions (Intel AMX) is a new
>                 programming paradigm consisting of two components: a
>                 set of 2-dimensional registers (tiles) representing
>                 sub-arrays from a larger 2-dimensional memory image,
>                 and accelerators able to operate on tiles. The
>                 capabilities of an Intel AMX implementation are
>                 enumerated by palettes. Two palettes are supported:
>                 palette 0 represents the initialized state, and palette 1
>                 consists of 8 tile registers of up to 1 KB each, which
>                 are controlled by a tile control register.
>
>                 The instruction manual is posted at
>                 <https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html>.
>
>                 The AMX ABI proposal is posted at
>                 <https://groups.google.com/g/x86-64-abi/c/NRejFm7pwb4>.
>
>                 This email is to discuss the programming model for
>                 AMX. Florian has introduced the matrix type and
>                 intrinsics to the LLVM community. We’d like to adopt some
>                 ideas from them.
>
>                 Here is what we propose for the AMX programming model.
>
>                 1. Data type.
>
>                 We’d like to have a fixed vector type for AMX. Since the
>                 shape of an AMX register is configurable, the vector
>                 size is the maximum size of an AMX register. That means
>                 the vector size is 1024 bytes.
>
>                 The C code may look like this.
>
>                 typedef int _tile_data
>                 __attribute__((__vector_size__(1024), __aligned__(64)));
>
>                 _tile_data tile;
>
>                 And the LLVM IR may look like this.
>
>                 @tile = dso_local local_unnamed_addr global <256 x
>                 i32> zeroinitializer, align 64
>
>                 For LLVM IR, it would be nice to have a new type
>                 x86_amxtile that can be mapped to AMX registers.
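The size arithmetic behind the 1024-byte vector can be checked with a short C sketch. It assumes the palette 1 limits of 16 rows and 64 bytes per row; MAX_ROWS, MAX_COLSB, and tile_bytes are illustrative names, not part of the proposal.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed palette 1 limits: 16 rows per tile, 64 bytes per row. */
enum { MAX_ROWS = 16, MAX_COLSB = 64 };

/*
 * Bytes occupied by a tile with the given number of rows and
 * bytes per row. Returns 0 for a shape the hardware cannot hold.
 */
static unsigned tile_bytes(unsigned rows, unsigned colsb) {
    if (rows > MAX_ROWS || colsb > MAX_COLSB)
        return 0;
    return rows * colsb;
}

/* The largest tile is 16 * 64 = 1024 bytes, i.e. 256 elements of i32. */
_Static_assert(MAX_ROWS * MAX_COLSB == 1024, "maximum tile size is 1 KB");
_Static_assert(MAX_ROWS * MAX_COLSB / sizeof(int32_t) == 256,
               "1 KB holds 256 x i32");
```

A smaller tile, for example 4 rows of 32 bytes, occupies only 128 bytes, but the fixed <256 x i32> type always reserves room for the maximum shape.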
>
>                 2. AMX intrinsics.
>
>                 The internal intrinsics are mapped 1:1 to AMX
>                 instructions. The parameters m, n, and k identify the
>                 shape of the tiles. The shape can be variable, but it
>                 cannot exceed the size that the AMX HW can support.
>                 The compiler can deduce the shape of a tile from the
>                 AMX intrinsics.
>
>                 _tile_data _tile_loadd_internal(char m, short n, const
>                 void *base, int stride);
>
>                 _tile_data _tile_dpbssd_internal(char m, short n,
>                 short k, _tile_data dst, _tile_data src1, _tile_data
>                 src2);
>
>                 _tile_data _tile_dpbf16ps_internal(char m, short n,
>                 short k, _tile_data dst, _tile_data src1, _tile_data
>                 src2);
>
>                 void _tile_stored_internal(char m, short n, void
>                 *base, int stride, _tile_data tile);
>
>                 3. User interfaces.
>
>                 The tile shape and tile data are combined into a
>                 struct in C. The shape of a tile may be initialized
>                 only once. The user interface looks like this.
>
>                 #define __DEFAULT_FN_AMX    \
>                 __attribute__((__always_inline__, __nodebug__, \
>                                __target__("amx-int8")))
>
>                 typedef struct __tile_str {
>                   const char row;
>                   const short col;
>                   _tile_data tile;
>                 }__tile;
>
>                 __DEFAULT_FN_AMX
>                 void __tile_loadd(__tile *dst, const void *base,
>                                   long stride) {
>                   dst->tile = _tile_loadd_internal(dst->row, dst->col,
>                                                    base, stride);
>                 }
>
>                 __DEFAULT_FN_AMX
>                 void __tile_dpbsud(__tile *dst, __tile src1, __tile src2) {
>                   dst->tile = _tile_dpbssd_internal(src1.row, src2.col,
>                                                     src1.col, dst->tile,
>                                                     src1.tile, src2.tile);
>                 }
>
>                 __DEFAULT_FN_AMX
>                 void __tile_stored(void *base, long stride, __tile src) {
>                   _tile_stored_internal(src.row, src.col, base, stride,
>                                         src.tile);
>                 }
>
>
>                 4. Example code
>
>                 The example shows how to use the user interface in a
>                 function.
>
>                 void api(int cond, short row, short col) {
>                   __tile a = {row, col};
>                   __tile b = {row, col};
>                   __tile c = {row, col};
>
>                   if(cond) {
>                     __tile_loadd(&a, buf, STRIDE);
>                     __tile_loadd(&b, buf, STRIDE);
>                     __tile_loadd(&c, buf, STRIDE);
>                   } else {
>                     __tile_loadd(&a, buf2, STRIDE);
>                     __tile_loadd(&b, buf2, STRIDE);
>                     __tile_loadd(&c, buf2, STRIDE);
>                   }
>                   __tile_dpbsud(&c, a, b);
>                   __tile_stored(buf, STRIDE, c);
>                 }
>
>                 5. LLVM IR
>
>                 The LLVM IR intrinsics take the row and column
>                 information as input parameters, so that the compiler
>                 can deduce the shape of the tile data. The remaining
>                 parameters are what the AMX instructions require. This is
>                 the LLVM IR corresponding to the example code.
>
>                 define dso_local void @api(i32 %cond, i16 signext %row,
>                     i16 signext %col) local_unnamed_addr #2 {
>                 entry:
>                   %tobool = icmp eq i32 %cond, 0
>                   %sext = shl i16 %col, 8
>                   %conv.i31 = ashr exact i16 %sext, 8
>                   br i1 %tobool, label %if.else, label %if.then
>
>                 if.then:                        ; preds = %entry
>                   %0 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row,
>                       i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8],
>                       [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
>                   %1 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row,
>                       i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8],
>                       [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
>                   %2 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row,
>                       i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8],
>                       [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
>                   br label %if.end
>
>                 if.else:                        ; preds = %entry
>                   %3 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row,
>                       i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8],
>                       [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
>                   %4 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row,
>                       i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8],
>                       [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
>                   %5 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row,
>                       i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8],
>                       [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
>                   br label %if.end
>
>                 if.end:                         ; preds = %if.else, %if.then
>                   %a.sroa.1186.0 = phi <256 x i32> [ %3, %if.else ],
>                       [ %0, %if.then ]
>                   %b.sroa.1068.0 = phi <256 x i32> [ %4, %if.else ],
>                       [ %1, %if.then ]
>                   %c.sroa.1149.0 = phi <256 x i32> [ %5, %if.else ],
>                       [ %2, %if.then ]
>                   %6 = tail call <256 x i32> @llvm.x86.tdpbssd(i16 %row,
>                       i16 %conv.i31, i16 %conv.i31,
>                       <256 x i32> %c.sroa.1149.0,
>                       <256 x i32> %a.sroa.1186.0,
>                       <256 x i32> %b.sroa.1068.0) #3
>                   tail call void @llvm.x86.tilestored64(i16 %row,
>                       i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8],
>                       [1024 x i8]* @buf, i64 0, i64 0), i64 32,
>                       <256 x i32> %6) #3
>                   ret void
>                 }
>
>                 6. Shape propagation
>
>                 In an -O0 build, the front end generates some general
>                 loads and stores of tile vectors. We need to start from
>                 the AMX intrinsics and propagate the shape information to
>                 the virtual tile registers. If an AMX intrinsic uses
>                 the result of a load instruction, the shape is
>                 propagated to the load and the load is transformed into a
>                 tile load intrinsic. If a store instruction uses the
>                 result of an AMX intrinsic, the shape is propagated to
>                 the store instruction and the store is transformed into a
>                 tile store intrinsic.
>
>                 7. Machine IR
>
>                 Since the AMX intrinsics take the row and column as
>                 input parameters, we can create a pseudo instruction
>                 corresponding to each of them. The AMX intrinsics
>                 are lowered to pseudo AMX instructions, which have
>                 extra row and column operands corresponding to those
>                 of the AMX intrinsics. The real AMX instructions don’t
>                 need the row and column operands. The row and column
>                 information should be configured by ldtilecfg before
>                 executing any AMX instruction.
>
>                 8. Register allocation
>
>                 AMX registers are special. They need to be configured
>                 before use, and the config instruction is expensive. To
>                 avoid unnecessary tile configuration, we collect as much
>                 tile shape information as possible and combine it
>                 into one ldtilecfg instruction. The ldtilecfg
>                 instruction should dominate any AMX instruction that
>                 accesses a tile register. On the other hand, the ldtilecfg
>                 should post-dominate the instructions that define the
>                 tile shape. A tile register spill should avoid a
>                 re-config caused by a different tile shape: the spilled
>                 register should be reloaded into a register that has
>                 the same tile shape. Since tile register allocation is
>                 special and may allocate general virtual registers
>                 to configure the tile registers, we can add a separate
>                 pass to do it before the general register allocation pass.
>                 After register allocation, the tile shape information
>                 is no longer needed, so we can transform the pseudo
>                 AMX instructions into real AMX instructions by removing
>                 the row and column operands.
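The step of combining the collected shapes into one ldtilecfg can be sketched in C as follows. This is only an illustration under stated assumptions: the struct mirrors the 64-byte palette 1 memory operand of ldtilecfg, while tile_config, tile_shape, and build_tile_config are hypothetical names and not the pass proposed here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed palette 1 limits: 8 tiles, 16 rows, 64 bytes per row. */
enum { NUM_TILES = 8, MAX_ROWS = 16, MAX_COLSB = 64 };

/* 64-byte memory operand read by ldtilecfg (palette 1 layout). */
typedef struct {
    uint8_t  palette_id;    /* byte 0: 1 selects palette 1 */
    uint8_t  start_row;     /* byte 1: restart position, 0 here */
    uint8_t  reserved[14];  /* bytes 2-15: must be zero */
    uint16_t colsb[16];     /* bytes 16-47: bytes per row, per tile */
    uint8_t  rows[16];      /* bytes 48-63: rows, per tile */
} tile_config;

_Static_assert(sizeof(tile_config) == 64, "ldtilecfg operand is 64 bytes");

/* Hypothetical record: shape chosen for one physical tile register. */
typedef struct {
    uint8_t  rows;
    uint16_t colsb;
} tile_shape;

/*
 * Fill one configuration from the shapes assigned to tmm0..tmm7.
 * Unused tiles keep rows == 0 and colsb == 0.
 * Returns 0 on success, -1 if any shape exceeds the hardware limits.
 */
static int build_tile_config(tile_config *cfg,
                             const tile_shape shapes[NUM_TILES]) {
    memset(cfg, 0, sizeof *cfg);
    cfg->palette_id = 1;
    for (int i = 0; i < NUM_TILES; ++i) {
        if (shapes[i].rows > MAX_ROWS || shapes[i].colsb > MAX_COLSB)
            return -1;
        cfg->rows[i] = shapes[i].rows;
        cfg->colsb[i] = shapes[i].colsb;
    }
    return 0;
}
```

Because the shape of every physical register is fixed by this one structure, a spilled tile can only be reloaded into a register whose rows and colsb entries match, unless the whole configuration is reloaded.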
>
>             This seems complicated.
>
>             Reading through the documentation, there appears to be a
>             single global tile config for all tile registers at any time.
>
>             Why not simply model this tile config as a designated
>             special register and the tile instructions as having an
>             implicit use of this register?  That would seem to ensure
>             that the register allocator has all the constraints
>             needed.  You'd need to teach it how to spill the special
>             registers with the appropriate instructions, but that
>             seems a lot more straightforward?
>
>                 9. Usage recommendation
>
>                 Because of the shape configuration issue, we recommend
>                 that users define the tile shape at function entry and
>                 inline functions as much as possible. The AMX
>                 instructions focus on computation rather than storage,
>                 so global variables for tile data are not recommended.
>
>                 Thanks
>
>                 Yuanke
>
>                 _______________________________________________
>
>                 LLVM Developers mailing list
>
>                 llvm-dev at lists.llvm.org  <mailto:llvm-dev at lists.llvm.org>
>
>                 https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev  <https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
>
>
>     -- 
>
>     Hal Finkel
>
>     Lead, Compiler Technology and Programming Languages
>
>     Leadership Computing Facility
>
>     Argonne National Laboratory
>

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
