[llvm-dev] Intel AMX programming model discussion.

Hal Finkel via llvm-dev llvm-dev at lists.llvm.org
Wed Aug 19 01:57:35 PDT 2020


On 8/19/20 2:21 AM, Luo, Yuanke wrote:
>
> Hi Hal,
>
> There are 3 aspects to be solved.
>
> 1. The HW supports a maximum shape of 16x16, so there are many register
> classes, from 1x1 to 16x16. We need 256 register classes.
>
> 2. We want to support variable shapes, so the compiler doesn’t know which
> register class fits a tile shape, as the shape is only known at runtime.
>
> 3. The tile configuration configures the physical tile registers, so we
> need to allocate registers first; only then do we know the shape of each
> physical tile register and can configure it.
>
> I think your suggestion helps reduce the complexity if we only
> support fixed (constant) tile shapes.
>
> -Yuanke
>

Thanks, Yuanke.

It's not clear to me that having 256 register classes is, in itself, a 
problem. Is it?

What does it mean to support variable-shape tiles in this context? Do 
you do something other than conservatively assume that they are 16x16 
for register-allocation purposes?
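
To make the counting concrete, here is a minimal sketch of how I read the numbers. The dense numbering and the function names are purely illustrative assumptions, not anything from the proposal; the only inputs taken from the thread are the 1x1 to 16x16 range and the idea of falling back to 16x16 when the shape is unknown.

```c
#include <assert.h>

/* Limits quoted in the thread: shapes range from 1x1 up to 16x16. */
enum { MAX_ROWS = 16, MAX_COLS = 16, NUM_SHAPE_CLASSES = MAX_ROWS * MAX_COLS };

/* Hypothetical dense numbering: one register class per (rows, cols) pair. */
static int shape_class_id(int rows, int cols)
{
    assert(rows >= 1 && rows <= MAX_ROWS);
    assert(cols >= 1 && cols <= MAX_COLS);
    return (rows - 1) * MAX_COLS + (cols - 1);
}

/* Conservative choice when the shape is only known at run time:
 * treat the tile as the largest shape, 16x16. */
static int conservative_class_id(void)
{
    return shape_class_id(MAX_ROWS, MAX_COLS);
}
```

With this numbering a 2x4 tile lands in class 19 and the conservative fallback is class 255, the last of the 256.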

  -Hal


> *From:* Hal Finkel <hfinkel at anl.gov>
> *Sent:* Wednesday, August 19, 2020 8:20 AM
> *To:* Kaylor, Andrew <andrew.kaylor at intel.com>; Philip Reames 
> <listmail at philipreames.com>; Luo, Yuanke <yuanke.luo at intel.com>; 
> llvm-dev at lists.llvm.org; florian_hahn at apple.com; Topper, Craig 
> <craig.topper at intel.com>; Lu, Hongjiu <hongjiu.lu at intel.com>
> *Subject:* Re: [llvm-dev] Intel AMX programming model discussion.
>
> Hi, Andy,
>
> I don't quite understand everything that's going on here. Could we 
> model this as:
>
>  1. Define a collection of register classes, one for 2x4 tiles, one 
> for 4x2 tiles, etc. each populated with a set of tile registers. 
> Registers can have aliasing relationships (instead of worrying about any 
> kind of subregister/superregister relationships -- these won't be 
> useful anyway).
>
>  2. Define the tile-configuration instructions so that they implicitly 
> define all of the registers in all of the classes.
>
> Then you would still need to pre-schedule the tile operations as 
> you've described, and collect the configuration information in order 
> to add the ldtilecfgs, but the regular register allocator can handle 
> the allocation itself in the usual way. What do you think?
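
A rough sketch of the aliasing idea in the two points above, written as a toy C model rather than TableGen. The struct and function names are hypothetical; the count of 8 physical tiles comes from the palette 1 description in the original proposal.

```c
#include <assert.h>
#include <stdbool.h>

enum { NUM_PHYS_TILES = 8 };

/* Hypothetical register name: one physical tile viewed at one fixed shape.
 * Each per-shape register class would hold NUM_PHYS_TILES such names. */
typedef struct {
    int phys;   /* physical tile index, 0..NUM_PHYS_TILES-1 */
    int rows;
    int cols;
} TileReg;

/* Two names alias when they denote the same physical tile. */
static bool tile_regs_alias(TileReg a, TileReg b)
{
    return a.phys == b.phys;
}

/* A set of simultaneously live registers is valid if no two of them alias. */
static bool live_set_is_valid(const TileReg *live, int n)
{
    for (int i = 0; i < n; i++) {
        assert(live[i].phys >= 0 && live[i].phys < NUM_PHYS_TILES);
        for (int j = i + 1; j < n; j++)
            if (tile_regs_alias(live[i], live[j]))
                return false;
    }
    return true;
}
```

Under this model a 2x4 value and a 4x4 value can be live together only if they sit on different physical tiles, which is the constraint the allocator would see through the aliasing relationships.
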
>
>  -Hal
>
> On 8/18/20 6:58 PM, Kaylor, Andrew via llvm-dev wrote:
>
>     The AMX registers are complicated. The single configuration
>     register (which is mostly used implicitly, similar to MXCSR for
>     floating point) controls the shape of all the tile registers, and
>     if you change the tile configuration every single tile register is
>     cleared. In practice, if we have to change the configuration
>     while any of the tile registers are live, performance is going to
>     be terrible. We need to handle this case for correctness, but
>     users of this programming interface will need to have enough
>     awareness of the performance issues and the hardware details to
>     prevent this. We’ll also want a diagnostic that lets the user know
>     when this has happened.
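
The behavior described in this paragraph can be modeled in a few lines of C. This is only a software sketch under stated assumptions: the TileState struct and both function names are made up for illustration, and the per-tile limits (16 rows of 64 bytes) are my reading of the 1 KB figure in the proposal.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { NUM_TILES = 8, MAX_ROWS = 16, MAX_COLSB = 64 };

/* Hypothetical software model of the tile register file. */
typedef struct {
    uint8_t  rows[NUM_TILES];
    uint16_t colsb[NUM_TILES];                      /* bytes per row */
    uint8_t  data[NUM_TILES][MAX_ROWS * MAX_COLSB];
} TileState;

/* Model of a configuration change: install the new shapes and
 * zero every tile, live or not. */
static void model_set_config(TileState *st,
                             const uint8_t rows[NUM_TILES],
                             const uint16_t colsb[NUM_TILES])
{
    assert(st != NULL);
    memcpy(st->rows, rows, sizeof st->rows);
    memcpy(st->colsb, colsb, sizeof st->colsb);
    memset(st->data, 0, sizeof st->data);
}

/* Count tiles holding any nonzero byte, i.e. what a reconfiguration
 * would destroy unless the tiles were spilled first. */
static int count_nonzero_tiles(const TileState *st)
{
    int n = 0;
    for (int t = 0; t < NUM_TILES; t++) {
        for (int i = 0; i < MAX_ROWS * MAX_COLSB; i++) {
            if (st->data[t][i] != 0) {
                n++;
                break;
            }
        }
    }
    return n;
}
```

Every tile that is live across a model_set_config call has to be stored before and reloaded after, which is where the performance cliff comes from.
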
>
>     When the tile configuration is set, the shape of each tile is
>     locked in, so the individual tile registers aren’t interchangeable
>     at that point. If a function needs 2x4 tiles, 4x2 tiles, and 4x4
>     tiles, the configuration needs to be set with this in mind. The
>     shape isn’t explicit in every instruction and intrinsic. It must
>     be deduced. And again, we’ll need a way to tell the user when
>     efficient allocation can’t be done. In practice, I don’t expect
>     any function to be using more than three tile shapes.
>
>     The implication of all this is that I don’t think the greedy
>     register allocator is well suited to figure all of this out. We
>     need a special pass to pre-allocate these registers. If the
>     function is written in a way that makes good performance possible,
>     it should be a relatively simple task to allocate everything with
>     minimal spilling. If it isn’t possible to get good performance, we
>     don’t need to do anything especially clever. We can just do
>     something straightforward that is correct and let the user know
>     that they aren’t going to be happy with the results.
>
>     -Andy
>
>     *From:* Philip Reames <listmail at philipreames.com>
>     *Sent:* Friday, August 14, 2020 8:29 PM
>     *To:* Luo, Yuanke <yuanke.luo at intel.com>; llvm-dev at lists.llvm.org;
>     florian_hahn at apple.com; Kaylor, Andrew <andrew.kaylor at intel.com>;
>     Topper, Craig <craig.topper at intel.com>; Lu, Hongjiu
>     <hongjiu.lu at intel.com>
>     *Subject:* Re: [llvm-dev] Intel AMX programming model discussion.
>
>     I find your answer unconvincing.  I'm not going to debate it as I
>     don't wish to take the time to build the appropriate context, but
>     my initial response is skepticism.
>
>     Philip
>
>     On 8/14/20 4:49 PM, Luo, Yuanke wrote:
>
>         [Yuanke] AMX registers are special. They need to be configured
>         before use, and the config instruction is expensive. To avoid
>         unnecessary tile configuration, we collect as much tile shape
>         information as possible and combine it into one ldtilecfg
>         instruction. The ldtilecfg instruction should dominate every
>         AMX instruction that accesses a tile register. Conversely,
>         the ldtilecfg should post-dominate the instructions that
>         define the tile shapes. For tile register spills, we should
>         avoid re-configuration caused by a different tile shape: a
>         spilled register should be reloaded into a register that has
>         the same tile shape. Since tile register allocation is special
>         and may need to allocate general virtual registers to
>         configure the tile registers, we can add a separate pass that
>         runs before the general register allocation pass. After
>         register allocation, the tile shape information is no longer
>         needed, so we can transform the pseudo AMX instructions into
>         real AMX instructions by removing the row and column operands.
>
>         [Philip]
>
>         This seems complicated.
>
>         Reading through the documentation, there appears to be a
>         single global tile config for all tile registers at any time.
>
>         Why not simply model this tile config as a designated special
>         register and the tile instructions as having an implicit use
>         of this register?  That would seem to ensure that the register
>         allocator has all the constraints needed.  You'd need to teach
>         it how to spill the special registers with the appropriate
>         instructions, but that seems a lot more straightforward?
>
>         [Yuanke] In that case users would need to configure the tile
>         registers themselves. Spilling the configuration register is
>         very expensive, because it clears all the tile data registers
>         to zero. In our proposal, the compiler is responsible for
>         deducing the shape of each virtual tile data register,
>         allocating physical registers for them, and then configuring
>         those physical registers. We may build the dependency as you
>         proposed, and it can be used for a machine IR check to ensure
>         that tile data registers are configured before use.
>
>         *From:* Philip Reames <listmail at philipreames.com>
>         *Sent:* Saturday, August 15, 2020 1:17 AM
>         *To:* Luo, Yuanke <yuanke.luo at intel.com>;
>         llvm-dev at lists.llvm.org; florian_hahn at apple.com; Kaylor,
>         Andrew <andrew.kaylor at intel.com>; Topper, Craig
>         <craig.topper at intel.com>; Lu, Hongjiu <hongjiu.lu at intel.com>
>         *Subject:* Re: [llvm-dev] Intel AMX programming model discussion.
>
>         On 8/14/20 6:27 AM, Luo, Yuanke via llvm-dev wrote:
>
>             Hi,
>
>             Intel Advanced Matrix Extensions (Intel AMX) is a new
>             programming paradigm consisting of two components: a set
>             of 2-dimensional registers (tiles) representing sub-arrays
>             from a larger 2-dimensional memory image, and accelerators
>             able to operate on tiles. The capabilities of an Intel AMX
>             implementation are enumerated by palettes. Two palettes are
>             supported: palette 0 represents the initialized state, and
>             palette 1 consists of 8 tile registers of up to 1 KB each,
>             which are controlled by a tile control register.
>
>             The instruction manual is posted at
>             https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
>
>             The AMX ABI proposal is posted at
>             https://groups.google.com/g/x86-64-abi/c/NRejFm7pwb4
>
>             This email is to discuss the programming model for AMX.
>             Florian has introduced the matrix type and intrinsics to
>             the LLVM community. We’d like to adopt some ideas from them.
>
>             Here is what we propose for the AMX programming model.
>
>             1. Data type.
>
>             We’d like to have a fixed vector type for AMX. Since the
>             shape of an AMX register is configurable, the vector size
>             is the maximum size of an AMX register. That means the
>             vector size is 1024 bytes.
>
>             The C code may look like this.
>
>             typedef int _tile_data
>             __attribute__((__vector_size__(1024), __aligned__(64)));
>
>             _tile_data tile;
>
>             And the LLVM IR may look like this.
>
>             @tile = dso_local local_unnamed_addr global <256 x i32>
>             zeroinitializer, align 64
>
>             For LLVM IR, it would be nice to have a new type,
>             x86_amxtile, that can be mapped to AMX registers.
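
As a sanity check on the arithmetic in this section, a small sketch: 16 rows of 16 32-bit elements is 1024 bytes, which matches both the vector_size attribute and the <256 x i32> IR type. The typedef is copied from the proposal and relies on the GCC/Clang vector extension; the enum names and tile_used_bytes are my own illustrative additions, and the sketch assumes a 4-byte int as on x86.

```c
#include <assert.h>

/* Tile data type from the proposal (GCC/Clang vector extension). */
typedef int _tile_data
    __attribute__((__vector_size__(1024), __aligned__(64)));

enum { TILE_MAX_ROWS = 16, TILE_MAX_COLS = 16 };  /* in 32-bit elements */

_Static_assert(sizeof(_tile_data) == 1024, "container is 1 KB");
_Static_assert(sizeof(_tile_data) / sizeof(int) == 256,
               "matches <256 x i32> in the IR");
_Static_assert(TILE_MAX_ROWS * TILE_MAX_COLS * sizeof(int) == 1024,
               "16x16 32-bit elements fill the container");

/* Hypothetical helper: bytes actually occupied by a rows x cols tile
 * of 32-bit elements inside the fixed 1 KB container. */
static int tile_used_bytes(int rows, int cols)
{
    assert(rows >= 1 && rows <= TILE_MAX_ROWS);
    assert(cols >= 1 && cols <= TILE_MAX_COLS);
    return rows * cols * (int)sizeof(int);
}
```

A 2x4 tile therefore uses only 32 of the 1024 bytes, which is why the fixed-size vector type is a container for the largest shape rather than a description of the actual shape.
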
>
>             2. AMX Intrinsics.
>
>             The internal intrinsics are mapped 1:1 to AMX
>             instructions. The parameters m, n, and k identify the
>             shape of the tiles. The shape can be variable, but it
>             cannot exceed the size that the AMX HW supports. The
>             compiler can deduce the shape of each tile from the AMX
>             intrinsics.
>
>             _tile_data _tile_loadd_internal(char m, short n, const
>             void *base, int stride);
>
>             _tile_data _tile_dpbssd_internal(char m, short n, short k,
>             _tile_data dst, _tile_data src1, _tile_data src2);
>
>             _tile_data _tile_dpbf16ps_internal(char m, short n, short
>             k, _tile_data dst, _tile_data src1, _tile_data src2);
>
>             void _tile_stored_internal(char m, short n, void *base,
>             int stride, _tile_data tile);
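
A hedged sketch of the "cannot exceed the size that AMX HW can support" rule for these signatures. I am assuming here that the column arguments are measured in bytes and that the limits are 16 rows of 64 bytes (which multiplies out to the 1 KB per tile stated earlier); the proposal does not say so explicitly, and all names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed hardware limits: 16 rows, 64 bytes per row. */
enum { HW_MAX_ROWS = 16, HW_MAX_COL_BYTES = 64 };

static bool shape_fits_hw(int rows, int col_bytes)
{
    return rows >= 1 && rows <= HW_MAX_ROWS &&
           col_bytes >= 1 && col_bytes <= HW_MAX_COL_BYTES;
}

/* Check for a dot-product intrinsic taking (m, n, k), reading dst as
 * m x n and src1 as m x k. How k maps onto the rows of src2 is not
 * spelled out in the proposal, so that operand is not checked here. */
static bool dot_product_shapes_fit(int m, int n, int k)
{
    return shape_fits_hw(m, n) && shape_fits_hw(m, k);
}

/* Size in bytes of a legal tile; asserts on an illegal shape. */
static int tile_bytes(int rows, int col_bytes)
{
    assert(shape_fits_hw(rows, col_bytes));
    return rows * col_bytes;
}
```

When m, n, and k are run-time values, a check like this can only be a run-time assertion or a diagnostic on constant arguments.
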
>
>             3. User interfaces.
>
>             The tile shape and tile data are combined into a struct in
>             C. The shape of a tile can be initialized only once. The
>             user interface looks like this.
>
>             #define __DEFAULT_FN_AMX    \
>               __attribute__((__always_inline__, __nodebug__, __target__("amx-int8")))
>
>             typedef struct __tile_str {
>               const char row;
>               const short col;
>               _tile_data tile;
>             }__tile;
>
>             __DEFAULT_FN_AMX
>             void __tile_loadd(__tile *dst, const void *base, long stride) {
>               dst->tile = _tile_loadd_internal(dst->row, dst->col, base, stride);
>             }
>
>             __DEFAULT_FN_AMX
>             void __tile_dpbsud(__tile *dst, __tile src1, __tile src2) {
>               dst->tile = _tile_dpbssd_internal(src1.row, src2.col, src1.col,
>                                                 dst->tile, src1.tile, src2.tile);
>             }
>
>             __DEFAULT_FN_AMX
>             void __tile_stored(void *base, long stride, __tile src) {
>               _tile_stored_internal(src.row, src.col, base, stride, src.tile);
>             }
>
>             4. Example code
>
>             The example shows how to use the user interface in a
>             function.
>
>             void api(int cond, short row, short col) {
>               __tile a = {row, col};
>               __tile b = {row, col};
>               __tile c = {row, col};
>
>               if(cond) {
>                 __tile_loadd(&a, buf, STRIDE);
>                 __tile_loadd(&b, buf, STRIDE);
>                 __tile_loadd(&c, buf, STRIDE);
>               } else {
>                 __tile_loadd(&a, buf2, STRIDE);
>                 __tile_loadd(&b, buf2, STRIDE);
>                 __tile_loadd(&c, buf2, STRIDE);
>               }
>               __tile_dpbsud(&c, a, b);
>               __tile_stored(buf, STRIDE, c);
>             }
>
>             5. LLVM IR
>
>             The LLVM intrinsics take the row and column information
>             as input parameters, so that the compiler can deduce the
>             shape of the tile data. The remaining parameters are what
>             the AMX instructions require. This is the LLVM IR
>             corresponding to the example code.
>
>             define dso_local void @api(i32 %cond, i16 signext %row,
>                 i16 signext %col) local_unnamed_addr #2 {
>             entry:
>               %tobool = icmp eq i32 %cond, 0
>               %sext = shl i16 %col, 8
>               %conv.i31 = ashr exact i16 %sext, 8
>               br i1 %tobool, label %if.else, label %if.then
>
>             if.then:                          ; preds = %entry
>               %0 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row,
>                   i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8],
>                   [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
>               %1 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row,
>                   i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8],
>                   [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
>               %2 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row,
>                   i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8],
>                   [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
>               br label %if.end
>
>             if.else:                          ; preds = %entry
>               %3 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row,
>                   i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8],
>                   [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
>               %4 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row,
>                   i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8],
>                   [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
>               %5 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row,
>                   i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8],
>                   [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
>               br label %if.end
>
>             if.end:                           ; preds = %if.else, %if.then
>               %a.sroa.1186.0 = phi <256 x i32> [ %3, %if.else ], [ %0, %if.then ]
>               %b.sroa.1068.0 = phi <256 x i32> [ %4, %if.else ], [ %1, %if.then ]
>               %c.sroa.1149.0 = phi <256 x i32> [ %5, %if.else ], [ %2, %if.then ]
>               %6 = tail call <256 x i32> @llvm.x86.tdpbssd(i16 %row,
>                   i16 %conv.i31, i16 %conv.i31, <256 x i32> %c.sroa.1149.0,
>                   <256 x i32> %a.sroa.1186.0, <256 x i32> %b.sroa.1068.0) #3
>               tail call void @llvm.x86.tilestored64(i16 %row, i16 %conv.i31,
>                   i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf,
>                   i64 0, i64 0), i64 32, <256 x i32> %6) #3
>               ret void
>             }
>
>             6. Shape propagation
>
>             In an -O0 build, the front end generates some general
>             loads and stores for tile vectors. Starting from the AMX
>             intrinsics, we need to propagate the shape information to
>             the virtual tile registers. If an AMX intrinsic uses the
>             result of a load instruction, the shape is propagated to
>             the load and the load is transformed into a tile load
>             intrinsic. If a store instruction uses the result of an
>             AMX intrinsic, the shape is propagated to the store
>             instruction and the store is transformed into a tile store
>             intrinsic.
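
The two propagation rules in this section can be sketched on a toy IR. Everything here (the Node struct, the Kind enum, propagate_shapes) is a hypothetical simplification for illustration; it is not the proposed pass and ignores phis, bitcasts, and anything else that could sit between a load and its AMX user.

```c
#include <assert.h>

enum Kind { GENERIC_LOAD, GENERIC_STORE, TILE_LOAD, TILE_STORE, AMX_OP };

enum { MAX_INPUTS = 3 };

/* Hypothetical, heavily simplified IR node. */
typedef struct {
    enum Kind kind;
    int num_inputs;
    int inputs[MAX_INPUTS];    /* indices of the nodes producing each operand */
    int in_rows[MAX_INPUTS];   /* AMX_OP only: shape required of each operand */
    int in_cols[MAX_INPUTS];
    int rows, cols;            /* shape attached to this node; 0 = unknown */
} Node;

/* Returns the number of generic loads/stores rewritten to tile form. */
static int propagate_shapes(Node *nodes, int n)
{
    int rewritten = 0;
    for (int i = 0; i < n; i++) {
        Node *op = &nodes[i];
        if (op->kind == AMX_OP) {
            /* Rule 1: an AMX operation uses a generic load, so the
             * operand's shape is pushed back onto the load. */
            for (int j = 0; j < op->num_inputs; j++) {
                assert(op->inputs[j] >= 0 && op->inputs[j] < n);
                Node *def = &nodes[op->inputs[j]];
                if (def->kind == GENERIC_LOAD) {
                    def->kind = TILE_LOAD;
                    def->rows = op->in_rows[j];
                    def->cols = op->in_cols[j];
                    rewritten++;
                }
            }
        } else if (op->kind == GENERIC_STORE && op->num_inputs == 1) {
            /* Rule 2: a generic store of an AMX result takes the
             * result's shape. */
            assert(op->inputs[0] >= 0 && op->inputs[0] < n);
            const Node *def = &nodes[op->inputs[0]];
            if (def->kind == AMX_OP) {
                op->kind = TILE_STORE;
                op->rows = def->rows;
                op->cols = def->cols;
                rewritten++;
            }
        }
    }
    return rewritten;
}
```

A generic load whose only users are other generic operations is left alone in this sketch, since no AMX intrinsic provides a shape for it.
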
>
>             7. Machine IR
>
>             Since the AMX intrinsics take the row and column as
>             input parameters, we can create a pseudo instruction
>             for each of them. The AMX intrinsics are lowered to
>             pseudo AMX instructions, which carry the extra row and
>             column operands of the intrinsic. The real AMX
>             instructions don’t need the row and column operands. The
>             row and column information should be configured by
>             ldtilecfg before any AMX instruction executes.
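
The pseudo-to-real step described here amounts to dropping the leading shape operands. A minimal sketch on a made-up instruction record follows; ToyInstr, the opcode names, and the operand ordering are all assumptions for illustration and do not reflect actual MachineInstr layout.

```c
#include <assert.h>

enum Opcode { PSEUDO_TILELOADD, PSEUDO_TDPBSSD, TILELOADD, TDPBSSD };

enum { MAX_OPERANDS = 8 };

/* Hypothetical instruction record: shape operands come first. */
typedef struct {
    enum Opcode opcode;
    int num_operands;
    int operands[MAX_OPERANDS];   /* register numbers or immediates */
} ToyInstr;

/* Leading shape operands carried by each pseudo:
 * (row, col) for the load, (m, n, k) for the dot product. */
static int num_shape_operands(enum Opcode op)
{
    switch (op) {
    case PSEUDO_TILELOADD: return 2;
    case PSEUDO_TDPBSSD:   return 3;
    default:               return 0;
    }
}

static enum Opcode real_opcode(enum Opcode op)
{
    switch (op) {
    case PSEUDO_TILELOADD: return TILELOADD;
    case PSEUDO_TDPBSSD:   return TDPBSSD;
    default:               return op;
    }
}

/* Rewrite a pseudo into the real instruction by removing the shape
 * operands; real instructions pass through unchanged. */
static ToyInstr lower_pseudo(ToyInstr mi)
{
    int skip = num_shape_operands(mi.opcode);
    assert(skip <= mi.num_operands && mi.num_operands <= MAX_OPERANDS);
    ToyInstr out = { .opcode = real_opcode(mi.opcode),
                     .num_operands = mi.num_operands - skip };
    for (int i = 0; i < out.num_operands; i++)
        out.operands[i] = mi.operands[i + skip];
    return out;
}
```

Once the shape operands are gone the shape lives only in the tile configuration, so this rewrite has to run after the configuration has been emitted.
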
>
>             8. Register allocation
>
>             AMX registers are special. They need to be configured
>             before use, and the config instruction is expensive. To
>             avoid unnecessary tile configuration, we collect as much
>             tile shape information as possible and combine it into one
>             ldtilecfg instruction. The ldtilecfg instruction should
>             dominate every AMX instruction that accesses a tile
>             register. Conversely, the ldtilecfg should post-dominate
>             the instructions that define the tile shapes. For tile
>             register spills, we should avoid re-configuration caused
>             by a different tile shape: a spilled register should be
>             reloaded into a register that has the same tile shape.
>             Since tile register allocation is special and may need to
>             allocate general virtual registers to configure the tile
>             registers, we can add a separate pass that runs before the
>             general register allocation pass. After register
>             allocation, the tile shape information is no longer
>             needed, so we can transform the pseudo AMX instructions
>             into real AMX instructions by removing the row and column
>             operands.
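
To illustrate "combine them into one ldtilecfg", here is a sketch that packs the shapes chosen for each physical tile into the 64-byte memory operand. The layout (palette in byte 0, start_row in byte 1, bytes-per-row entries starting at byte 16, row counts starting at byte 48) is my understanding of the instruction manual and should be checked against it; TileShape and build_tile_config are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 64-byte ldtilecfg memory operand, as I understand the manual. */
typedef struct {
    uint8_t  palette_id;
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];   /* bytes per row, one entry per tile */
    uint8_t  rows[16];    /* row count, one entry per tile */
} TileConfig;

_Static_assert(sizeof(TileConfig) == 64, "ldtilecfg operand is 64 bytes");

/* Hypothetical result of allocation: the shape assigned to tile i. */
typedef struct {
    uint8_t  rows;
    uint16_t colsb;
} TileShape;

/* Fill cfg for palette 1 from the shapes of tiles 0..n-1.
 * Returns 0 on success, -1 if n or a shape exceeds the limits. */
static int build_tile_config(TileConfig *cfg, const TileShape *shapes, int n)
{
    assert(cfg != NULL);
    if (n < 0 || n > 8)
        return -1;
    memset(cfg, 0, sizeof *cfg);
    cfg->palette_id = 1;
    for (int i = 0; i < n; i++) {
        if (shapes[i].rows == 0 || shapes[i].rows > 16 ||
            shapes[i].colsb == 0 || shapes[i].colsb > 64)
            return -1;
        cfg->rows[i] = shapes[i].rows;
        cfg->colsb[i] = shapes[i].colsb;
    }
    return 0;
}
```

With run-time shapes, the stores into rows[] and colsb[] become ordinary instructions on general registers, which is why the pass may need to allocate general virtual registers and why the ldtilecfg must come after the shape definitions.
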
>
>         This seems complicated.
>
>         Reading through the documentation, there appears to be a
>         single global tile config for all tile registers at any time.
>
>         Why not simply model this tile config as a designated special
>         register and the tile instructions as having an implicit use
>         of this register?  That would seem to ensure that the register
>         allocator has all the constraints needed.  You'd need to teach
>         it how to spill the special registers with the appropriate
>         instructions, but that seems a lot more straightforward?
>
>             9. Use recommendation
>
>             Due to the shape configuration issue, we recommend that
>             users define the tile shape at function entry and inline
>             functions as much as possible. The AMX instructions focus
>             on computation rather than storage, so global variables
>             for tile data are not recommended.
>
>             Thanks
>
>             Yuanke
>
>             _______________________________________________
>             LLVM Developers mailing list
>             llvm-dev at lists.llvm.org
>             https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
> -- 
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
