[llvm-dev] Intel AMX programming model discussion.

Fri Aug 14 20:29:03 PDT 2020

I find your answer unconvincing.  I'm not going to debate it as I don't 
wish to take the time to build the appropriate context, but my initial 
response is skepticism.

Philip

On 8/14/20 4:49 PM, Luo, Yuanke wrote:
>
> [Yuanke] The AMX register is special. It needs to be configured before 
> use, and the config instruction is expensive. To avoid unnecessary tile 
> configuration, we collect as much tile shape information as possible 
> and combine it into one ldtilecfg instruction. The ldtilecfg 
> instruction should dominate any AMX instruction that accesses a tile 
> register. On the other hand, the ldtilecfg should post-dominate the 
> instructions that define the tile shape. When a tile register is 
> spilled, reconfiguration due to a different tile shape should be 
> avoided: the spilled register should be reloaded into a register that 
> shares the same tile shape. Since tile register allocation is special 
> and may allocate general virtual registers to configure the tile 
> registers, we can add a separate pass that runs before the general 
> register allocation pass. After register allocation, the tile shape 
> information is no longer needed, so we can transform the pseudo AMX 
> instructions into real AMX instructions by removing the row and column 
> operands.
>
> [Philip]
>
> This seems complicated.
>
> Reading through the documentation, there appears to be a single global 
> tile config for all tile registers at any time.
>
> Why not simply model this tile config as a designated special register 
> and the tile instructions as having an implicit use of this register?  
> That would seem to ensure that the register allocator has all the 
> constraints needed.  You'd need to teach it how to spill the special 
> registers with the appropriate instructions, but that seems a lot more 
> straightforward?
>
> [Yuanke] In that case the user would need to configure the tile 
> registers themselves. Spilling the configuration register is very 
> expensive, because reloading the configuration clears all tile data 
> registers to zero. In our proposal, the compiler is responsible for 
> deducing the shape of each virtual tile data register, allocating 
> physical registers for them, and then configuring those physical 
> registers. We may build the dependency as you proposed, and it can be 
> used as a machine IR check to ensure that tile data registers are 
> configured before use.
>
> *From:*Philip Reames <listmail at philipreames.com>
> *Sent:* Saturday, August 15, 2020 1:17 AM
> *To:* Luo, Yuanke <yuanke.luo at intel.com>; llvm-dev at lists.llvm.org; 
> florian_hahn at apple.com; Kaylor, Andrew <andrew.kaylor at intel.com>; 
> Topper, Craig <craig.topper at intel.com>; Lu, Hongjiu <hongjiu.lu at intel.com>
> *Subject:* Re: [llvm-dev] Intel AMX programming model discussion.
>
> On 8/14/20 6:27 AM, Luo, Yuanke via llvm-dev wrote:
>
>     Hi,
>
>     Intel Advanced Matrix Extensions (Intel AMX) is a new programming
>     paradigm consisting of two components: a set of 2-dimensional
>     registers (tiles) representing sub-arrays from a larger
>     2-dimensional memory image, and accelerators able to operate on
>     tiles. The capabilities of an Intel AMX implementation are
>     enumerated by palettes. Two palettes are supported: palette 0
>     represents the initialized state, and palette 1 consists of 8 tile
>     registers of up to 1 KB each, which are controlled by a tile
>     control register.
>
>     The instruction manual is posted at
>     https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html.
>
>     The AMX ABI proposal is posted at
>     https://groups.google.com/g/x86-64-abi/c/NRejFm7pwb4.
>
>     This email is to discuss the programming model for AMX.
>     Florian has introduced the matrix type and intrinsics to the LLVM
>     community. We’d like to adopt some ideas from that work.
>
>     Here is what we propose for the AMX programming model.
>
>     1. Data type.
>
>     We’d like to have a fixed vector type for AMX. Since the shape of
>     an AMX register is configurable, the vector size is the maximum
>     size of an AMX register, that is, 1024 bytes.
>
>     The C code may look like this.
>
>     typedef int _tile_data __attribute__((__vector_size__(1024),
>     __aligned__(64)));
>
>     _tile_data tile;
>
>     And the LLVM IR may look like this.
>
>     @tile = dso_local local_unnamed_addr global <256 x i32>
>     zeroinitializer, align 64
>
>     For LLVM IR, it would be nice to have a new type, x86_amxtile,
>     that maps to AMX registers.
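
As a quick check of the sizes quoted in section 1, here is a minimal sketch that reuses the proposed typedef and confirms it is 1024 bytes, 64-byte aligned, and holds 256 i32 lanes (matching the <256 x i32> IR type). It assumes a GCC- or Clang-compatible compiler and a 4-byte int; tile_data_lanes is a hypothetical helper, not part of the proposal.

```c
#include <assert.h>
#include <stddef.h>

/* Same typedef as in the proposal: a 1024-byte vector of int,
   aligned to 64 bytes (GCC/Clang vector extension). */
typedef int _tile_data __attribute__((__vector_size__(1024),
                                      __aligned__(64)));

/* The maximum tile size stated in the proposal. */
static_assert(sizeof(_tile_data) == 1024, "tile vector must be 1 KB");

/* Hypothetical helper: number of int lanes in the vector type.
   With a 4-byte int this is the 256 in <256 x i32>. */
static size_t tile_data_lanes(void) {
    return sizeof(_tile_data) / sizeof(int);
}
```

The vector type only reserves the maximum storage; the actual shape still comes from the row and column values passed to the intrinsics.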
>
>     2. AMX Intrinsics.
>
>     The internal intrinsics map 1:1 to AMX instructions. The
>     parameters m, n, and k identify the shape of the tile. The shape
>     can be variable, but it cannot exceed the size that the AMX
>     hardware supports. The compiler can deduce the shape of a tile
>     from the AMX intrinsics.
>
>     _tile_data _tile_loadd_internal(char m, short n, const void *base,
>     int stride);
>
>     _tile_data _tile_dpbssd_internal(char m, short n, short k,
>     _tile_data dst, _tile_data src1, _tile_data src2);
>
>     _tile_data _tile_dpbf16ps_internal(char m, short n, short k,
>     _tile_data dst, _tile_data src1, _tile_data src2);
>
>     void _tile_stored_internal(char m, short n, void *base, int
>     stride, _tile_data tile);
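
The constraint that a shape "cannot exceed the size that AMX HW can support" can be sketched as a simple bounds check. The limits below (16 rows of 64 bytes each for palette 1) are taken from Intel's programming reference rather than from this thread, and treating the column value as a byte count is an assumption. tile_shape_fits is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed palette 1 limits: 16 rows x 64 bytes per row = 1 KB per tile. */
enum { TILE_MAX_ROWS = 16, TILE_MAX_COL_BYTES = 64 };

/* Hypothetical helper: true if a (rows, col_bytes) shape is non-empty
   and fits in one tile register under the assumed limits. */
static bool tile_shape_fits(uint8_t rows, uint16_t col_bytes) {
    return rows >= 1 && rows <= TILE_MAX_ROWS &&
           col_bytes >= 1 && col_bytes <= TILE_MAX_COL_BYTES;
}
```

A compiler could apply a check like this only when the shape is a compile-time constant; for variable shapes the limit has to hold at run time.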
>
>     3. User interfaces.
>
>     The tile shape and tile data are combined into a struct in C.
>     The shape of a tile may be initialized only once. The user
>     interface looks like this.
>
>     #define __DEFAULT_FN_AMX \
>       __attribute__((__always_inline__, __nodebug__, __target__("amx-int8")))
>
>     typedef struct __tile_str {
>       const char row;
>       const short col;
>       _tile_data tile;
>     } __tile;
>
>     __DEFAULT_FN_AMX
>     void __tile_loadd(__tile *dst, const void *base, long stride) {
>       dst->tile = _tile_loadd_internal(dst->row, dst->col, base, stride);
>     }
>
>     __DEFAULT_FN_AMX
>     void __tile_dpbsud(__tile *dst, __tile src1, __tile src2) {
>       dst->tile = _tile_dpbssd_internal(src1.row, src2.col, src1.col,
>                                         dst->tile, src1.tile, src2.tile);
>     }
>
>     __DEFAULT_FN_AMX
>     void __tile_stored(void *base, long stride, __tile src) {
>       _tile_stored_internal(src.row, src.col, base, stride, src.tile);
>     }
>
>     4. Example code
>
>     The example shows how to use the user interface in a function.
>
>     void api(int cond, short row, short col) {
>       __tile a = {row, col};
>       __tile b = {row, col};
>       __tile c = {row, col};
>
>       if (cond) {
>         __tile_loadd(&a, buf, STRIDE);
>         __tile_loadd(&b, buf, STRIDE);
>         __tile_loadd(&c, buf, STRIDE);
>       } else {
>         __tile_loadd(&a, buf2, STRIDE);
>         __tile_loadd(&b, buf2, STRIDE);
>         __tile_loadd(&c, buf2, STRIDE);
>       }
>       __tile_dpbsud(&c, a, b);
>       __tile_stored(buf, STRIDE, c);
>     }
>
>     5. LLVM IR
>
>     The LLVM IR intrinsics take the row and column information as
>     input parameters, so that the compiler can deduce the shape of
>     the tile data. The remaining parameters are what the AMX
>     instructions require. This is the LLVM IR corresponding to the
>     example code.
>
>     define dso_local void @api(i32 %cond, i16 signext %row,
>                                i16 signext %col) local_unnamed_addr #2 {
>     entry:
>       %tobool = icmp eq i32 %cond, 0
>       %sext = shl i16 %col, 8
>       %conv.i31 = ashr exact i16 %sext, 8
>       br i1 %tobool, label %if.else, label %if.then
>
>     if.then:                                          ; preds = %entry
>       %0 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row,
>           i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8],
>           [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
>       %1 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row,
>           i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8],
>           [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
>       %2 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row,
>           i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8],
>           [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
>       br label %if.end
>
>     if.else:                                          ; preds = %entry
>       %3 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row,
>           i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8],
>           [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
>       %4 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row,
>           i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8],
>           [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
>       %5 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row,
>           i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8],
>           [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
>       br label %if.end
>
>     if.end:                              ; preds = %if.else, %if.then
>       %a.sroa.1186.0 = phi <256 x i32> [ %3, %if.else ], [ %0, %if.then ]
>       %b.sroa.1068.0 = phi <256 x i32> [ %4, %if.else ], [ %1, %if.then ]
>       %c.sroa.1149.0 = phi <256 x i32> [ %5, %if.else ], [ %2, %if.then ]
>       %6 = tail call <256 x i32> @llvm.x86.tdpbssd(i16 %row,
>           i16 %conv.i31, i16 %conv.i31, <256 x i32> %c.sroa.1149.0,
>           <256 x i32> %a.sroa.1186.0, <256 x i32> %b.sroa.1068.0) #3
>       tail call void @llvm.x86.tilestored64(i16 %row, i16 %conv.i31,
>           i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf,
>           i64 0, i64 0), i64 32, <256 x i32> %6) #3
>       ret void
>     }
>
>     6. Shape propagation
>
>     In an -O0 build, the front end generates some general loads and
>     stores of the tile vector. We need to start from the AMX
>     intrinsics and propagate the shape information to the virtual
>     tile registers. If an AMX intrinsic uses the result of a load
>     instruction, the shape is propagated to the load and the load is
>     transformed into a tile load intrinsic. If a store instruction
>     uses the result of an AMX intrinsic, the shape is propagated to
>     the store and the store is transformed into a tile store
>     intrinsic.
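
The two propagation rules in section 6 can be illustrated on a toy value list. This is only a sketch under strong simplifying assumptions: it is not LLVM code, every node has at most one tile operand, and definitions always precede uses. All names (node, node_kind, propagate_shapes) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { PLAIN_LOAD, PLAIN_STORE, AMX_OP, TILE_LOAD, TILE_STORE } node_kind;

/* Toy IR node: 'operand' is the index of the tile value consumed, or -1. */
typedef struct {
    node_kind kind;
    int operand;
    bool has_shape;
    int rows;
    int cols;
} node;

static void copy_shape(node *dst, const node *src) {
    dst->has_shape = true;
    dst->rows = src->rows;
    dst->cols = src->cols;
}

/* Start from the AMX intrinsics (which carry a shape) and rewrite the
   plain loads feeding them and the plain stores consuming them. */
static void propagate_shapes(node *nodes, int count) {
    for (int i = 0; i < count; ++i) {
        int op = nodes[i].operand;
        if (op < 0)
            continue;
        assert(op < i); /* definitions precede uses in this toy IR */
        if (nodes[i].kind == AMX_OP && nodes[op].kind == PLAIN_LOAD) {
            /* Intrinsic uses a load: the load becomes a tile load. */
            nodes[op].kind = TILE_LOAD;
            copy_shape(&nodes[op], &nodes[i]);
        } else if (nodes[i].kind == PLAIN_STORE && nodes[op].kind == AMX_OP) {
            /* Store uses an intrinsic result: it becomes a tile store. */
            nodes[i].kind = TILE_STORE;
            copy_shape(&nodes[i], &nodes[op]);
        }
    }
}
```

Loads and stores that never touch an AMX intrinsic keep their original kind, which matches the description: only the intrinsics act as roots for the propagation.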
>
>     7. Machine IR
>
>     Since the AMX intrinsics take the row and column as input
>     parameters, we can create a pseudo instruction corresponding to
>     each of them. The AMX intrinsics are lowered to pseudo AMX
>     instructions, which carry the extra row and column operands of
>     the intrinsic. The real AMX instructions don’t need the row and
>     column operands. The row and column information should be
>     configured by ldtilecfg before any AMX instruction executes.
>
>     8. Register allocation
>
>     The AMX register is special. It needs to be configured before
>     use, and the config instruction is expensive. To avoid
>     unnecessary tile configuration, we collect as much tile shape
>     information as possible and combine it into one ldtilecfg
>     instruction. The ldtilecfg instruction should dominate any AMX
>     instruction that accesses a tile register. On the other hand, the
>     ldtilecfg should post-dominate the instructions that define the
>     tile shape. When a tile register is spilled, reconfiguration due
>     to a different tile shape should be avoided: the spilled register
>     should be reloaded into a register that shares the same tile
>     shape. Since tile register allocation is special and may allocate
>     general virtual registers to configure the tile registers, we can
>     add a separate pass that runs before the general register
>     allocation pass. After register allocation, the tile shape
>     information is no longer needed, so we can transform the pseudo
>     AMX instructions into real AMX instructions by removing the row
>     and column operands.
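
To make "combine them into one ldtilecfg instruction" concrete, the sketch below fills in the 64-byte memory image that ldtilecfg reads, one entry per physical tile. The field layout follows Intel's programming reference as I understand it (palette byte, start row, 16-bit bytes-per-row entries at offset 16, 8-bit row entries at offset 48) and should be treated as an assumption here. tile_config, tile_shape and build_tile_config are hypothetical names, and the code only builds the image; it does not execute ldtilecfg.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout of the 64-byte ldtilecfg memory operand. */
typedef struct {
    uint8_t  palette_id;    /* byte 0 */
    uint8_t  start_row;     /* byte 1 */
    uint8_t  reserved0[14]; /* bytes 2..15 */
    uint16_t colsb[16];     /* bytes 16..47: bytes per row, per tile */
    uint8_t  rows[16];      /* bytes 48..63: row count, per tile */
} tile_config;

static_assert(sizeof(tile_config) == 64, "config image must be 64 bytes");

/* Shape assigned to one physical tile register. */
typedef struct {
    uint8_t  rows;
    uint16_t col_bytes;
} tile_shape;

/* Hypothetical helper: fold the shapes of up to 8 physical tiles into
   a single config image, so one ldtilecfg can cover all of them. */
static void build_tile_config(tile_config *cfg, const tile_shape *shapes,
                              int count) {
    assert(count >= 0 && count <= 8); /* palette 1 has 8 tile registers */
    memset(cfg, 0, sizeof *cfg);
    cfg->palette_id = 1;
    for (int i = 0; i < count; ++i) {
        cfg->rows[i] = shapes[i].rows;
        cfg->colsb[i] = shapes[i].col_bytes;
    }
}
```

Because every tile's shape lives in the same image, a spilled tile can only be reloaded without reconfiguration into a physical register whose entry has the same rows and bytes-per-row values, which is the constraint the proposed pre-allocation pass would have to respect.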
>
> This seems complicated.
>
> Reading through the documentation, there appears to be a single global 
> tile config for all tile registers at any time.
>
> Why not simply model this tile config as a designated special register 
> and the tile instructions as having an implicit use of this register?  
> That would seem to ensure that the register allocator has all the 
> constraints needed.  You'd need to teach it how to spill the special 
> registers with the appropriate instructions, but that seems a lot more 
> straightforward?
>
>     9. Use recommendation
>
>     Because of the shape configuration issue, we recommend that users
>     define the tile shape at function entry and inline functions as
>     much as possible. The AMX instructions focus on computation
>     rather than storage, so using global variables for tile data is
>     not recommended.
>
>     Thanks
>
>     Yuanke
>