[llvm-dev] Intel AMX programming model discussion.
Philip Reames via llvm-dev
llvm-dev at lists.llvm.org
Fri Aug 14 20:29:03 PDT 2020
I find your answer unconvincing. I'm not going to debate it as I don't
wish to take the time to build the appropriate context, but my initial
response is skepticism.
Philip
On 8/14/20 4:49 PM, Luo, Yuanke wrote:
>
> [Yuanke] AMX registers are special. They need to be configured before
> use, and the config instruction is expensive. To avoid unnecessary tile
> configuration, we collect as much tile shape information as possible
> and combine it into one ldtilecfg instruction. The ldtilecfg
> instruction should dominate any AMX instruction that accesses a tile
> register. On the other hand, the ldtilecfg should post-dominate the
> instructions that define the tile shapes. When a tile register is
> spilled, we should avoid re-configuring because of a different tile
> shape: the spilled register should be reloaded into a register that
> has the same tile shape. Since tile register allocation is special and
> may allocate general virtual registers to configure the tile
> registers, we can add a separate pass that does it before the general
> register allocation pass. After register allocation, the tile shape
> information is not needed anymore, so we can transform the pseudo AMX
> instructions into real AMX instructions by removing the row and column
> operands.
>
> [Philip]
>
> This seems complicated.
>
> Reading through the documentation, there appears to be a single global
> tile config for all tile registers at any time.
>
> Why not simply model this tile config as a designated special register
> and the tile instructions as having an implicit use of this register?
> That would seem to ensure that the register allocator has all the
> constraints needed. You'd need to teach it how to spill the special
> registers with the appropriate instructions, but that seems a lot more
> straightforward?
>
> [Yuanke] In that case the user needs to configure the tile registers
> themselves. Spilling the configure register is very expensive, because
> it clears all the tile data registers to zero. In our proposal, the
> compiler is responsible for deducing the shape of each virtual tile
> data register, allocating physical registers for them, and then
> configuring those physical registers. We may build the dependency as
> you proposed, and it can be used as a machine IR check to ensure that
> a tile data register is configured before use.
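
The "configured before use" check discussed above can be sketched as a toy
model in C. The sketch treats the tile configuration as one special register:
ldtilecfg defines it, tilerelease kills it, and every tile instruction
implicitly uses it. The `mi_opcode` names, `struct mi_inst`, and
`first_unconfigured_use` are hypothetical and are not LLVM APIs, and the scan
assumes a straight-line instruction list, whereas a real check would have to
walk the machine CFG.

```c
#include <stddef.h>

/* Hypothetical opcodes for a toy straight-line instruction stream. */
enum mi_opcode {
    MI_LDTILECFG,   /* defines the single, global tile config */
    MI_TILERELEASE, /* kills the tile config                  */
    MI_TILELOADD,   /* implicit use of the tile config        */
    MI_TDPBSSD,     /* implicit use of the tile config        */
    MI_TILESTORED,  /* implicit use of the tile config        */
    MI_OTHER        /* any non-AMX instruction                */
};

struct mi_inst {
    enum mi_opcode op;
};

static int uses_tile_config(enum mi_opcode op) {
    return op == MI_TILELOADD || op == MI_TDPBSSD || op == MI_TILESTORED;
}

/* Returns the index of the first tile instruction reached while the tile
 * config is undefined, or -1 if every use is covered by a definition. */
long first_unconfigured_use(const struct mi_inst *insts, size_t n) {
    int configured = 0;
    for (size_t i = 0; i < n; ++i) {
        if (insts[i].op == MI_LDTILECFG)
            configured = 1;
        else if (insts[i].op == MI_TILERELEASE)
            configured = 0;
        else if (uses_tile_config(insts[i].op) && !configured)
            return (long)i;
    }
    return -1;
}
```

A result of -1 means the stream passes the check; any other value points at
the offending instruction.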
>
> *From:*Philip Reames <listmail at philipreames.com>
> *Sent:* Saturday, August 15, 2020 1:17 AM
> *To:* Luo, Yuanke <yuanke.luo at intel.com>; llvm-dev at lists.llvm.org;
> florian_hahn at apple.com; Kaylor, Andrew <andrew.kaylor at intel.com>;
> Topper, Craig <craig.topper at intel.com>; Lu, Hongjiu <hongjiu.lu at intel.com>
> *Subject:* Re: [llvm-dev] Intel AMX programming model discussion.
>
> On 8/14/20 6:27 AM, Luo, Yuanke via llvm-dev wrote:
>
> Hi,
>
> Intel Advanced Matrix Extensions (Intel AMX) is a new programming
> paradigm consisting of two components: a set of 2-dimensional
> registers (tiles) representing sub-arrays from a larger
> 2-dimensional memory image, and accelerators able to operate on
> tiles. The capabilities of an Intel AMX implementation are enumerated
> by palettes. Two palettes are supported: palette 0 represents the
> initialized state, and palette 1 consists of 8 tile registers of up
> to 1 KB each, which are controlled by a tile control register.
>
> The instruction manual is posted at
> https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html.
>
> The AMX ABI proposal is posted at
> https://groups.google.com/g/x86-64-abi/c/NRejFm7pwb4.
>
> This email is to discuss the programming model for AMX.
> Florian has introduced the matrix type and intrinsics to the LLVM
> community, and we’d like to adopt some ideas from that work.
>
> Here is what we propose for the AMX programming model.
>
> 1. Data type.
>
> We’d like to have a fixed vector type for AMX. Since the shape of an
> AMX register is configurable, the vector size is the maximum size of
> an AMX register, that is, 1024 bytes.
>
> The C code may look like this.
>
>     typedef int _tile_data __attribute__((__vector_size__(1024),
>                                           __aligned__(64)));
>
>     _tile_data tile;
>
> And the LLVM IR may look like this.
>
>     @tile = dso_local local_unnamed_addr global <256 x i32>
>         zeroinitializer, align 64
>
> For LLVM IR, it would be nice to have a new type, x86_amxtile, that
> can be mapped to AMX registers.
>
> 2. AMX Intrinsics.
>
> The internal intrinsics map 1:1 to AMX instructions. The parameters
> m, n, and k identify the shape of the tile. The shape can be
> variable, but it cannot exceed the size that the AMX hardware
> supports. The compiler can deduce the shape of a tile from the AMX
> intrinsics.
>
>     _tile_data _tile_loadd_internal(char m, short n, const void *base,
>                                     int stride);
>
>     _tile_data _tile_dpbssd_internal(char m, short n, short k,
>                                      _tile_data dst, _tile_data src1,
>                                      _tile_data src2);
>
>     _tile_data _tile_dpbf16ps_internal(char m, short n, short k,
>                                        _tile_data dst, _tile_data src1,
>                                        _tile_data src2);
>
>     void _tile_stored_internal(char m, short n, void *base, int stride,
>                                _tile_data tile);
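
The "cannot exceed the size that the AMX hardware supports" constraint can be
sketched as a small shape check. Under palette 1 a tile has at most 16 rows of
at most 64 bytes each (1 KB in total). The helper names below are
hypothetical, and the reading that n and k are byte counts per row, with the
operand shapes given in the comment, is an assumption of this sketch rather
than something the proposal states.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
    AMX_MAX_ROWS  = 16, /* palette 1: rows per tile      */
    AMX_MAX_COLSB = 64  /* palette 1: bytes per tile row */
};

/* True if a tile of `rows` rows by `colsb` bytes per row fits in palette 1. */
bool tile_shape_fits(uint16_t rows, uint16_t colsb) {
    return rows >= 1 && rows <= AMX_MAX_ROWS &&
           colsb >= 1 && colsb <= AMX_MAX_COLSB;
}

/* Checks the three tiles of a dot product such as tdpbssd, assuming
 *   dst  is m rows   x n bytes (n/4 32-bit accumulators per row),
 *   src1 is m rows   x k bytes (k 8-bit elements per row),
 *   src2 is k/4 rows x n bytes (4 packed 8-bit elements per 32-bit column). */
bool dot_product_shapes_fit(uint16_t m, uint16_t n, uint16_t k) {
    if (n % 4 != 0 || k % 4 != 0)
        return false;
    return tile_shape_fits(m, n) && tile_shape_fits(m, k) &&
           tile_shape_fits((uint16_t)(k / 4), n);
}
```

For example, `dot_product_shapes_fit(16, 64, 64)` accepts the largest shape,
while any m above 16 or any n or k above 64 is rejected.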
>
> 3. User interfaces.
>
> In C, the tile shape and the tile data are combined into a struct.
> The shape of a tile may only be initialized once. The user interface
> looks like this.
>
>     #define __DEFAULT_FN_AMX \
>       __attribute__((__always_inline__, __nodebug__, \
>                      __target__("amx-int8")))
>
>     /* Shape (row, col) is const: it is set once, at initialization. */
>     typedef struct __tile_str {
>       const char row;
>       const short col;
>       _tile_data tile;
>     } __tile;
>
>     /* Load a tile of shape dst->row x dst->col from memory. */
>     __DEFAULT_FN_AMX
>     void __tile_loadd(__tile *dst, const void *base, long stride) {
>       dst->tile = _tile_loadd_internal(dst->row, dst->col, base, stride);
>     }
>
>     /* Dot product: m = src1.row, n = src2.col, k = src1.col. */
>     __DEFAULT_FN_AMX
>     void __tile_dpbsud(__tile *dst, __tile src1, __tile src2) {
>       dst->tile = _tile_dpbssd_internal(src1.row, src2.col, src1.col,
>                                         dst->tile, src1.tile, src2.tile);
>     }
>
>     /* Store a tile of shape src.row x src.col to memory. */
>     __DEFAULT_FN_AMX
>     void __tile_stored(void *base, long stride, __tile src) {
>       _tile_stored_internal(src.row, src.col, base, stride, src.tile);
>     }
>
> 4. Example code
>
> The example shows how to use the user interface in a function.
>
>     void api(int cond, short row, short col) {
>       __tile a = {row, col};
>       __tile b = {row, col};
>       __tile c = {row, col};
>
>       if(cond) {
>         __tile_loadd(&a, buf, STRIDE);
>         __tile_loadd(&b, buf, STRIDE);
>         __tile_loadd(&c, buf, STRIDE);
>       } else {
>         __tile_loadd(&a, buf2, STRIDE);
>         __tile_loadd(&b, buf2, STRIDE);
>         __tile_loadd(&c, buf2, STRIDE);
>       }
>       __tile_dpbsud(&c, a, b);
>       __tile_stored(buf, STRIDE, c);
>     }
>
> 5. LLVM IR
>
> The LLVM IR intrinsics take the row and column information as input
> parameters, so that the compiler can deduce the shape of the tile
> data. The remaining parameters are what the AMX instructions require.
> This is the LLVM IR corresponding to the example code.
>
>     define dso_local void @api(i32 %cond, i16 signext %row,
>         i16 signext %col) local_unnamed_addr #2 {
>     entry:
>       %tobool = icmp eq i32 %cond, 0
>       %sext = shl i16 %col, 8
>       %conv.i31 = ashr exact i16 %sext, 8
>       br i1 %tobool, label %if.else, label %if.then
>
>     if.then:                                      ; preds = %entry
>       %0 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row,
>           i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8],
>           [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
>       %1 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row,
>           i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8],
>           [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
>       %2 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row,
>           i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8],
>           [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
>       br label %if.end
>
>     if.else:                                      ; preds = %entry
>       %3 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row,
>           i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8],
>           [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
>       %4 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row,
>           i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8],
>           [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
>       %5 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row,
>           i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8],
>           [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
>       br label %if.end
>
>     if.end:                                       ; preds = %if.else, %if.then
>       %a.sroa.1186.0 = phi <256 x i32> [ %3, %if.else ], [ %0, %if.then ]
>       %b.sroa.1068.0 = phi <256 x i32> [ %4, %if.else ], [ %1, %if.then ]
>       %c.sroa.1149.0 = phi <256 x i32> [ %5, %if.else ], [ %2, %if.then ]
>       %6 = tail call <256 x i32> @llvm.x86.tdpbssd(i16 %row,
>           i16 %conv.i31, i16 %conv.i31, <256 x i32> %c.sroa.1149.0,
>           <256 x i32> %a.sroa.1186.0, <256 x i32> %b.sroa.1068.0) #3
>       tail call void @llvm.x86.tilestored64(i16 %row, i16 %conv.i31,
>           i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf,
>           i64 0, i64 0), i64 32, <256 x i32> %6) #3
>       ret void
>     }
>
> 6. Shape propagation
>
> In an -O0 build, the front end generates some general loads and
> stores of tile vectors. We need to start from the AMX intrinsics and
> propagate the shape information to the virtual tile registers. If an
> AMX intrinsic uses the result of a load instruction, the shape is
> propagated to the load, and the load is transformed into a tile load
> intrinsic. If a store instruction uses the result of an AMX
> intrinsic, the shape is propagated to the store instruction, and the
> store is transformed into a tile store intrinsic.
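
The two rewrite rules above can be sketched on a toy value list in C. This is
only an illustration under simplifying assumptions: each node has at most one
operand, identified by index, and carries a single (row, col) shape, whereas
real intrinsics have several operands with different shapes. `struct ir_node`,
the `ir_kind` values, and `propagate_shapes` are hypothetical names, not part
of LLVM.

```c
#include <stddef.h>

enum ir_kind {
    IR_PLAIN_LOAD,  /* generic vector load emitted by the front end  */
    IR_PLAIN_STORE, /* generic vector store emitted by the front end */
    IR_AMX,         /* AMX intrinsic; already carries a shape        */
    IR_TILE_LOAD,   /* a plain load rewritten to a tile load         */
    IR_TILE_STORE   /* a plain store rewritten to a tile store       */
};

struct ir_node {
    enum ir_kind kind;
    int operand;  /* index of the value this node consumes, or -1 */
    int row, col; /* shape; 0 means "not known yet"               */
};

/* Rewrites loads that feed AMX intrinsics and stores of AMX results,
 * copying the shape from the intrinsic. Returns the number of rewrites. */
size_t propagate_shapes(struct ir_node *nodes, size_t n) {
    size_t rewritten = 0;
    for (size_t i = 0; i < n; ++i) {
        int op = nodes[i].operand;
        if (op < 0 || (size_t)op >= n)
            continue;
        if (nodes[i].kind == IR_AMX && nodes[op].kind == IR_PLAIN_LOAD) {
            /* An AMX intrinsic uses a plain load: make it a tile load. */
            nodes[op].kind = IR_TILE_LOAD;
            nodes[op].row = nodes[i].row;
            nodes[op].col = nodes[i].col;
            ++rewritten;
        } else if (nodes[i].kind == IR_PLAIN_STORE &&
                   nodes[op].kind == IR_AMX) {
            /* A plain store uses an AMX result: make it a tile store. */
            nodes[i].kind = IR_TILE_STORE;
            nodes[i].row = nodes[op].row;
            nodes[i].col = nodes[op].col;
            ++rewritten;
        }
    }
    return rewritten;
}
```

Loads and stores that never touch an AMX intrinsic are left as plain vector
operations.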
>
> 7. Machine IR
>
> Since the AMX intrinsics take the row and column as input
> parameters, we can create a pseudo instruction for each of them. The
> AMX intrinsics are lowered to pseudo AMX instructions, which have
> extra row and column operands corresponding to those of the AMX
> intrinsics. The real AMX instructions don’t need the row and column
> operands. The row and column information should be configured by
> ldtilecfg before any AMX instruction is executed.
>
> 8. Register allocation
>
> AMX registers are special. They need to be configured before use,
> and the config instruction is expensive. To avoid unnecessary tile
> configuration, we collect as much tile shape information as possible
> and combine it into one ldtilecfg instruction. The ldtilecfg
> instruction should dominate any AMX instruction that accesses a tile
> register. On the other hand, the ldtilecfg should post-dominate the
> instructions that define the tile shapes. When a tile register is
> spilled, we should avoid re-configuring because of a different tile
> shape: the spilled register should be reloaded into a register that
> has the same tile shape. Since tile register allocation is special
> and may allocate general virtual registers to configure the tile
> registers, we can add a separate pass that does it before the
> general register allocation pass. After register allocation, the
> tile shape information is not needed anymore, so we can transform
> the pseudo AMX instructions into real AMX instructions by removing
> the row and column operands.
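
As a sketch of what "combine them into one ldtilecfg instruction" amounts to
at the data level, the C fragment below fills the 64-byte memory operand that
ldtilecfg reads, using the palette 1 layout from the instruction manual. The
struct and function names are hypothetical, and the fragment only builds the
buffer; it does not execute any AMX instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 64-byte memory operand of ldtilecfg (palette 1 layout). */
struct tile_config {
    uint8_t  palette_id;   /* byte 0: 1 selects palette 1               */
    uint8_t  start_row;    /* byte 1                                    */
    uint8_t  reserved[14]; /* bytes 2..15, must be zero                 */
    uint16_t colsb[16];    /* bytes 16..47: bytes per row, tiles 0..7   */
    uint8_t  rows[16];     /* bytes 48..63: row count, tiles 0..7       */
};

_Static_assert(sizeof(struct tile_config) == 64, "config must be 64 bytes");
_Static_assert(offsetof(struct tile_config, colsb) == 16, "colsb offset");
_Static_assert(offsetof(struct tile_config, rows) == 48, "rows offset");

/* Shape of one physical tile register (hypothetical helper type). */
struct tile_shape {
    uint8_t  rows;
    uint16_t colsb;
};

/* Combines the shapes of up to 8 physical tile registers into one config;
 * entry i of `shapes` describes tile register i. */
void build_tile_config(struct tile_config *cfg,
                       const struct tile_shape *shapes, size_t count) {
    assert(count <= 8);
    memset(cfg, 0, sizeof *cfg); /* unused tiles and reserved bytes stay 0 */
    cfg->palette_id = 1;
    for (size_t i = 0; i < count; ++i) {
        cfg->rows[i] = shapes[i].rows;
        cfg->colsb[i] = shapes[i].colsb;
    }
}
```

On AMX hardware the resulting buffer would then be handed to ldtilecfg, for
instance through the `_tile_loadconfig` intrinsic from `<immintrin.h>`.
Because every physical tile register gets exactly one shape per configuration,
a spilled tile has to be reloaded into a register whose entry has the same
rows and colsb.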
>
> This seems complicated.
>
> Reading through the documentation, there appears to be a single global
> tile config for all tile registers at any time.
>
> Why not simply model this tile config as a designated special register
> and the tile instructions as having an implicit use of this register?
> That would seem to ensure that the register allocator has all the
> constraints needed. You'd need to teach it how to spill the special
> registers with the appropriate instructions, but that seems a lot more
> straightforward?
>
> 9. Use recommendation
>
> Because of the shape configuration issue, we recommend that users
> define the tile shapes at function entry and inline functions as
> much as possible. The AMX instructions focus on computation rather
> than storage, so global variables for tile data are not recommended.
>
> Thanks
>
> Yuanke