[llvm-dev] Intel AMX programming model discussion.
Hal Finkel via llvm-dev
llvm-dev at lists.llvm.org
Fri Aug 14 08:26:56 PDT 2020
On 8/14/20 8:27 AM, Luo, Yuanke via llvm-dev wrote:
>
> Hi,
>
> Intel Advanced Matrix Extensions (Intel AMX) is a new programming
> paradigm consisting of two components: a set of 2-dimensional
> registers (tiles) representing sub-arrays from a larger 2-dimensional
> memory image, and accelerators able to operate on tiles. The
> capabilities of an Intel AMX implementation are enumerated by
> palettes. Two palettes are supported: palette 0 represents the
> initialized state, and palette 1 consists of 8 tile registers of up
> to 1 KB each, controlled by a tile control register.
>
> The instruction manual is posted at
> https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
>
> The AMX ABI proposal is posted at
> https://groups.google.com/g/x86-64-abi/c/NRejFm7pwb4
>
> This email is to discuss the programming model for AMX. Florian has
> introduced the matrix type and intrinsics to the LLVM community.
> We’d like to adopt some ideas from that work.
>
> Here is what we propose for the AMX programming model.
>
> 1. Data type.
>
> We’d like to have a fixed vector type for AMX. Since the shape of an
> AMX register is configurable, the vector size is the maximum size of
> an AMX register, that is, 1024 bytes.
>
> The C code may look like this.
>
> typedef int _tile_data __attribute__((__vector_size__(1024),
> __aligned__(64)));
>
> _tile_data tile;
>
> And the LLVM IR may look like this.
>
> @tile = dso_local local_unnamed_addr global <256 x i32>
> zeroinitializer, align 64
>
> For LLVM IR, it would be nice to have a new type, x86_amxtile, that
> maps to AMX registers.
>
> 2. AMX intrinsics.
>
> The internal intrinsics map 1:1 to AMX instructions. The parameters
> m, n, and k identify the shape of the tile. The shape can be
> variable, but it cannot exceed the size that the AMX hardware
> supports. The compiler can deduce the shape of a tile from the AMX
> intrinsics.
>
> _tile_data _tile_loadd_internal(char m, short n, const void *base,
>                                 int stride);
> _tile_data _tile_dpbssd_internal(char m, short n, short k,
>                                  _tile_data dst, _tile_data src1,
>                                  _tile_data src2);
> _tile_data _tile_dpbf16ps_internal(char m, short n, short k,
>                                    _tile_data dst, _tile_data src1,
>                                    _tile_data src2);
> void _tile_stored_internal(char m, short n, void *base, int stride,
>                            _tile_data tile);
>
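To make the role of m, n and k concrete, here is a minimal portable C reference model of what a signed-by-signed byte dot-product accumulate computes. It is only a sketch: the name ref_tdpbssd and the array layout are assumptions for illustration, not part of the proposal. It assumes m counts rows, n counts 32-bit columns of the destination, and k counts bytes per row of the first source (a multiple of 4); the proposal itself does not state the units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model, not an AMX intrinsic.
 *   dst : m rows x n int32 accumulators, row-major
 *   a   : m rows x k int8 values, row-major
 *   b   : (k / 4) rows x (4 * n) int8 values, row-major; each group of
 *         four consecutive bytes feeds one destination column
 */
static void ref_tdpbssd(size_t m, size_t n, size_t k,
                        int32_t *dst, const int8_t *a, const int8_t *b)
{
    assert(k % 4 == 0);
    for (size_t i = 0; i < m; ++i)
        for (size_t kk = 0; kk < k / 4; ++kk)
            for (size_t j = 0; j < n; ++j)
                for (size_t t = 0; t < 4; ++t)
                    dst[i * n + j] += (int32_t)a[i * k + 4 * kk + t] *
                                      (int32_t)b[kk * 4 * n + 4 * j + t];
}
```

The three sizes fully determine how much of each 1 KB tile is touched, which is why a compiler that sees them on every intrinsic can recover the tile shapes.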
> 3. User interfaces.
>
> The tile shape and tile data are combined into a struct in C. The
> shape of a tile may be initialized only once. The user interface
> looks like this.
>
> #define __DEFAULT_FN_AMX \
>   __attribute__((__always_inline__, __nodebug__, __target__("amx-int8")))
>
> typedef struct __tile_str {
>   const char row;
>   const short col;
>   _tile_data tile;
> } __tile;
>
This interface looks convenient, but what happens if one of these types
appears on a function-call boundary? Does this force everything to be
spilled and restored from the stack? Maybe this type needs some
additional attribute to give it a custom register-passing convention?
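To make the concern concrete, here is a minimal sketch of how large the by-value object is. It assumes a GCC- or Clang-compatible compiler with vector extensions, and tile_with_shape and tile_by_value_bytes are stand-in names, not part of the proposal. Under the x86-64 System V ABI an aggregate this large is classified as MEMORY, so a by-value argument is copied through the stack unless the call is inlined.

```c
#include <assert.h>
#include <stddef.h>

/* Same vector type as in the proposal: 1024 bytes, 64-byte aligned. */
typedef int tile_data __attribute__((__vector_size__(1024), __aligned__(64)));

/* Stand-in for the proposed __tile struct (the name is hypothetical). */
typedef struct {
    const char row;
    const short col;
    tile_data tile;
} tile_with_shape;

/* Bytes that must be copied when one such object is passed by value. */
static size_t tile_by_value_bytes(void)
{
    return sizeof(tile_with_shape);
}
```

On a typical x86-64 target I would expect this to report 1088 bytes: 64 bytes for the shape fields plus padding, followed by the 1024-byte payload.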
> __DEFAULT_FN_AMX
> void __tile_loadd(__tile *dst, const void *base, long stride) {
>   dst->tile = _tile_loadd_internal(dst->row, dst->col, base, stride);
> }
>
> __DEFAULT_FN_AMX
> void __tile_dpbsud(__tile *dst, __tile src1, __tile src2) {
>   dst->tile = _tile_dpbssd_internal(src1.row, src2.col, src1.col,
>                                     dst->tile, src1.tile, src2.tile);
> }
>
> __DEFAULT_FN_AMX
> void __tile_stored(void *base, long stride, __tile src) {
>   _tile_stored_internal(src.row, src.col, base, stride, src.tile);
> }
>
> 4. Example code
>
> The example shows how to use the user interface in a function.
>
> void api(int cond, short row, short col) {
>   __tile a = {row, col};
>   __tile b = {row, col};
>   __tile c = {row, col};
>
>   if (cond) {
>     __tile_loadd(&a, buf, STRIDE);
>     __tile_loadd(&b, buf, STRIDE);
>     __tile_loadd(&c, buf, STRIDE);
>   } else {
>     __tile_loadd(&a, buf2, STRIDE);
>     __tile_loadd(&b, buf2, STRIDE);
>     __tile_loadd(&c, buf2, STRIDE);
>   }
>   __tile_dpbsud(&c, a, b);
>   __tile_stored(buf, STRIDE, c);
> }
>
> 5. LLVM IR
>
> The LLVM IR intrinsics take the row and column as input parameters so
> that the compiler can deduce the shape of the tile data. The
> remaining parameters are those the AMX instructions require. This is
> the LLVM IR corresponding to the example code.
>
> define dso_local void @api(i32 %cond, i16 signext %row, i16 signext
>     %col) local_unnamed_addr #2 {
> entry:
>   %tobool = icmp eq i32 %cond, 0
>   %sext = shl i16 %col, 8
>   %conv.i31 = ashr exact i16 %sext, 8
>   br i1 %tobool, label %if.else, label %if.then
>
> if.then:                                   ; preds = %entry
>   %0 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16
>     %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]*
>     @buf, i64 0, i64 0), i64 32) #3
>   %1 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16
>     %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]*
>     @buf, i64 0, i64 0), i64 32) #3
>   %2 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16
>     %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]*
>     @buf, i64 0, i64 0), i64 32) #3
>   br label %if.end
>
> if.else:                                   ; preds = %entry
>   %3 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16
>     %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]*
>     @buf2, i64 0, i64 0), i64 32) #3
>   %4 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16
>     %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]*
>     @buf2, i64 0, i64 0), i64 32) #3
>   %5 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16
>     %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]*
>     @buf2, i64 0, i64 0), i64 32) #3
>   br label %if.end
>
> if.end:                                    ; preds = %if.else, %if.then
>   %a.sroa.1186.0 = phi <256 x i32> [ %3, %if.else ], [ %0, %if.then ]
>   %b.sroa.1068.0 = phi <256 x i32> [ %4, %if.else ], [ %1, %if.then ]
>   %c.sroa.1149.0 = phi <256 x i32> [ %5, %if.else ], [ %2, %if.then ]
>   %6 = tail call <256 x i32> @llvm.x86.tdpbssd(i16 %row, i16
>     %conv.i31, i16 %conv.i31, <256 x i32> %c.sroa.1149.0, <256 x i32>
>     %a.sroa.1186.0, <256 x i32> %b.sroa.1068.0) #3
>   tail call void @llvm.x86.tilestored64(i16 %row, i16 %conv.i31,
>     i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0,
>     i64 0), i64 32, <256 x i32> %6) #3
>   ret void
> }
>
> 6. Shape propagation
>
> In an -O0 build, the front end generates some generic loads and
> stores of the tile vector. We need to start from the AMX intrinsics
> and propagate the shape information to the virtual tile registers.
> If an AMX intrinsic uses the result of a load instruction, the shape
> is propagated to the load, and the load is transformed into a tile
> load intrinsic. If a store instruction uses the result of an AMX
> intrinsic, the shape is propagated to the store, and the store is
> transformed into a tile store intrinsic.
>
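The propagation rule described above can be sketched on a toy instruction list. This is only an illustration under simplifying assumptions: every name here (op_kind, inst, propagate_shapes) is hypothetical, each instruction has a single tile operand, and one shape stands in for the separate shapes a real intrinsic has for each operand. It is not how the LLVM pass is or would be written.

```c
#include <assert.h>
#include <stddef.h>

/* Toy instruction model; all names here are hypothetical. */
enum op_kind { OP_LOAD, OP_STORE, OP_AMX, OP_TILE_LOAD, OP_TILE_STORE };

struct inst {
    enum op_kind kind;
    int src;        /* index of the instruction producing the tile operand, or -1 */
    int rows, cols; /* tile shape; 0 means "not known yet" */
};

/* Propagate shapes outward from AMX intrinsics, which act as the roots. */
static void propagate_shapes(struct inst *code, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        struct inst *cur = &code[i];
        if (cur->src < 0)
            continue;
        assert((size_t)cur->src < n);
        struct inst *def = &code[cur->src];

        /* An AMX intrinsic consumes a generic load: make it a tile load. */
        if (cur->kind == OP_AMX && def->kind == OP_LOAD) {
            def->kind = OP_TILE_LOAD;
            def->rows = cur->rows;
            def->cols = cur->cols;
        }
        /* A generic store consumes an AMX result: make it a tile store. */
        if (cur->kind == OP_STORE && def->kind == OP_AMX) {
            cur->kind = OP_TILE_STORE;
            cur->rows = def->rows;
            cur->cols = def->cols;
        }
    }
}
```

A generic load or store that is not connected to any AMX intrinsic is left untouched, since nothing gives it a shape.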
> 7. Machine IR
>
> Since the AMX intrinsics take the row and column as input
> parameters, we can create a pseudo instruction for each of them. The
> AMX intrinsics are lowered to pseudo AMX instructions that carry
> extra row and column operands taken from the intrinsic. The real AMX
> instructions don’t need the row and column operands; the row and
> column information must be configured by ldtilecfg before any AMX
> instruction executes.
>
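For reference, the row and column information ends up in the 64-byte memory image that ldtilecfg reads. The sketch below follows the layout as I read it in the instruction set reference (palette byte, start row, reserved bytes, 16-bit bytes-per-row entries, then 8-bit row counts). The type and helper names (tile_config, tile_config_init, tile_config_set) are made up for illustration, and rejecting a zero-sized shape is a choice of this sketch, not a hardware rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 64-byte memory image read by ldtilecfg (palette 1 uses tiles 0..7). */
typedef struct {
    uint8_t  palette_id;   /* byte 0 */
    uint8_t  start_row;    /* byte 1 */
    uint8_t  reserved[14]; /* bytes 2..15, must be zero */
    uint16_t colsb[16];    /* bytes 16..47, bytes per row for each tile */
    uint8_t  rows[16];     /* bytes 48..63, row count for each tile */
} tile_config;

enum { PALETTE1_TILES = 8, PALETTE1_MAX_ROWS = 16, PALETTE1_MAX_COLSB = 64 };

/* Start from an all-zero image that selects palette 1. */
static void tile_config_init(tile_config *cfg)
{
    memset(cfg, 0, sizeof *cfg);
    cfg->palette_id = 1;
}

/* Hypothetical helper: record one tile's shape.
 * Returns 0 on success, -1 if the shape exceeds the palette 1 limits. */
static int tile_config_set(tile_config *cfg, unsigned tile,
                           unsigned rows, unsigned colsb)
{
    if (tile >= PALETTE1_TILES || rows == 0 || rows > PALETTE1_MAX_ROWS ||
        colsb == 0 || colsb > PALETTE1_MAX_COLSB)
        return -1;
    cfg->rows[tile] = (uint8_t)rows;
    cfg->colsb[tile] = (uint16_t)colsb;
    return 0;
}
```

Because one image describes all eight tiles at once, the extra row and column operands on the pseudo instructions are exactly the data a later pass needs to fill it in.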
> 8. Register allocation
>
> AMX registers are special. They need to be configured before use,
> and the config instruction is expensive. To avoid unnecessary tile
> configuration, we collect as much tile shape information as possible
> and combine it into one ldtilecfg instruction. The ldtilecfg
> instruction should dominate every AMX instruction that accesses a
> tile register. Conversely, the ldtilecfg should post-dominate the
> instructions that define the tile shapes. When a tile register is
> spilled, we should avoid reconfiguration caused by a different tile
> shape: the spilled value should be reloaded into a register that has
> the same tile shape. Since tile register allocation is special and
> may need to allocate general virtual registers to configure the tile
> registers, we can add a separate pass that runs before the general
> register allocation pass. After register allocation, the tile shape
> information is no longer needed, so we can transform each pseudo AMX
> instruction into the real AMX instruction by removing the row and
> column operands.
>
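The reload constraint (a spilled tile must come back into a register with the same configured shape) can be sketched as a simple search. This is only an illustration with hypothetical names (tile_shape, pick_reload_tile); it assumes eight physical tiles and a bitmask of free registers, and it says nothing about how the proposed pass would actually choose.

```c
#include <assert.h>
#include <stdint.h>

struct tile_shape {
    uint8_t  rows;  /* configured row count */
    uint16_t colsb; /* configured bytes per row */
};

/* Hypothetical helper: choose a physical tile for reloading a spilled value.
 *   configured : shape of each of the 8 physical tiles under the current
 *                ldtilecfg
 *   free_mask  : bit i is set when physical tile i is free
 * Returns the lowest matching tile index, or -1 when no free tile has the
 * wanted shape (which would force a reconfiguration). */
static int pick_reload_tile(const struct tile_shape configured[8],
                            unsigned free_mask, struct tile_shape want)
{
    for (int i = 0; i < 8; ++i) {
        if ((free_mask & (1u << i)) &&
            configured[i].rows == want.rows &&
            configured[i].colsb == want.colsb)
            return i;
    }
    return -1;
}
```

A result of -1 is the case the proposal wants to avoid, since the only way forward would be another ldtilecfg.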
Can you take advantage of our IPRA capability so that internal function
calls might avoid this reconfiguration if the necessary configuration is
always done in the caller?
How will the implementation of __builtin_setjmp/longjmp be affected?
Thanks again,
Hal
> 9. Use recommendation
>
> Because of the shape configuration issue, we recommend that users
> define the tile shape at function entry and inline functions as much
> as possible. The AMX instructions focus on computation rather than
> storage, so using global variables for tile data is not recommended.
>
> Thanks
>
> Yuanke
>
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory