[llvm-dev] Intel AMX programming model discussion.
Hal Finkel via llvm-dev
llvm-dev at lists.llvm.org
Wed Aug 19 01:57:35 PDT 2020
On 8/19/20 2:21 AM, Luo, Yuanke wrote:
>
> Hi Hal,
>
> There are three aspects to be solved.
>
> 1. The HW supports a maximum shape of 16x16, so there are many register
> classes, from 1x1 to 16x16. We would need 256 register classes.
>
> 2. We want to support variable shapes, so the compiler doesn’t know which
> register class fits a tile shape, as the shape is only known at runtime.
>
> 3. The tile configuration configures the physical tile registers, so we
> need to allocate registers first; only then do we know the shape of each
> physical tile register and can configure it.
>
> I think your suggestion is helpful to reduce the complexity if we only
> support fixed (constant) tile shapes.
>
> -Yuanke
>
Thanks, Yuanke.
It's not clear to me that having 256 register classes is, in itself, a
problem. Is it?
What does it mean to support variable-shape tiles in this context? Do
you do something other than conservatively assume that they are 16x16
for register-allocation purposes?
-Hal
> *From:* Hal Finkel <hfinkel at anl.gov>
> *Sent:* Wednesday, August 19, 2020 8:20 AM
> *To:* Kaylor, Andrew <andrew.kaylor at intel.com>; Philip Reames
> <listmail at philipreames.com>; Luo, Yuanke <yuanke.luo at intel.com>;
> llvm-dev at lists.llvm.org; florian_hahn at apple.com; Topper, Craig
> <craig.topper at intel.com>; Lu, Hongjiu <hongjiu.lu at intel.com>
> *Subject:* Re: [llvm-dev] Intel AMX programming model discussion.
>
> Hi, Andy,
>
> I don't quite understand everything that's going on here. Could we
> model this as:
>
> 1. Define a collection of register classes, one for 2x4 tiles, one
> for 4x2 tiles, etc. each populated with a set of tile registers.
> Registers can have aliasing relationships (instead of worrying about any
> kind of subregister/superregister relationships -- these won't be
> useful anyway).
>
> 2. Define the tile-configuration instructions so that they implicitly
> define all of the registers in all of the classes.
>
> Then you would still need to pre-schedule the tile operations as
> you've described, and collect the configuration information in order
> to add the ldtilecfgs, but the regular register allocator can handle
> the allocation itself in the usual way. What do you think?
>
> -Hal
>
> On 8/18/20 6:58 PM, Kaylor, Andrew via llvm-dev wrote:
>
> The AMX registers are complicated. The single configuration
> register (which is mostly used implicitly, similar to MXCSR for
> floating point) controls the shape of all the tile registers, and
> if you change the tile configuration every single tile register is
> cleared. In practice, if we have to change the configuration
> while any of the tile registers are live, performance is going to
> be terrible. We need to handle this case for correctness, but
> users of this programming interface will need to have enough
> awareness of the performance issues and the hardware details to
> prevent this. We’ll also want a diagnostic that lets the user know
> when this has happened.
>
> When the tile configuration is set, the shape of each tile is
> locked in, so the individual tile registers aren’t interchangeable
> at that point. If a function needs 2x4 tiles, 4x2 tiles, and 4x4
> tiles, the configuration needs to be set with this in mind. The
> shape isn’t explicit in every instruction and intrinsic. It must
> be deduced. And again, we’ll need a way to tell the user when
> efficient allocation can’t be done. In practice, I don’t expect
> any function to be using more than three tile shapes.
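
To make the "one configuration for all tiles" point concrete, here is a minimal C sketch of the 64-byte memory image that ldtilecfg reads, following the palette 1 layout given in the Intel programming reference. The names tile_config, tile_config_init and tile_config_set are hypothetical and chosen only for illustration. The sketch only fills in the image; it does not execute ldtilecfg.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 64-byte memory image consumed by ldtilecfg (palette 1 layout). */
typedef struct {
    uint8_t  palette_id;    /* byte 0: 1 selects palette 1 */
    uint8_t  start_row;     /* byte 1: restart row for interrupted loads/stores */
    uint8_t  reserved[14];  /* bytes 2-15: must be zero */
    uint16_t colsb[16];     /* bytes 16-47: bytes per row; entries 0-7 are used */
    uint8_t  rows[16];      /* bytes 48-63: row count; entries 0-7 are used */
} tile_config;

static_assert(sizeof(tile_config) == 64, "ldtilecfg reads exactly 64 bytes");

/* Hypothetical helper: start from an all-zero image with palette 1 selected. */
void tile_config_init(tile_config *cfg) {
    memset(cfg, 0, sizeof *cfg);
    cfg->palette_id = 1;
}

/* Hypothetical helper: record the shape of one physical tile register.
   Returns 0 on success, -1 if the tile index or the shape is out of range. */
int tile_config_set(tile_config *cfg, int tile, int rows, int colsb) {
    if (tile < 0 || tile >= 8 || rows < 1 || rows > 16 ||
        colsb < 1 || colsb > 64)
        return -1;
    cfg->rows[tile]  = (uint8_t)rows;
    cfg->colsb[tile] = (uint16_t)colsb;
    return 0;
}
```

A function that needs three shapes would call tile_config_set once per physical tile and then issue a single ldtilecfg on the resulting image. How a shape such as "2x4" maps onto rows and bytes per row is not spelled out above, so any concrete values passed here are an assumption.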
>
> The implication of all this is that I don’t think the greedy
> register allocator is well suited to figure all of this out. We
> need a special pass to pre-allocate these registers. If the
> function is written in a way that makes good performance possible,
> it should be a relatively simple task to allocate everything with
> minimal spilling. If it isn’t possible to get good performance, we
> don’t need to do anything especially clever. We can just do
> something straightforward that is correct and let the user know
> that they aren’t going to be happy with the results.
>
> -Andy
>
> *From:* Philip Reames <listmail at philipreames.com>
> *Sent:* Friday, August 14, 2020 8:29 PM
> *To:* Luo, Yuanke <yuanke.luo at intel.com>; llvm-dev at lists.llvm.org;
> florian_hahn at apple.com; Kaylor, Andrew <andrew.kaylor at intel.com>;
> Topper, Craig <craig.topper at intel.com>; Lu, Hongjiu
> <hongjiu.lu at intel.com>
> *Subject:* Re: [llvm-dev] Intel AMX programming model discussion.
>
> I find your answer unconvincing. I'm not going to debate it as I
> don't wish to take the time to build the appropriate context, but
> my initial response is skepticism.
>
> Philip
>
> On 8/14/20 4:49 PM, Luo, Yuanke wrote:
>
> [Yuanke] AMX registers are special. They need to be configured
> before use, and the config instruction is expensive. To avoid
> unnecessary tile configuration, we collect as much tile shape
> information as possible and combine it into one
> ldtilecfg instruction. The ldtilecfg instruction should
> dominate any AMX instruction that accesses a tile register. On the
> other hand, the ldtilecfg should post-dominate the
> instructions that define the tile shapes. For tile register
> spills, to avoid re-configuration due to a different tile
> shape, the spilled register should be reloaded into a register
> that has the same tile shape. Since tile register allocation
> is special and may allocate general virtual registers to
> configure the tile registers, we can add a separate pass to do it
> before the general register allocation pass. After register
> allocation, the tile shape information is no longer needed,
> so we can transform the pseudo AMX instructions into real AMX
> instructions by removing the row and column operands.
>
> [Philip]
>
> This seems complicated.
>
> Reading through the documentation, there appears to be a
> single global tile config for all tile registers at any time.
>
> Why not simply model this tile config as a designated special
> register and the tile instructions as having an implicit use
> of this register? That would seem to ensure that the register
> allocator has all the constraints needed. You'd need to teach
> it how to spill the special registers with the appropriate
> instructions, but that seems a lot more straightforward?
>
> [Yuanke] In that case users would need to configure the tile registers
> themselves. Spilling the configuration register is very expensive,
> because it clears all the tile data registers to zero. In our
> proposal, the compiler is responsible for deducing the shape of each
> virtual tile data register, allocating physical registers for
> them, and then configuring those physical registers. We may build
> the dependency as you proposed, and it can be used for machine
> IR checks to ensure a tile data register is configured before use.
>
> *From:* Philip Reames <listmail at philipreames.com>
> *Sent:* Saturday, August 15, 2020 1:17 AM
> *To:* Luo, Yuanke <yuanke.luo at intel.com>; llvm-dev at lists.llvm.org;
> florian_hahn at apple.com; Kaylor, Andrew <andrew.kaylor at intel.com>;
> Topper, Craig <craig.topper at intel.com>; Lu, Hongjiu
> <hongjiu.lu at intel.com>
> *Subject:* Re: [llvm-dev] Intel AMX programming model discussion.
>
> On 8/14/20 6:27 AM, Luo, Yuanke via llvm-dev wrote:
>
> Hi,
>
> Intel Advanced Matrix Extensions (Intel AMX) is a new
> programming paradigm consisting of two components: a set
> of 2-dimensional registers (tiles) representing sub-arrays
> from a larger 2-dimensional memory image, and accelerators
> able to operate on tiles. Capability of Intel AMX
> implementation is enumerated by palettes. Two palettes are
> supported: palette 0 represents the initialized state and
> palette 1 consists of 8 tile registers of up to 1 KB size,
> which is controlled by a tile control register.
>
> The instruction manual is posted at
> https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
>
> The AMX ABI proposal is posted at
> https://groups.google.com/g/x86-64-abi/c/NRejFm7pwb4
>
> This email is to discuss the programming model for AMX.
> Florian has introduced the matrix type and intrinsics to
> the LLVM community. We’d like to adopt some ideas from it.
>
> Here is what we propose for the AMX programming model.
>
> 1. Data type.
>
> We’d like to have a fixed vector type for AMX. Since the
> shape of an AMX register is configurable, the vector size
> is the maximum size of an AMX register. That means the vector
> size is 1024 bytes.
>
> The C code may look like this.
>
> typedef int _tile_data
> __attribute__((__vector_size__(1024), __aligned__(64)));
>
> _tile_data tile;
>
> And the LLVM IR may look like this.
>
> @tile = dso_local local_unnamed_addr global <256 x i32>
> zeroinitializer, align 64
>
> For LLVM IR, it would be nice to have a new type x86_amxtile
> that can be mapped to AMX registers.
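
As a quick sanity check on the sizes in section 1, the following sketch reuses the proposed typedef and confirms that it is 1024 bytes, i.e. 256 int elements, which matches the <256 x i32> in the IR. It assumes a GCC- or Clang-compatible compiler that supports the vector_size attribute and a 4-byte int; the two helper functions are hypothetical and exist only so the values can be inspected.

```c
#include <assert.h>
#include <stddef.h>

/* Same typedef as in the proposal: a 1024-byte vector of int. */
typedef int _tile_data
    __attribute__((__vector_size__(1024), __aligned__(64)));

/* The maximum tile is 16 rows of 64 bytes each. */
static_assert(sizeof(_tile_data) == 16 * 64, "tile data must be 1 KB");

/* Hypothetical helpers, for illustration only. */
size_t tile_data_bytes(void) { return sizeof(_tile_data); }
size_t tile_data_elems(void) { return sizeof(_tile_data) / sizeof(int); }
```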
>
> 2. AMX Intrinsics.
>
> The internal intrinsics are mapped 1:1 to AMX
> instructions. The parameters m, n, and k identify the shape
> of the tile. The shape can be variable, but it cannot
> exceed the size that the AMX HW supports. The compiler can
> deduce the shape of the tile from the AMX intrinsics.
>
> _tile_data _tile_loadd_internal(char m, short n, const
> void *base, int stride);
>
> _tile_data _tile_dpbssd_internal(char m, short n, short k,
> _tile_data dst, _tile_data src1, _tile_data src2);
>
> _tile_data _tile_dpbf16ps_internal(char m, short n, short
> k, _tile_data dst, _tile_data src1, _tile_data src2);
>
> void _tile_stored_internal(char m, short n, void *base,
> int stride, _tile_data tile);
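
The constraint that a shape "cannot exceed the size that AMX HW can support" can be sketched as a small range check. The limits used are the palette 1 maximums (16 rows, 64 bytes per row). The helper names are hypothetical, and the sketch ASSUMES that n and k are row widths in bytes, which the proposal does not state explicitly.

```c
#include <assert.h>
#include <stdbool.h>

/* Palette 1 limits: at most 16 rows per tile and at most 64 bytes per
   row (16 x 64 = 1024 bytes). */
enum { TILE_MAX_ROWS = 16, TILE_MAX_COLSB = 64 };

/* Hypothetical helper: can the hardware hold a tile of this shape? */
bool tile_shape_ok(int rows, int colsb) {
    return rows >= 1 && rows <= TILE_MAX_ROWS &&
           colsb >= 1 && colsb <= TILE_MAX_COLSB;
}

/* Hypothetical check for a dot product such as
   _tile_dpbssd_internal(m, n, k, dst, src1, src2), ASSUMING n and k
   are given in bytes:
     dst  is m rows   x n bytes
     src1 is m rows   x k bytes
     src2 is k/4 rows x n bytes (four int8 values feed one int32) */
bool tile_dp_shapes_ok(int m, int n, int k) {
    return k % 4 == 0 &&
           tile_shape_ok(m, n) &&
           tile_shape_ok(m, k) &&
           tile_shape_ok(k / 4, n);
}
```

When m, n and k are only known at runtime, a check like this would have to run at runtime as well; with constant shapes the compiler could diagnose violations statically.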
>
> 3. User interfaces.
>
> The tile shape and tile data are combined into a struct in
> C. The shape of the tile may only be
> initialized once. The user interface looks like this.
>
> #define __DEFAULT_FN_AMX \
>   __attribute__((__always_inline__, __nodebug__, __target__("amx-int8")))
>
> typedef struct __tile_str {
>   const char row;
>   const short col;
>   _tile_data tile;
> } __tile;
>
> __DEFAULT_FN_AMX
> void __tile_loadd(__tile *dst, const void *base, long stride) {
>   dst->tile = _tile_loadd_internal(dst->row, dst->col, base, stride);
> }
>
> __DEFAULT_FN_AMX
> void __tile_dpbsud(__tile *dst, __tile src1, __tile src2) {
>   dst->tile = _tile_dpbssd_internal(src1.row, src2.col, src1.col,
>                                     dst->tile, src1.tile, src2.tile);
> }
>
> __DEFAULT_FN_AMX
> void __tile_stored(void *base, long stride, __tile src) {
>   _tile_stored_internal(src.row, src.col, base, stride, src.tile);
> }
>
> 4. Example code
>
> The example shows how to use the user interface in a
> function.
>
> void api(int cond, short row, short col) {
>   __tile a = {row, col};
>   __tile b = {row, col};
>   __tile c = {row, col};
>
>   if (cond) {
>     __tile_loadd(&a, buf, STRIDE);
>     __tile_loadd(&b, buf, STRIDE);
>     __tile_loadd(&c, buf, STRIDE);
>   } else {
>     __tile_loadd(&a, buf2, STRIDE);
>     __tile_loadd(&b, buf2, STRIDE);
>     __tile_loadd(&c, buf2, STRIDE);
>   }
>   __tile_dpbsud(&c, a, b);
>   __tile_stored(buf, STRIDE, c);
> }
>
> 5. LLVM IR
>
> The LLVM IR intrinsics take the row and column information
> as input parameters, so that the compiler can deduce the
> shape of the tile data. The remaining parameters are what the AMX
> instructions require. This is the LLVM IR corresponding to
> the example code.
>
> define dso_local void @api(i32 %cond, i16 signext %row,
>     i16 signext %col) local_unnamed_addr #2 {
> entry:
>   %tobool = icmp eq i32 %cond, 0
>   %sext = shl i16 %col, 8
>   %conv.i31 = ashr exact i16 %sext, 8
>   br i1 %tobool, label %if.else, label %if.then
>
> if.then:                                ; preds = %entry
>   %0 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0),
>       i64 32) #3
>   %1 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0),
>       i64 32) #3
>   %2 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0),
>       i64 32) #3
>   br label %if.end
>
> if.else:                                ; preds = %entry
>   %3 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0),
>       i64 32) #3
>   %4 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0),
>       i64 32) #3
>   %5 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0),
>       i64 32) #3
>   br label %if.end
>
> if.end:                                 ; preds = %if.else, %if.then
>   %a.sroa.1186.0 = phi <256 x i32> [ %3, %if.else ], [ %0, %if.then ]
>   %b.sroa.1068.0 = phi <256 x i32> [ %4, %if.else ], [ %1, %if.then ]
>   %c.sroa.1149.0 = phi <256 x i32> [ %5, %if.else ], [ %2, %if.then ]
>   %6 = tail call <256 x i32> @llvm.x86.tdpbssd(i16 %row, i16 %conv.i31,
>       i16 %conv.i31, <256 x i32> %c.sroa.1149.0, <256 x i32> %a.sroa.1186.0,
>       <256 x i32> %b.sroa.1068.0) #3
>   tail call void @llvm.x86.tilestored64(i16 %row, i16 %conv.i31,
>       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0),
>       i64 32, <256 x i32> %6) #3
>   ret void
> }
>
> 6. Shape propagation
>
> In an -O0 build, the front end generates some general loads and
> stores of the tile vector. We need to start from the AMX
> intrinsics and propagate the shape information to the
> virtual tile registers. If an AMX intrinsic uses the
> result of a load instruction, the shape is propagated to the
> load and the load is transformed into a tile load intrinsic.
> If a store instruction uses the result of an AMX intrinsic,
> the shape is propagated to the store instruction and the store
> is transformed into a tile store intrinsic.
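
The two propagation rules in section 6 can be illustrated with a toy model. This is only a sketch under strong simplifying assumptions: the op struct, its fields and propagate_shapes are hypothetical, the "IR" is a straight-line array, and each operation has a single tile operand whose shape equals the shape carried by the AMX operation. Real IR and a real dot product (whose operands have different shapes) need more than this.

```c
#include <assert.h>

typedef enum { OP_LOAD, OP_STORE, OP_AMX } op_kind;

/* Hypothetical toy instruction. */
typedef struct {
    op_kind kind;
    int     src;      /* index of the op whose result this op uses, or -1 */
    int     rows;     /* shape; 0 means "not known yet" */
    int     cols;
    int     is_tile;  /* set when a generic load/store becomes a tile op */
} op;

/* Root at the AMX ops, which already carry a shape:
   - a load whose result is used by an AMX op inherits that op's shape;
   - a store that uses the result of an AMX op inherits that op's shape. */
void propagate_shapes(op *ops, int n) {
    for (int i = 0; i < n; ++i) {
        int s = ops[i].src;
        if (s < 0 || s >= n)
            continue;
        if (ops[i].kind == OP_AMX && ops[s].kind == OP_LOAD) {
            ops[s].rows = ops[i].rows;
            ops[s].cols = ops[i].cols;
            ops[s].is_tile = 1;   /* generic load -> tile load */
        } else if (ops[i].kind == OP_STORE && ops[s].kind == OP_AMX) {
            ops[i].rows = ops[s].rows;
            ops[i].cols = ops[s].cols;
            ops[i].is_tile = 1;   /* generic store -> tile store */
        }
    }
}
```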
>
> 7. Machine IR
>
> Since the AMX intrinsics take the row and column as
> input parameters, we can create a pseudo instruction
> corresponding to each of them. The AMX intrinsics are lowered to
> pseudo AMX instructions, which have extra row and column
> operands corresponding to those of the AMX intrinsic. The real AMX
> instructions don’t need the row and column operands. The
> row and column information should be configured by
> ldtilecfg before executing any AMX instruction.
>
> 8. Register allocation
>
> AMX registers are special. They need to be configured before
> use, and the config instruction is expensive. To avoid
> unnecessary tile configuration, we collect as much tile shape
> information as possible and combine it into one
> ldtilecfg instruction. The ldtilecfg instruction should
> dominate any AMX instruction that accesses a tile register. On
> the other hand, the ldtilecfg should post-dominate the
> instructions that define the tile shapes. For tile register
> spills, to avoid re-configuration due to a different tile
> shape, the spilled register should be reloaded into a
> register that has the same tile shape. Since tile
> register allocation is special and may allocate general
> virtual registers to configure the tile registers, we can add a
> separate pass to do it before the general register allocation
> pass. After register allocation, the tile shape
> information is no longer needed, so we can transform the
> pseudo AMX instructions into real AMX instructions by removing
> the row and column operands.
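
The spill rule in section 8 (reload into a register that has the same shape, so that no second ldtilecfg is needed) can be sketched as a simple lookup. The types and the function pick_reload_reg are hypothetical, and the model ignores liveness analysis entirely: it just records which physical tile registers are configured with which shape and which are currently occupied.

```c
#include <assert.h>

enum { NUM_TILE_REGS = 8 };   /* palette 1 provides 8 tile registers */

typedef struct {
    int rows;
    int colsb;
} tile_shape;

/* Hypothetical model of the state after ldtilecfg: the shape each
   physical tile register was configured with, and whether it is
   currently holding a live value. */
typedef struct {
    tile_shape shape[NUM_TILE_REGS];
    int        busy[NUM_TILE_REGS];
} tile_regs;

/* Pick a free physical register whose configured shape matches the
   spilled value, so the reload does not force a re-configuration.
   Returns the register index, or -1 if no such register is free. */
int pick_reload_reg(const tile_regs *regs, tile_shape want) {
    for (int r = 0; r < NUM_TILE_REGS; ++r) {
        if (!regs->busy[r] &&
            regs->shape[r].rows == want.rows &&
            regs->shape[r].colsb == want.colsb)
            return r;
    }
    return -1;
}
```

A result of -1 corresponds to the slow path described in the thread: either another value of the same shape must be spilled first, or the configuration has to change, which clears every tile register.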
>
> This seems complicated.
>
> Reading through the documentation, there appears to be a
> single global tile config for all tile registers at any time.
>
> Why not simply model this tile config as a designated special
> register and the tile instructions as having an implicit use
> of this register? That would seem to ensure that the register
> allocator has all the constraints needed. You'd need to teach
> it how to spill the special registers with the appropriate
> instructions, but that seems a lot more straightforward?
>
> 9. Use recommendation
>
> Due to the shape configuration issue, we recommend that users
> define the tile shape at the entry of the function
> and inline functions as much as possible. The AMX
> instructions focus on computation rather than storage, so
> global variables for tile data are not recommended.
>
> Thanks
>
> Yuanke
>
>
>
>
>
> _______________________________________________
>
> LLVM Developers mailing list
>
> llvm-dev at lists.llvm.org
>
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory