[llvm-dev] Intel AMX programming model discussion.
Hal Finkel via llvm-dev
llvm-dev at lists.llvm.org
Wed Aug 19 05:14:17 PDT 2020
On 8/19/20 5:34 AM, Luo, Yuanke wrote:
>
> There is no problem with having 256 register classes; it just seems
> like a lot of register classes to me.
>
> We don’t assume that the shape of each physical register is 16x16; the
> shape is defined by the user. By variable shape, I mean that the shape
> is known only at run time and is unknown at compile time. Take the code
> below as an example: %row and %col are variables rather than constants.
> The compiler recognizes llvm.x86.tileloadd64 and deduces that the shape
> of %0 is %row x %col.
>
> %0 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %col,
> i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64
> 0), i64 32)
>
When the tile shape is unknown at compile time, how do you plan to do
the register allocation of the tiles? My question is: do you do the
allocation for this case in the same way as you would if you knew the
size was 16x16 (i.e., conservatively assume the largest size)?
Thanks again,
Hal
> *From:* Hal Finkel <hfinkel at anl.gov>
> *Sent:* Wednesday, August 19, 2020 4:58 PM
> *To:* Luo, Yuanke <yuanke.luo at intel.com>; Kaylor, Andrew
> <andrew.kaylor at intel.com>; Philip Reames <listmail at philipreames.com>;
> llvm-dev at lists.llvm.org; florian_hahn at apple.com; Topper, Craig
> <craig.topper at intel.com>; Lu, Hongjiu <hongjiu.lu at intel.com>
> *Subject:* Re: [llvm-dev] Intel AMX programming model discussion.
>
> On 8/19/20 2:21 AM, Luo, Yuanke wrote:
>
> Hi Hal,
>
> There are three aspects to be solved.
>
> 1. The HW supports a maximum shape of 16x16, so there are many register
> classes, from 1x1 to 16x16. We need 256 register classes.
>
> 2. We want to support variable shapes, so the compiler doesn’t know which
> register class fits a tile, because its shape is only known at run time.
>
> 3. The tile configuration applies to physical tile registers, so we
> need to allocate registers first; only then do we know the shape of
> each physical tile register and can configure it.
>
> I think your suggestion helps reduce the complexity if we
> only support fixed (constant) tile shapes.
>
> -Yuanke
>
> Thanks, Yuanke.
>
> It's not clear to me that having 256 register classes is, in itself, a
> problem. Is it?
>
> What does it mean to support variable-shape tiles in this context? Do
> you do something other than conservatively assume that they are 16x16
> for register-allocation purposes?
>
> -Hal
>
> *From:* Hal Finkel <hfinkel at anl.gov>
> *Sent:* Wednesday, August 19, 2020 8:20 AM
> *To:* Kaylor, Andrew <andrew.kaylor at intel.com>; Philip Reames
> <listmail at philipreames.com>; Luo, Yuanke <yuanke.luo at intel.com>;
> llvm-dev at lists.llvm.org; florian_hahn at apple.com; Topper, Craig
> <craig.topper at intel.com>; Lu, Hongjiu <hongjiu.lu at intel.com>
> *Subject:* Re: [llvm-dev] Intel AMX programming model discussion.
>
> Hi, Andy,
>
> I don't quite understand everything that's going on here. Could we
> model this as:
>
> 1. Define a collection of register classes, one for 2x4 tiles,
> one for 4x2 tiles, etc., each populated with a set of tile
> registers. Registers can have aliasing relationships (instead of
> worrying about any kind of subregister/superregister relationships --
> these won't be useful anyway).
>
> 2. Define the tile-configuration instructions so that they
> implicitly define all of the registers in all of the classes.
>
> Then you would still need to pre-schedule the tile operations as
> you've described, and collect the configuration information in
> order to add the ldtilecfgs, but the regular register allocator
> can handle the allocation itself in the usual way. What do you think?
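The one-class-per-shape idea can be sketched with a toy numbering scheme in C. Everything here is hypothetical (shape_class, class_reg, regs_alias, and the numbering itself); it is not how LLVM register classes are declared. It only illustrates that 16 x 16 shapes give 256 classes and that registers in different classes alias when they name the same physical tile.

```c
#include <assert.h>

#define MAX_DIM   16   /* largest row or column count mentioned in the thread */
#define NUM_TILES 8    /* physical tile registers in palette 1 */

/* Hypothetical numbering: class 0 is 1x1, class 255 is 16x16. */
int shape_class(int rows, int cols) {
    assert(rows >= 1 && rows <= MAX_DIM);
    assert(cols >= 1 && cols <= MAX_DIM);
    return (rows - 1) * MAX_DIM + (cols - 1);
}

/* Hypothetical id for physical tile `t` as seen through class `cls`. */
int class_reg(int cls, int t) {
    assert(cls >= 0 && cls < MAX_DIM * MAX_DIM);
    assert(t >= 0 && t < NUM_TILES);
    return cls * NUM_TILES + t;
}

/* Two class registers alias when they name the same physical tile. */
int regs_alias(int a, int b) {
    return a % NUM_TILES == b % NUM_TILES;
}
```

With this numbering, tile 3 viewed as a 2x4 register and tile 3 viewed as a 4x2 register are distinct ids that alias each other, which is the relationship an allocator would need to respect.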
>
> -Hal
>
> On 8/18/20 6:58 PM, Kaylor, Andrew via llvm-dev wrote:
>
> The AMX registers are complicated. The single configuration
> register (which is mostly used implicitly, similar to MXCSR
> for floating point) controls the shape of all the tile
> registers, and if you change the tile configuration every
> single tile register is cleared. In practice, if we have to
> change the configuration while any of the tile registers
> are live, performance is going to be terrible. We need to
> handle this case for correctness, but users of this
> programming interface will need to have enough awareness of
> the performance issues and the hardware details to prevent
> this. We’ll also want a diagnostic that lets the user know
> when this has happened.
>
> When the tile configuration is set, the shape of each tile is
> locked in, so the individual tile registers aren’t
> interchangeable at that point. If a function needs 2x4 tiles,
> 4x2 tiles, and 4x4 tiles, the configuration needs to be set
> with this in mind. The shape isn’t explicit in every
> instruction and intrinsic. It must be deduced. And again,
> we’ll need a way to tell the user when efficient allocation
> can’t be done. In practice, I don’t expect any function to be
> using more than three tile shapes.
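A rough C sketch of the 64-byte configuration image that ldtilecfg reads may help here. The field layout follows my reading of the ISA reference linked in this thread and should be checked against it; the helper names (tile_config_init, tile_config_set) are hypothetical, and the 2x4 style shapes are assumed to count rows by 32-bit columns.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout of the 64-byte memory image loaded by ldtilecfg. */
typedef struct {
    uint8_t  palette_id;    /* 1 selects the 8-tile palette */
    uint8_t  start_row;     /* restart row for interrupted loads/stores */
    uint8_t  reserved[14];
    uint16_t colsb[16];     /* bytes per row, one entry per tile (8 used) */
    uint8_t  rows[16];      /* row count, one entry per tile (8 used) */
} tile_config;

_Static_assert(sizeof(tile_config) == 64, "config image must be 64 bytes");

/* Hypothetical helper: start from an all-zero image with palette 1. */
void tile_config_init(tile_config *cfg) {
    memset(cfg, 0, sizeof *cfg);
    cfg->palette_id = 1;
}

/* Hypothetical helper: fix the shape of physical tile `t`.
   `cols` counts 32-bit elements, so each row is cols * 4 bytes. */
void tile_config_set(tile_config *cfg, int t, int rows, int cols) {
    assert(t >= 0 && t < 8);
    assert(rows >= 1 && rows <= 16);
    assert(cols >= 1 && cols <= 16);
    cfg->rows[t]  = (uint8_t)rows;
    cfg->colsb[t] = (uint16_t)(cols * 4);
}
```

A function that needs 2x4, 4x2, and 4x4 tiles would fill three entries this way and then load the image once. The load itself is omitted because it requires AMX hardware.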
>
> The implication of all this is that I don’t think the greedy
> register allocator is well suited to figure all of this out.
> We need a special pass to pre-allocate these registers. If the
> function is written in a way that makes good performance
> possible, it should be a relatively simple task to allocate
> everything with minimal spilling. If it isn’t possible to get
> good performance, we don’t need to do anything especially
> clever. We can just do something straightforward that is
> correct and let the user know that they aren’t going to be
> happy with the results.
>
> -Andy
>
> *From:* Philip Reames <listmail at philipreames.com>
> *Sent:* Friday, August 14, 2020 8:29 PM
> *To:* Luo, Yuanke <yuanke.luo at intel.com>; llvm-dev at lists.llvm.org;
> florian_hahn at apple.com; Kaylor, Andrew <andrew.kaylor at intel.com>;
> Topper, Craig <craig.topper at intel.com>; Lu, Hongjiu
> <hongjiu.lu at intel.com>
> *Subject:* Re: [llvm-dev] Intel AMX programming model discussion.
>
> I find your answer unconvincing. I'm not going to debate it
> as I don't wish to take the time to build the appropriate
> context, but my initial response is skepticism.
>
> Philip
>
> On 8/14/20 4:49 PM, Luo, Yuanke wrote:
>
> [Yuanke] AMX registers are special. They need to be
> configured before use, and the config instruction is
> expensive. To avoid unnecessary tile configuration, we collect
> as much tile shape information as possible and combine
> it into one ldtilecfg instruction. The ldtilecfg
> instruction should dominate any AMX instruction that
> accesses a tile register. On the other hand, the ldtilecfg
> should post-dominate the instructions that define the tile
> shape. Tile register spills should not force a re-configuration
> because of differing tile shapes: a spilled register
> should be reloaded into a register that has the same
> tile shape. Since tile register allocation is special and
> may allocate general virtual registers to configure the tile
> registers, we can add a separate pass to do it before the
> general register allocation pass. After register
> allocation, the tile shape information is no longer needed,
> so we can transform the pseudo AMX instructions into
> real AMX instructions by removing the row and column operands.
>
> [Philip]
>
> This seems complicated.
>
> Reading through the documentation, there appears to be a
> single global tile config for all tile registers at any time.
>
> Why not simply model this tile config as a designated
> special register and the tile instructions as having an
> implicit use of this register? That would seem to ensure
> that the register allocator has all the constraints
> needed. You'd need to teach it how to spill the special
> registers with the appropriate instructions, but that
> seems a lot more straightforward?
>
> [Yuanke] In that case the user needs to configure the tile
> registers themselves. Spilling the configuration register is
> very expensive, because it clears all the tile data
> registers to zero. In our proposal, the compiler is responsible
> for deducing the shape of each virtual tile data register,
> allocating physical registers for them, and then configuring
> those physical registers. We may build the dependency as
> you proposed, and it can be used as a machine IR check to
> ensure that tile data registers are configured before use.
>
> *From:* Philip Reames <listmail at philipreames.com>
> *Sent:* Saturday, August 15, 2020 1:17 AM
> *To:* Luo, Yuanke <yuanke.luo at intel.com>; llvm-dev at lists.llvm.org;
> florian_hahn at apple.com; Kaylor, Andrew <andrew.kaylor at intel.com>;
> Topper, Craig <craig.topper at intel.com>; Lu, Hongjiu
> <hongjiu.lu at intel.com>
> *Subject:* Re: [llvm-dev] Intel AMX programming model discussion.
>
> On 8/14/20 6:27 AM, Luo, Yuanke via llvm-dev wrote:
>
> Hi,
>
> Intel Advanced Matrix Extensions (Intel AMX) is a new
> programming paradigm consisting of two components: a
> set of 2-dimensional registers (tiles) representing
> sub-arrays from a larger 2-dimensional memory image,
> and accelerators able to operate on tiles. Capability
> of Intel AMX implementation is enumerated by palettes.
> Two palettes are supported: palette 0 represents the
> initialized state and palette 1 consists of 8 tile
> registers of up to 1 KB size, which is controlled by a
> tile control register.
>
> The instruction manual is posted at
> https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html.
>
> The AMX ABI proposal is posted at
> https://groups.google.com/g/x86-64-abi/c/NRejFm7pwb4.
>
> This email is to discuss the programming model for
> AMX. Florian has introduced the matrix type and
> intrinsics in LLVM community. We’d like to adopt some
> ideas from it.
>
> Here is what we propose for the AMX programming model.
>
> 1. Data type.
>
> We’d like to have a fixed vector type for AMX. Since the
> shape of an AMX register is configurable, the vector
> size is the maximum size of an AMX register. That means
> the vector size is 1024 bytes.
>
> The C code may look like this.
>
> typedef int _tile_data
> __attribute__((__vector_size__(1024), __aligned__(64)));
>
> _tile_data tile;
>
> And the LLVM IR may look like this.
>
> @tile = dso_local local_unnamed_addr global <256 x
> i32> zeroinitializer, align 64
>
> For llvm IR, it is nice to have a new type x86_amxtile
> that can be mapped to AMX registers.
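The size arithmetic behind the 1024-byte vector can be sketched in C as follows, assuming the palette 1 limits of 16 rows and 64 bytes per row. Only the typedef comes from the proposal; TILE_MAX_ROWS, TILE_MAX_COLSB, and tile_bytes are made-up names for illustration, and the static checks assume a 4-byte int.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed palette 1 limits: 16 rows of 64 bytes each. */
#define TILE_MAX_ROWS  16
#define TILE_MAX_COLSB 64

/* The fixed vector type from the proposal (GCC/Clang vector extension). */
typedef int _tile_data
    __attribute__((__vector_size__(1024), __aligned__(64)));

/* Bytes actually occupied by a tile of `rows` rows, `colsb` bytes per row. */
size_t tile_bytes(int rows, int colsb) {
    assert(rows >= 1 && rows <= TILE_MAX_ROWS);
    assert(colsb >= 1 && colsb <= TILE_MAX_COLSB);
    return (size_t)rows * (size_t)colsb;
}

/* The vector type is sized for the largest tile, i.e. 256 i32 lanes. */
_Static_assert(sizeof(_tile_data) == TILE_MAX_ROWS * TILE_MAX_COLSB,
               "_tile_data must cover a full 16-row, 64-byte-per-row tile");
_Static_assert(sizeof(_tile_data) / sizeof(int) == 256,
               "1024 bytes correspond to <256 x i32>");
```

A smaller configured shape, for example 4 rows of 16 bytes, uses only part of the vector; the type always reserves the maximum.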
>
> 2. AMX Intrinsics.
>
> The internal intrinsics are mapped 1:1 to AMX
> instructions. The parameters m, n, and k identify the
> shape of the tile. The shape can be variable, but it
> cannot exceed the size that the AMX HW supports.
> The compiler can deduce the shape of the tile from the AMX
> intrinsics.
>
> _tile_data _tile_loadd_internal(char m, short n, const
> void *base, int stride);
>
> _tile_data _tile_dpbssd_internal(char m, short n,
> short k, _tile_data dst, _tile_data src1, _tile_data
> src2);
>
> _tile_data _tile_dpbf16ps_internal(char m, short n,
> short k, _tile_data dst, _tile_data src1, _tile_data
> src2);
>
> void _tile_stored_internal(char m, short n, void
> *base, int stride, _tile_data tile);
>
> 3. User interfaces.
>
> The tile shape and tile data are combined into a
> struct in C. The shape of the tile may only
> be initialized once. The user interface
> looks like this.
>
> #define __DEFAULT_FN_AMX \
>   __attribute__((__always_inline__, __nodebug__, __target__("amx-int8")))
>
> typedef struct __tile_str {
>   const char row;
>   const short col;
>   _tile_data tile;
> } __tile;
>
> __DEFAULT_FN_AMX
> void __tile_loadd(__tile *dst, const void *base, long stride) {
>   dst->tile = _tile_loadd_internal(dst->row, dst->col, base, stride);
> }
>
> __DEFAULT_FN_AMX
> void __tile_dpbsud(__tile *dst, __tile src1, __tile src2) {
>   dst->tile = _tile_dpbssd_internal(src1.row, src2.col, src1.col,
>                                     dst->tile, src1.tile, src2.tile);
> }
>
> __DEFAULT_FN_AMX
> void __tile_stored(void *base, long stride, __tile src) {
>   _tile_stored_internal(src.row, src.col, base, stride, src.tile);
> }
>
> 4. Example code
>
> The example shows how to use the user interface in a
> function.
>
> void api(int cond, short row, short col) {
>   __tile a = {row, col};
>   __tile b = {row, col};
>   __tile c = {row, col};
>
>   if (cond) {
>     __tile_loadd(&a, buf, STRIDE);
>     __tile_loadd(&b, buf, STRIDE);
>     __tile_loadd(&c, buf, STRIDE);
>   } else {
>     __tile_loadd(&a, buf2, STRIDE);
>     __tile_loadd(&b, buf2, STRIDE);
>     __tile_loadd(&c, buf2, STRIDE);
>   }
>   __tile_dpbsud(&c, a, b);
>   __tile_stored(buf, STRIDE, c);
> }
>
> 5. LLVM IR
>
> The LLVM IR intrinsics take the row and column
> information as input parameters, so that the compiler
> can deduce the shape of the tile data. The remaining
> parameters are what the AMX instructions require. This is
> the LLVM IR corresponding to the example code.
>
> define dso_local void @api(i32 %cond, i16 signext %row, i16 signext %col)
>     local_unnamed_addr #2 {
> entry:
>   %tobool = icmp eq i32 %cond, 0
>   %sext = shl i16 %col, 8
>   %conv.i31 = ashr exact i16 %sext, 8
>   br i1 %tobool, label %if.else, label %if.then
>
> if.then:                                  ; preds = %entry
>   %0 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0),
>       i64 32) #3
>   %1 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0),
>       i64 32) #3
>   %2 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0),
>       i64 32) #3
>   br label %if.end
>
> if.else:                                  ; preds = %entry
>   %3 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0),
>       i64 32) #3
>   %4 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0),
>       i64 32) #3
>   %5 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0),
>       i64 32) #3
>   br label %if.end
>
> if.end:                                   ; preds = %if.else, %if.then
>   %a.sroa.1186.0 = phi <256 x i32> [ %3, %if.else ], [ %0, %if.then ]
>   %b.sroa.1068.0 = phi <256 x i32> [ %4, %if.else ], [ %1, %if.then ]
>   %c.sroa.1149.0 = phi <256 x i32> [ %5, %if.else ], [ %2, %if.then ]
>   %6 = tail call <256 x i32> @llvm.x86.tdpbssd(i16 %row, i16 %conv.i31,
>       i16 %conv.i31, <256 x i32> %c.sroa.1149.0, <256 x i32> %a.sroa.1186.0,
>       <256 x i32> %b.sroa.1068.0) #3
>   tail call void @llvm.x86.tilestored64(i16 %row, i16 %conv.i31,
>       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0),
>       i64 32, <256 x i32> %6) #3
>   ret void
> }
>
> 6. Shape propagation
>
> In an -O0 build, the front end generates ordinary loads and
> stores of the tile vector. We need to start from the
> AMX intrinsics and propagate the shape information to
> the virtual tile registers. If an AMX intrinsic uses
> the result of a load instruction, the shape is
> propagated to the load and the load is transformed into a
> tile load intrinsic. If a store instruction uses the
> result of an AMX intrinsic, the shape is propagated to the
> store instruction and the store is transformed into a tile
> store intrinsic.
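The propagation rule above can be sketched over a toy instruction list in C. This is not the proposed LLVM pass: the struct, the enum, and the single-operand simplification are all assumptions made to keep the example small.

```c
#include <assert.h>
#include <stddef.h>

enum kind { PLAIN_LOAD, PLAIN_STORE, TILE_LOAD, TILE_STORE, TILE_OP };

/* Toy instruction: at most one operand, named by its index in the list. */
struct inst {
    enum kind kind;
    int operand;      /* index of the instruction whose result is used, or -1 */
    int rows, cols;   /* shape; 0 means "not known yet" */
};

/* One forward pass, rooted at the instructions that already carry a shape. */
void propagate_shapes(struct inst *insts, size_t n) {
    for (size_t i = 0; i < n; i++) {
        struct inst *cur = &insts[i];
        if (cur->operand < 0)
            continue;
        assert((size_t)cur->operand < n);
        struct inst *src = &insts[cur->operand];

        if (cur->kind == TILE_OP && src->kind == PLAIN_LOAD) {
            /* A tile op consumes a plain load: the load becomes a tile load. */
            src->kind = TILE_LOAD;
            src->rows = cur->rows;
            src->cols = cur->cols;
        } else if (cur->kind == PLAIN_STORE &&
                   (src->kind == TILE_OP || src->kind == TILE_LOAD)) {
            /* A plain store consumes a tile value: it becomes a tile store. */
            cur->kind = TILE_STORE;
            cur->rows = src->rows;
            cur->cols = src->cols;
        }
    }
}
```

For a three-instruction list (plain load, tile op on that load, plain store of the tile op), one pass turns the load and the store into their tile forms with the tile op's shape.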
>
> 7. Machine IR
>
> Since the AMX intrinsics take the row and column as
> input parameters, we can create a pseudo
> instruction corresponding to each of them. The AMX intrinsics
> are lowered to pseudo AMX instructions, which have
> extra row and column operands corresponding to those of the AMX
> intrinsic. The real AMX instructions don’t need the
> row and column operands. The row and column
> information should be configured by ldtilecfg before
> executing any AMX instruction.
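A minimal C sketch of the pseudo-to-real step, assuming constant shapes for simplicity (with variable shapes the configuration would have to be filled in at run time). The struct and function names are hypothetical and do not correspond to actual LLVM machine instruction definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pseudo instruction: still carries the shape operands. */
struct pseudo_tile_load {
    int tile;             /* physical tile chosen by register allocation */
    uint16_t rows, cols;  /* shape operands; cols in 32-bit elements */
    const void *base;
    long stride;
};

/* Hypothetical real instruction: the shape lives in the tile config. */
struct real_tile_load {
    int tile;
    const void *base;
    long stride;
};

/* Per-tile shapes gathered for the single ldtilecfg. */
struct shape_table {
    uint8_t  rows[8];     /* 0 means the tile has no shape yet */
    uint16_t colsb[8];
};

/* Records the shape for the config and drops the row/column operands.
   Returns 0 on success, -1 if the tile already has a different shape. */
int lower_tile_load(const struct pseudo_tile_load *p,
                    struct shape_table *shapes,
                    struct real_tile_load *out) {
    assert(p->tile >= 0 && p->tile < 8);
    uint16_t colsb = (uint16_t)(p->cols * 4);
    if (shapes->rows[p->tile] != 0 &&
        (shapes->rows[p->tile] != p->rows ||
         shapes->colsb[p->tile] != colsb))
        return -1;
    shapes->rows[p->tile]  = (uint8_t)p->rows;
    shapes->colsb[p->tile] = colsb;
    out->tile   = p->tile;
    out->base   = p->base;
    out->stride = p->stride;
    return 0;
}
```

The -1 return marks the case where one physical tile would need two shapes under the same configuration, which is where a re-configuration or a diagnostic would come in.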
>
> 8. Register allocation
>
> AMX registers are special. They need to be configured
> before use, and the config instruction is expensive. To
> avoid unnecessary tile configuration, we collect as much tile
> shape information as possible and combine it
> into one ldtilecfg instruction. The ldtilecfg
> instruction should dominate any AMX instruction that
> accesses a tile register. On the other hand, the ldtilecfg
> should post-dominate the instructions that define the
> tile shape. Tile register spills should not force a
> re-configuration because of differing tile shapes: a spilled
> register should be reloaded into a register that has
> the same tile shape. Since tile register allocation is
> special and may allocate general virtual registers
> to configure the tile registers, we can add a separate pass
> to do it before the general register allocation pass.
> After register allocation, the tile shape information
> is no longer needed, so we can transform the pseudo
> AMX instructions into real AMX instructions by removing
> the row and column operands.
>
> This seems complicated.
>
> Reading through the documentation, there appears to be a
> single global tile config for all tile registers at any time.
>
> Why not simply model this tile config as a designated
> special register and the tile instructions as having an
> implicit use of this register? That would seem to ensure
> that the register allocator has all the constraints
> needed. You'd need to teach it how to spill the special
> registers with the appropriate instructions, but that
> seems a lot more straightforward?
>
> 9. Use recommendation
>
> Due to the shape configuration issue, we recommend that users
> define the tile shape at function
> entry and inline functions as much as possible. The AMX
> instructions focus on computation rather than storage,
> so global variables for tile data are not recommended.
>
> Thanks
>
> Yuanke
>
>
>
>
>
>
> _______________________________________________
>
> LLVM Developers mailing list
>
> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
>
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev <https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
> --
>
> Hal Finkel
>
> Lead, Compiler Technology and Programming Languages
>
> Leadership Computing Facility
>
> Argonne National Laboratory