[llvm-dev] Intel AMX programming model discussion.
Hal Finkel via llvm-dev
llvm-dev at lists.llvm.org
Wed Aug 19 12:52:19 PDT 2020
On 8/19/20 10:24 AM, Kaylor, Andrew wrote:
>
> > When the tile shape is unknown at compile time, how do you plan to
> do the register allocation of the tiles? My question is: do you do the
> allocation for this case in the same way as you would if you knew the
> size was 16x16 (i.e., conservatively assume the largest size)?
>
> I think what will happen is that the registers are allocated based on
> a number of runtime values that are assumed to be different from one
> another but less than or equal to 16. So, for example, we’ll allocate
> registers for MxN tiles, NxM tiles and MxM tiles without knowing what
> M and N are. Then at runtime the values of these variables will be
> used to create the actual tile configuration. The instructions that
> need to know the shape take these runtime values as operands.
>
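As a rough illustration of the runtime configuration step described above, the sketch below fills the 64-byte memory image that ldtilecfg reads, for MxN, NxM and MxM tiles whose M and N are only known at runtime. The struct layout follows the ldtilecfg memory format in Intel's programming reference. The names `tilecfg`, `tilecfg_set` and `build_cfg`, the mapping of the three shapes to tiles 0 to 2, and the convention that columns are counted in 4-byte elements are assumptions made for this example and are not part of the proposal.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 64-byte memory image consumed by ldtilecfg. */
typedef struct {
  uint8_t  palette_id;   /* 1 selects the palette with 8 tile registers */
  uint8_t  start_row;    /* restart row for interrupted tile loads/stores */
  uint8_t  reserved[14];
  uint16_t colsb[16];    /* bytes per row, one entry per tile */
  uint8_t  rows[16];     /* number of rows, one entry per tile */
} tilecfg;

static_assert(sizeof(tilecfg) == 64, "tile configuration must be 64 bytes");

/* Hypothetical helper: record the shape of one physical tile.
   cols is counted in 4-byte elements (an assumption of this sketch).
   Returns 0 on success, -1 if the tile index or shape is out of range. */
static int tilecfg_set(tilecfg *cfg, unsigned tile, unsigned rows, unsigned cols) {
  if (tile >= 8 || rows == 0 || rows > 16 || cols == 0 || cols > 16)
    return -1;
  cfg->rows[tile] = (uint8_t)rows;
  cfg->colsb[tile] = (uint16_t)(cols * 4);
  return 0;
}

/* Hypothetical helper: configure tile 0 as MxN, tile 1 as NxM and
   tile 2 as MxM from runtime values of m and n. */
static int build_cfg(tilecfg *cfg, unsigned m, unsigned n) {
  memset(cfg, 0, sizeof *cfg);
  cfg->palette_id = 1;
  if (tilecfg_set(cfg, 0, m, n) != 0) return -1;
  if (tilecfg_set(cfg, 1, n, m) != 0) return -1;
  if (tilecfg_set(cfg, 2, m, m) != 0) return -1;
  return 0;
}
```

In this sketch the allocation (which tile holds which shape) is fixed before M and N are known, and only the contents of the configuration image depend on the runtime values.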
So you're going to multiversion the code?
In any case, my point is that you probably don't need a custom register
allocator. If you just define the tile registers and make sure that the
ldtilecfg instructions implicitly define them all, then the regular infrastructure
likely works. You'll have a bunch of register classes, but that's not
necessarily a problem. I recommend trying this, and let us know what you
discover, before we go down the road of a new, dedicated allocator just
for these registers.
-Hal
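
A toy model of the bookkeeping implied by this suggestion, assuming registers from different shape classes alias the same eight physical tiles: once the ordinary allocator has chosen a physical tile for each value, the shape of each physical tile follows from the assignment and can be collected for the ldtilecfg. The types `shape` and `assignment` and the function `collect_shapes` are hypothetical; this models only the data flow and is not LLVM's register allocator or its API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { NUM_TILES = 8 };

/* Tile shape; rows == 0 && cols == 0 marks an unused physical tile. */
typedef struct { uint8_t rows, cols; } shape;

/* One allocation decision: a value of a given shape class was placed
   in physical tile 'phys'. */
typedef struct {
  unsigned phys;
  shape    shp;
} assignment;

/* Derive one shape per physical tile from the allocator's decisions.
   Returns 0 on success, or -1 if a tile index is out of range or two
   values with different shapes share a physical tile (which would
   need a second configuration). */
static int collect_shapes(const assignment *a, size_t n, shape out[NUM_TILES]) {
  assert(a != NULL || n == 0);
  for (size_t i = 0; i < NUM_TILES; ++i)
    out[i] = (shape){0, 0};
  for (size_t i = 0; i < n; ++i) {
    if (a[i].phys >= NUM_TILES)
      return -1;
    shape *s = &out[a[i].phys];
    if (s->rows == 0 && s->cols == 0)
      *s = a[i].shp;                       /* first use fixes the shape */
    else if (s->rows != a[i].shp.rows || s->cols != a[i].shp.cols)
      return -1;                           /* conflicting shapes */
  }
  return 0;
}
```

The conflict case is where a diagnostic of the kind discussed earlier in the thread would be reported.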
> There may be some artifacts coming from the front end that
> conservatively assume a 16x16 tile, but I think those generally go
> away in SROA or later specialized passes. Yuanke can confirm or
> correct my understanding of this.
>
> *From:* Hal Finkel <hfinkel at anl.gov>
> *Sent:* Wednesday, August 19, 2020 5:14 AM
> *To:* Luo, Yuanke <yuanke.luo at intel.com>; Kaylor, Andrew
> <andrew.kaylor at intel.com>; Philip Reames <listmail at philipreames.com>;
> llvm-dev at lists.llvm.org; florian_hahn at apple.com; Topper, Craig
> <craig.topper at intel.com>; Lu, Hongjiu <hongjiu.lu at intel.com>
> *Subject:* Re: [llvm-dev] Intel AMX programming model discussion.
>
> On 8/19/20 5:34 AM, Luo, Yuanke wrote:
>
> Having 256 register classes is not a problem in itself. It just
> seems like a lot of register classes to me.
>
> We don’t assume that the shape of each physical register is 16x16;
> the shape is defined by the user. By variable shape, I mean that the
> shape is known at runtime but unknown at compile time. Take the code
> below as an example: %row and %col are variables rather than
> constants. The compiler recognizes llvm.x86.tileloadd64 and deduces
> that the shape of %0 is %row x %col.
>
> %0 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16
> %col, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf,
> i64 0, i64 0), i64 32)
>
> When the tile shape is unknown at compile time, how do you plan to do
> the register allocation of the tiles? My question is: do you do the
> allocation for this case in the same way as you would if you knew the
> size was 16x16 (i.e., conservatively assume the largest size)?
>
> Thanks again,
>
> Hal
>
> *From:* Hal Finkel <hfinkel at anl.gov>
> *Sent:* Wednesday, August 19, 2020 4:58 PM
> *To:* Luo, Yuanke <yuanke.luo at intel.com>; Kaylor, Andrew
> <andrew.kaylor at intel.com>; Philip Reames <listmail at philipreames.com>;
> llvm-dev at lists.llvm.org; florian_hahn at apple.com; Topper, Craig
> <craig.topper at intel.com>; Lu, Hongjiu <hongjiu.lu at intel.com>
> *Subject:* Re: [llvm-dev] Intel AMX programming model discussion.
>
> On 8/19/20 2:21 AM, Luo, Yuanke wrote:
>
> Hi Hal,
>
> There are 3 aspects to be solved.
>
> 1. The HW supports a maximum shape of 16x16, so there are many
> register classes, from 1x1 to 16x16. We need 256 register classes.
>
> 2. We want to support variable shapes, so the compiler doesn’t know
> which register class fits a tile shape, because the shape is only
> known at runtime.
>
> 3. The tile configuration configures the physical tile registers,
> so we need to allocate registers first. Only then do we know the
> shape of each physical tile register and can configure it.
>
> I think your suggestion helps to reduce the complexity if
> we only support fixed (constant) tile shapes.
>
> -Yuanke
>
> Thanks, Yuanke.
>
> It's not clear to me that having 256 register classes is, in
> itself, a problem. Is it?
>
> What does it mean to support variable-shape tiles in this context?
> Do you do something other than conservatively assume that they are
> 16x16 for register-allocation purposes?
>
> -Hal
>
> *From:* Hal Finkel <hfinkel at anl.gov>
> *Sent:* Wednesday, August 19, 2020 8:20 AM
> *To:* Kaylor, Andrew <andrew.kaylor at intel.com>; Philip Reames
> <listmail at philipreames.com>; Luo, Yuanke <yuanke.luo at intel.com>;
> llvm-dev at lists.llvm.org; florian_hahn at apple.com; Topper, Craig
> <craig.topper at intel.com>; Lu, Hongjiu <hongjiu.lu at intel.com>
> *Subject:* Re: [llvm-dev] Intel AMX programming model discussion.
>
> Hi, Andy,
>
> I don't quite understand everything that's going on here.
> Could we model this as:
>
> 1. Define a collection of register classes, one for 2x4
> tiles, one for 4x2 tiles, etc. each populated with a set of
> tile registers. Registers can have aliasing relationships
> (instead of worrying about any kind of subregister/superregister
> relationships -- these won't be useful anyway).
>
> 2. Define the tile-configuration instructions so that they
> implicitly define all of the registers in all of the classes.
>
> Then you would still need to pre-schedule the tile operations
> as you've described, and collect the configuration information
> in order to add the ldtilecfgs, but the regular register
> allocator can handle the allocation itself in the usual way.
> What do you think?
>
> -Hal
>
> On 8/18/20 6:58 PM, Kaylor, Andrew via llvm-dev wrote:
>
> The AMX registers are complicated. The single
> configuration register (which is mostly used implicitly,
> similar to MXCSR for floating point) controls the shape of
> all the tile registers, and if you change the tile
> configuration every single tile register is cleared. In
> practice, if we have to change the configuration while
> any of the tile registers are live, performance is going
> to be terrible. We need to handle this case for
> correctness, but users of this programming interface will
> need to have enough awareness of the performance issues
> and the hardware details to prevent this. We’ll also want
> a diagnostic that lets the user know when this has happened.
>
> When the tile configuration is set, the shape of each tile
> is locked in, so the individual tile registers aren’t
> interchangeable at that point. If a function needs 2x4
> tiles, 4x2 tiles, and 4x4 tiles, the configuration needs
> to be set with this in mind. The shape isn’t explicit in
> every instruction and intrinsic. It must be deduced. And
> again, we’ll need a way to tell the user when efficient
> allocation can’t be done. In practice, I don’t expect any
> function to be using more than three tile shapes.
>
> The implication of all this is that I don’t think the
> greedy register allocator is well suited to figure all of
> this out. We need a special pass to pre-allocate these
> registers. If the function is written in a way that makes
> good performance possible, it should be a relatively
> simple task to allocate everything with minimal spilling.
> If it isn’t possible to get good performance, we don’t
> need to do anything especially clever. We can just do
> something straightforward that is correct and let the user
> know that they aren’t going to be happy with the results.
>
> -Andy
>
> *From:* Philip Reames <listmail at philipreames.com>
> *Sent:* Friday, August 14, 2020 8:29 PM
> *To:* Luo, Yuanke <yuanke.luo at intel.com>; llvm-dev at lists.llvm.org;
> florian_hahn at apple.com; Kaylor, Andrew <andrew.kaylor at intel.com>;
> Topper, Craig <craig.topper at intel.com>; Lu, Hongjiu
> <hongjiu.lu at intel.com>
> *Subject:* Re: [llvm-dev] Intel AMX programming model discussion.
>
> I find your answer unconvincing. I'm not going to debate
> it as I don't wish to take the time to build the
> appropriate context, but my initial response is skepticism.
>
> Philip
>
> On 8/14/20 4:49 PM, Luo, Yuanke wrote:
>
> [Yuanke] The AMX register is special. It needs to be
> configured before use, and the config instruction is
> expensive. To avoid unnecessary tile configuration, we
> collect as much tile shape information as possible
> and combine it into one ldtilecfg instruction. The
> ldtilecfg instruction should dominate any AMX
> instruction that accesses a tile register. On the other
> side, the ldtilecfg should post-dominate the
> instructions that define the tile shape. A tile
> register spill should avoid a re-config caused by a
> different tile shape, so the spilled register should be
> reloaded into a register that shares the same tile
> shape. Since tile register allocation is special and
> may allocate general virtual registers to configure the
> tile registers, we can add a separate pass to do it
> before the general register allocation pass. After
> register allocation, the tile shape information is not
> needed anymore, so we can transform the pseudo AMX
> instructions into real AMX instructions by removing the
> row and column operands.
>
> [Philip]
>
> This seems complicated.
>
> Reading through the documentation, there appears to be
> a single global tile config for all tile registers at
> any time.
>
> Why not simply model this tile config as a designated
> special register and the tile instructions as having
> an implicit use of this register? That would seem to
> ensure that the register allocator has all the
> constraints needed. You'd need to teach it how to
> spill the special registers with the appropriate
> instructions, but that seems a lot more straightforward?
>
> [Yuanke] In that case users need to configure the tile
> registers themselves. Spilling the configuration register is
> very expensive, because it clears all the tile data
> registers to zero. In our proposal, the compiler is
> responsible for deducing the shape of each virtual tile
> data register, allocating physical registers for them,
> and then configuring those physical registers. We may
> build the dependency as you proposed, and it can be
> used in a machine IR check to ensure that each tile data
> register is configured before use.
>
> *From:* Philip Reames <listmail at philipreames.com>
> *Sent:* Saturday, August 15, 2020 1:17 AM
> *To:* Luo, Yuanke <yuanke.luo at intel.com>; llvm-dev at lists.llvm.org;
> florian_hahn at apple.com; Kaylor, Andrew <andrew.kaylor at intel.com>;
> Topper, Craig <craig.topper at intel.com>; Lu, Hongjiu
> <hongjiu.lu at intel.com>
> *Subject:* Re: [llvm-dev] Intel AMX programming model discussion.
>
> On 8/14/20 6:27 AM, Luo, Yuanke via llvm-dev wrote:
>
> Hi,
>
> Intel Advanced Matrix Extensions (Intel AMX) is a
> new programming paradigm consisting of two
> components: a set of 2-dimensional registers
> (tiles) representing sub-arrays from a larger
> 2-dimensional memory image, and accelerators able
> to operate on tiles. Capability of Intel AMX
> implementation is enumerated by palettes. Two
> palettes are supported: palette 0 represents the
> initialized state and palette 1 consists of 8 tile
> registers of up to 1 KB size, which is controlled
> by a tile control register.
>
> The instruction manual is posted at
> https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
>
> The AMX ABI proposal is posted at
> https://groups.google.com/g/x86-64-abi/c/NRejFm7pwb4
>
> This email is to discuss the programming model for
> AMX. Florian has introduced the matrix type and
> intrinsics in LLVM community. We’d like to adopt
> some ideas from it.
>
> Here is what we propose for the AMX programming model.
>
> 1. Data type.
>
> We’d like to have a fixed vector type for AMX. Since
> the shape of an AMX register is configurable, the
> vector size is the maximum size of an AMX register.
> That means the vector size is 1024 bytes.
>
> The C code may look like this.
>
> typedef int _tile_data
>     __attribute__((__vector_size__(1024), __aligned__(64)));
>
> _tile_data tile;
>
> And the LLVM IR may look like this.
>
> @tile = dso_local local_unnamed_addr global <256 x i32> zeroinitializer, align 64
>
> For LLVM IR, it would be nice to have a new type,
> x86_amxtile, that can be mapped to AMX registers.
>
> 2. AMX Intrinsics.
>
> The internal intrinsics map 1:1 to AMX instructions. The
> parameters m, n, and k identify the shape of the tile. The shape
> can be variable, but it cannot exceed the size that the AMX HW
> supports. The compiler can deduce the shape of a tile from the
> AMX intrinsics.
>
> _tile_data _tile_loadd_internal(char m, short n, const void *base,
>                                 int stride);
>
> _tile_data _tile_dpbssd_internal(char m, short n, short k,
>                                  _tile_data dst, _tile_data src1,
>                                  _tile_data src2);
>
> _tile_data _tile_dpbf16ps_internal(char m, short n, short k,
>                                    _tile_data dst, _tile_data src1,
>                                    _tile_data src2);
>
> void _tile_stored_internal(char m, short n, void *base, int stride,
>                            _tile_data tile);
>
> 3. User interfaces.
>
> The tile shape and tile data are combined into a struct in C.
> The shape of a tile may only be initialized once. The user
> interface looks like this.
>
> #define __DEFAULT_FN_AMX \
>   __attribute__((__always_inline__, __nodebug__, __target__("amx-int8")))
>
> typedef struct __tile_str {
>   const char row;
>   const short col;
>   _tile_data tile;
> } __tile;
>
> __DEFAULT_FN_AMX
> void __tile_loadd(__tile *dst, const void *base, long stride) {
>   dst->tile = _tile_loadd_internal(dst->row, dst->col, base, stride);
> }
>
> __DEFAULT_FN_AMX
> void __tile_dpbsud(__tile *dst, __tile src1, __tile src2) {
>   dst->tile = _tile_dpbssd_internal(src1.row, src2.col, src1.col,
>                                     dst->tile, src1.tile, src2.tile);
> }
>
> __DEFAULT_FN_AMX
> void __tile_stored(void *base, long stride, __tile src) {
>   _tile_stored_internal(src.row, src.col, base, stride, src.tile);
> }
>
> 4. Example code
>
> The example shows how to use the user interface in
> a function.
>
> void api(int cond, short row, short col) {
>   __tile a = {row, col};
>   __tile b = {row, col};
>   __tile c = {row, col};
>
>   if (cond) {
>     __tile_loadd(&a, buf, STRIDE);
>     __tile_loadd(&b, buf, STRIDE);
>     __tile_loadd(&c, buf, STRIDE);
>   } else {
>     __tile_loadd(&a, buf2, STRIDE);
>     __tile_loadd(&b, buf2, STRIDE);
>     __tile_loadd(&c, buf2, STRIDE);
>   }
>   __tile_dpbsud(&c, a, b);
>   __tile_stored(buf, STRIDE, c);
> }
>
> 5. LLVM IR
>
> The LLVM IR intrinsics take the row and column information as
> input parameters, so that the compiler can deduce the shape of
> the tile data. The remaining parameters are what the AMX
> instructions require. This is the LLVM IR corresponding to the
> example code.
>
> define dso_local void @api(i32 %cond, i16 signext %row, i16 signext %col)
>     local_unnamed_addr #2 {
> entry:
>   %tobool = icmp eq i32 %cond, 0
>   %sext = shl i16 %col, 8
>   %conv.i31 = ashr exact i16 %sext, 8
>   br i1 %tobool, label %if.else, label %if.then
>
> if.then:                                          ; preds = %entry
>   %0 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0),
>       i64 32) #3
>   %1 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0),
>       i64 32) #3
>   %2 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0),
>       i64 32) #3
>   br label %if.end
>
> if.else:                                          ; preds = %entry
>   %3 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0),
>       i64 32) #3
>   %4 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0),
>       i64 32) #3
>   %5 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31,
>       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0),
>       i64 32) #3
>   br label %if.end
>
> if.end:                                           ; preds = %if.else, %if.then
>   %a.sroa.1186.0 = phi <256 x i32> [ %3, %if.else ], [ %0, %if.then ]
>   %b.sroa.1068.0 = phi <256 x i32> [ %4, %if.else ], [ %1, %if.then ]
>   %c.sroa.1149.0 = phi <256 x i32> [ %5, %if.else ], [ %2, %if.then ]
>   %6 = tail call <256 x i32> @llvm.x86.tdpbssd(i16 %row, i16 %conv.i31,
>       i16 %conv.i31, <256 x i32> %c.sroa.1149.0, <256 x i32> %a.sroa.1186.0,
>       <256 x i32> %b.sroa.1068.0) #3
>   tail call void @llvm.x86.tilestored64(i16 %row, i16 %conv.i31,
>       i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0),
>       i64 32, <256 x i32> %6) #3
>   ret void
> }
>
> 6. Shape propagation
>
> In an -O0 build, the front end generates some general loads and
> stores of the tile vector. We need to start from the AMX
> intrinsics and propagate the shape information to the virtual
> tile registers. If an AMX intrinsic uses the result of a load
> instruction, the shape is propagated to the load and the load
> is transformed into a tile load intrinsic. If a store
> instruction uses the result of an AMX intrinsic, the shape is
> propagated to the store and the store is transformed into a
> tile store intrinsic.
>
> 7. Machine IR
>
> Since the AMX intrinsics take the row and column as input
> parameters, we can create a pseudo instruction corresponding to
> each of them. The AMX intrinsics are lowered to pseudo AMX
> instructions that carry the extra row and column operands of
> the intrinsic. The real AMX instructions don’t need the row
> and column operands. The row and column information should be
> configured by ldtilecfg before any AMX instruction executes.
>
> 8. Register allocation
>
> The AMX register is special. It needs to be configured
> before use, and the config instruction is expensive. To
> avoid unnecessary tile configuration, we collect as much
> tile shape information as possible and combine it into
> one ldtilecfg instruction. The ldtilecfg instruction
> should dominate any AMX instruction that accesses a tile
> register. On the other side, the ldtilecfg should
> post-dominate the instructions that define the tile
> shape. A tile register spill should avoid a re-config
> caused by a different tile shape, so the spilled register
> should be reloaded into a register that shares the same
> tile shape. Since tile register allocation is special and
> may allocate general virtual registers to configure the
> tile registers, we can add a separate pass to do it
> before the general register allocation pass. After
> register allocation, the tile shape information is not
> needed anymore, so we can transform the pseudo AMX
> instructions into real AMX instructions by removing the
> row and column operands.
>
> This seems complicated.
>
> Reading through the documentation, there appears to be
> a single global tile config for all tile registers at
> any time.
>
> Why not simply model this tile config as a designated
> special register and the tile instructions as having
> an implicit use of this register? That would seem to
> ensure that the register allocator has all the
> constraints needed. You'd need to teach it how to
> spill the special registers with the appropriate
> instructions, but that seems a lot more straightforward?
>
> 9. Use recommendation
>
> Because of the shape configuration issue, we recommend that
> users define the tile shape at function entry and inline
> functions as much as possible. The AMX instructions focus on
> computation rather than storage, so global variables for
> tile data are not recommended.
>
> Thanks
>
> Yuanke
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
> --
>
> Hal Finkel
>
> Lead, Compiler Technology and Programming Languages
>
> Leadership Computing Facility
>
> Argonne National Laboratory
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory