[llvm-dev] Intel AMX programming model discussion.
Hal Finkel via llvm-dev
llvm-dev at lists.llvm.org
Fri Sep 4 02:58:39 PDT 2020
On 9/4/20 3:37 AM, Luo, Yuanke wrote:
>
> Hi Hal,
>
> Thank you for the ideas that help us to improve the design, and sorry
> for replying late. There is something I am not able to figure out, and
> there are some special traits of tile RA.
>
You're quite welcome.
> 1. X86RegisterInfo::getRegAllocationHints can tell RA which physical
> register is preferred, but it can’t force RA to allocate only the
> hinted register. If the hint cannot be met, RA will allocate
> another register.
>
I addressed this below, but I could have been clearer. Like
SystemZRegisterInfo::getRegAllocationHints does sometimes, when hinting
the tile registers, the function will return true. This turns the
preference into a hard constraint, and the allocator will not allocate
any other register. That's my understanding from reading the code.
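
To make the constraint concrete, here is a minimal, self-contained sketch of
the filtering involved. It models the logic only and is not the real
TargetRegisterInfo interface: TileShape, TileRegState, BoundShapes, and
allowedTileRegs are hypothetical names, and the eight-register limit simply
mirrors tmm0..tmm7.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of a tile shape: rows x bytes per row.
struct TileShape {
  uint16_t Rows;
  uint16_t ColBytes;
  bool operator==(const TileShape &O) const {
    return Rows == O.Rows && ColBytes == O.ColBytes;
  }
};

// Hypothetical per-register state within one configuration region.
struct TileRegState {
  bool Bound;      // true once some virtual register was assigned here
  TileShape Shape; // meaningful only when Bound is true
};

constexpr unsigned NumTileRegs = 8; // tmm0..tmm7
using BoundShapes = std::array<TileRegState, NumTileRegs>;

// Returns the physical tile registers that a virtual register of shape
// Wanted may still use: registers that are unused so far, or that are
// already bound to exactly the same shape. Returning this list as a hard
// constraint is what keeps one register from seeing two shapes.
std::vector<unsigned> allowedTileRegs(const BoundShapes &Regs,
                                      const TileShape &Wanted) {
  assert(Wanted.Rows != 0 && Wanted.ColBytes != 0 && "shape must be known");
  std::vector<unsigned> Hints;
  for (unsigned R = 0; R < NumTileRegs; ++R)
    if (!Regs[R].Bound || Regs[R].Shape == Wanted)
      Hints.push_back(R);
  return Hints;
}
```

In the real override, the returned list would be written into the Hints
vector and the function would return true so that the allocator does not
look beyond it.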
> 2. The shape information should be attached to each virtual register
> and to each physical register that is allocated. How can we store and
> retrieve the shape information with limited code changes to the existing RA?
>
For each virtual register, getRegAllocationHints could just recompute
the shape information. If this isn't a constant-time operation, however,
you'll probably want to cache the computed shape requirements in
X86MachineFunctionInfo. You can add a map from registers to shape
information in that class, and access it from getRegAllocationHints.
You can store information about the physical registers there too.
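
Assuming the shape computation is not cheap, the cache could look roughly
like the sketch below. It is a standalone model: CachedShape, TileShapeCache,
getOrCompute, and contains are hypothetical names, and a plain unsigned
stands in for a register number. In-tree, this state would live in
X86MachineFunctionInfo.

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>

// Hypothetical shape record: rows x bytes per row.
struct CachedShape {
  uint16_t Rows;
  uint16_t ColBytes;
};

// Hypothetical cache keyed by register number.
class TileShapeCache {
  std::unordered_map<unsigned, CachedShape> Shapes;

public:
  // Returns the cached shape for Reg, running Compute only on a miss.
  const CachedShape &
  getOrCompute(unsigned Reg,
               const std::function<CachedShape(unsigned)> &Compute) {
    auto It = Shapes.find(Reg);
    if (It == Shapes.end())
      It = Shapes.emplace(Reg, Compute(Reg)).first;
    return It->second;
  }

  // True if a shape has already been recorded for Reg.
  bool contains(unsigned Reg) const { return Shapes.count(Reg) != 0; }
};
```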
Regarding the physical registers, you can grab this information in the
pre-rewrite phase. Override addPreRewrite in X86TargetMachine.cpp.
You'll need a small pass that records relevant information about the
assignments (which, I imagine, is the same small pass that updates the
LDTILECFG instructions). For an example of such a pass, see
AMDGPU/GCNNSAReassign.cpp
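
As a rough illustration of what that small pass has to produce, the sketch
below fills in the 64-byte block that LDTILECFG reads, given the final shape
of each physical tile register. It assumes the palette-1 layout as described
in Intel's ISA documentation (palette at byte 0, 16-bit bytes-per-row entries
starting at byte 16, 8-bit row counts starting at byte 48). PhysTileShape and
buildTileConfig are hypothetical names, and none of this is existing LLVM
code.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical final shape of one physical tile register.
struct PhysTileShape {
  uint8_t Rows;      // 0 means the register is unused
  uint16_t ColBytes; // bytes per row
};

// Builds the 64-byte configuration block for tmm0..tmm7 (assumed layout,
// see the note above). Unused registers keep zero rows and zero columns.
std::array<uint8_t, 64>
buildTileConfig(const std::array<PhysTileShape, 8> &Regs) {
  std::array<uint8_t, 64> Cfg{}; // zero-initialized, including reserved bytes
  Cfg[0] = 1;                    // palette 1
  for (unsigned R = 0; R < 8; ++R) {
    assert(Regs[R].Rows <= 16 && Regs[R].ColBytes <= 64 &&
           "shape exceeds the assumed palette-1 limits");
    // Bytes per row, little-endian 16-bit entry per register.
    Cfg[16 + 2 * R] = static_cast<uint8_t>(Regs[R].ColBytes & 0xFF);
    Cfg[16 + 2 * R + 1] = static_cast<uint8_t>(Regs[R].ColBytes >> 8);
    // Row count, one byte per register.
    Cfg[48 + R] = Regs[R].Rows;
  }
  return Cfg;
}
```

The pass would then store these bytes into the stack slot that the
LDTILECFG instruction loads from.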
> When a tile register is spilled, the shape should also be bound to the
> corresponding spill stack slot, so that the slot can be reloaded into a
> physical tile register with the same shape.
>
I'm not sure what you mean. If you don't want to just be conservative
about the spill size allocation, you do need to know the shape in order
to compute the spill-location size. I assume that you can grab that out
of X86MachineFunctionInfo from storeRegToStackSlot/loadRegFromStackSlot
or eliminateFrameIndex (or copyPhysReg) as needed.
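
For example, the slot size could be chosen along these lines. This is a
sketch under the assumption that a tile holds at most 16 rows of 64 bytes;
SpillShape and tileSpillBytes are hypothetical names, and passing a null
pointer models the case where the shape is not known.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical shape record: rows x bytes per row.
struct SpillShape {
  uint16_t Rows;
  uint16_t ColBytes;
};

constexpr unsigned MaxTileRows = 16;     // assumed architectural maximum
constexpr unsigned MaxTileColBytes = 64; // assumed architectural maximum

// Returns the number of bytes a spill slot needs for a tile. With no shape
// information, fall back to the conservative full-tile size.
unsigned tileSpillBytes(const SpillShape *Shape) {
  if (!Shape)
    return MaxTileRows * MaxTileColBytes; // 1024 bytes
  assert(Shape->Rows <= MaxTileRows && Shape->ColBytes <= MaxTileColBytes);
  return static_cast<unsigned>(Shape->Rows) * Shape->ColBytes;
}
```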
> 3. There is no mov/copy instruction for tile registers. To copy a tile
> register, we need to store it to memory and load the
> data from memory into another register. So a lot of the live-interval
> splitting code in Greedy RA is unnecessary for tile register allocation.
>
Yes, but this just means that you need to support copying through
memory. Setting CopyCost = -1 in X86RegisterInfo.td might help as well.
> 4. The compiler can support register spills, but spills should be avoided
> for performance. We prefer reporting a warning on a register spill, so
> that users notice it and can adjust their code to avoid the spill.
>
If you want to emit a diagnostic, you may be able to do that from
storeRegToStackSlot. In any case, please make use of the
optimization-remark infrastructure. For an example of how to do this,
see RAGreedy::reportNumberOfSplillsReloads in RegAllocGreedy.cpp.
> If there is no easy way to take advantage of the current RA
> infrastructure, there are some pros to having a separate RA for tile
> registers.
>
> 1. We can limit the risk of breaking RA for general registers on each arch.
> If there are bugs in tile RA, only applications that use AMX are
> affected.
>
That's true. But I also worry about that. Any time you need to write
non-trivial code that will be used relatively rarely, it's likely to
have bugs that take a long time to show up. If you can plug into the
generic infrastructure, you benefit from the fact that it's
highly-covered, often-used code. Not that you might not run into bugs,
of course, especially if you're using it in a new way, but the base
logic is likely to already be robust.
> 2. We can customize the special traits (config, split, spill) of tile
> registers in the separate RA more freely.
>
True.
-Hal
> For RegAllocFast, I agree with you. Each configuration region is small,
> and since performance is not the first priority, we can insert
> multiple configs, one for each small region.
>
> As you recommend looking at the PBQP solver, I’ll take some time to
> investigate it and get back to you.
>
> Thanks
>
> -Yuanke
>
> *From:* Hal Finkel <hfinkel at anl.gov>
> *Sent:* Monday, August 24, 2020 5:03 PM
> *To:* Luo, Yuanke <yuanke.luo at intel.com>; Topper, Craig
> <craig.topper at intel.com>; Kaylor, Andrew <andrew.kaylor at intel.com>;
> Philip Reames <listmail at philipreames.com>; llvm-dev at lists.llvm.org;
> florian_hahn at apple.com; Lu, Hongjiu <hongjiu.lu at intel.com>
> *Subject:* Re: [llvm-dev] Intel AMX programming model discussion.
>
> Hi, Yuanke,
>
> Thanks for writing this up. Let me back up a bit because the scheme I
> proposed last week doesn't work without further modification: within a
> particular "configuration region" (i.e., the code in between the
> LDTILECFG and the TILERELEASE (or next LDTILECFG)), each tile register
> can only be used with one shape, and in addition, no register can have
> its shape changed without zeroing out all of the tile registers. Thus,
> just using different register classes for the different shapes, as I
> had suggested, isn't sufficient to model the allocation requirements.
> That would not prevent the same register from essentially being
> assigned to differently-shaped virtual registers with non-overlapping
> live ranges within one configuration region.
>
> Also, as you point out, when multiple non-static tile shapes are in
> use, if you use one register class for each shape, you would need
> different register classes for these too. Luckily, I don't think that
> using the separate register classes actually buys us anything, so
> please disregard that suggestion of mine. Use only one register class.
>
> Once the configuration regions are identified, you'll know how many
> tile register shapes are required. If this number is greater than
> eight, then you'll need to cut the region (requiring all live tiles to
> be spilled and restored around each re-configuration point). After
> that, we'll assume that we have eight or fewer distinct shapes.
>
> Now the problem is that you need to allocate registers, satisfying all
> of the usual constraints (non-overlapping live ranges, etc.), but with
> an additional constraint: once a physical register has been used with
> some particular tile shape, it cannot be assigned to any other tile shape.
>
> I think that the current infrastructure can support this as follows:
>
> 1. Add an override X86RegisterInfo::getRegAllocationHints. Like
> SystemZRegisterInfo::getRegAllocationHints does sometimes, when
> hinting the tile registers, the function will return true (to indicate
> a hard constraint). As registers are assigned in RegAllocGreedy,
> getRegAllocationHints is called for each virtual register. For virtual
> tile registers, look at the passed VirtRegMap, etc. for
> already-assigned tile virtual registers with different shape
> requirements than the current virtual register (you'll need to cache the
> shape requirements in X86MachineFunctionInfo for this to be
> efficient), and return a hints list consisting of all the other
> non-reserved tile registers.
>
> 2. To support RegAllocFast, which doesn't use getRegAllocationHints,
> you would need to make the configuration regions small enough that it
> doesn't matter (and if you're doing this around every tile
> instruction, this is automatically true).
>
> 3. To support RegAllocPBQP (which is likely a good thing to do, but
> probably not required), I believe you can support this by adding
> custom constraints to the solver (kind of like what
> AArch64PBQPRegAlloc.cpp does).
>
> Once the allocation process is complete, you'll need to go back and
> update the LDTILECFG data to reflect the chosen shape -> register mapping.
>
> What I don't know, however, is how well the getRegAllocationHints
> method will work. The benefit is that you don't need to write a custom
> pre-allocator allocator. On the other hand, it might visit the virtual
> registers to assign in a suboptimal order because it doesn't really
> understand the constraint being imposed (generally, we just assign
> larger live ranges first). Then again, it is a greedy algorithm
> and if you want something systematically closer to optimal, maybe you
> should be using PBQP anyway. If you do end up needing a custom
> allocator for these, I recommend looking at the PBQP solver (which, as
> I recall, is independently reusable).
>
> Hopefully, this is more-helpful advice.
>
> -Hal
>
> On 8/21/20 9:54 PM, Luo, Yuanke wrote:
>
> It seems I made a mistake about sharing register units. Can we share
> a register unit among tile registers that are in different tile
> register classes (where each register class has a different tile
> shape)? Consider two virtual tile registers, %2:vtile1x1 and
> %3:vtile1x2. First %2 is allocated to $tmm0; after %2 is
> killed, %3 is allocated to $tmm0. This is not allowed, because
> when $tmm0 is allocated to %2, its shape is configured as 1x1. If
> we reallocate $tmm0 to %3, then we need to reconfigure $tmm0 as 1x2,
> which causes $tmm0~$tmm7 to be clobbered.
>
> Yuanke
>
> *From:* Luo, Yuanke
> *Sent:* Friday, August 21, 2020 2:12 PM
> *To:* Hal Finkel <hfinkel at anl.gov>; Topper, Craig
> <craig.topper at intel.com>; Kaylor, Andrew <andrew.kaylor at intel.com>;
> Philip Reames <listmail at philipreames.com>; llvm-dev at lists.llvm.org;
> florian_hahn at apple.com; Lu, Hongjiu <hongjiu.lu at intel.com>
> *Subject:* RE: [llvm-dev] Intel AMX programming model discussion.
>
> Hi Hal,
>
> The proposal is attractive to me, but there is something I still
> can’t figure out. Let’s take the MIR below as an example. We assume we
> have 256 register classes (vtile1x1, vtile1x2, …, vtile16x16).
>
> 1. After instruction selection, the pseudo AMX instructions are
> generated. The names of the pseudo instructions have a ‘P’ prefix. At
> this point all the AMX pseudo instructions take vtile as their register
> class. Let’s assume %13 is the constant 3, %10 is the constant 4, and
> %14 is a variable.
>
>   %1:vtile = PTILELOADDV %13:gr16, %10:gr16, %17:gr64, 1, %18:gr64_nosp, 0, $noreg
>   %2:vtile = PTILELOADDV %10:gr16, %14:gr16, %17:gr64, 1, %18:gr64_nosp, 0, $noreg
>   %3:vtile = PTILELOADDV %13:gr16, %14:gr16, %17:gr64, 1, %18:gr64_nosp, 0, $noreg
>   %21:vtile = PTDPBSSDV %13:gr16, %10:gr16, %14:gr16, %3:vtile(tied-def 0), %1:vtile, %2:vtile
>
> 2. The configuration-placement pass looks at all of the AMX
> pseudo-instructions and identifies regions in which the
> pseudo-instructions use the same configuration parameters. It
> first replaces the register class of every tile register whose
> shape is known at compile time. Since the shape of %1 is constant,
> it replaces %1:vtile with %1:vtile3x4, which changes the register
> class and morphs the pseudo instruction into the real AMX instruction.
> The shapes of %2 and %3 are unknown at compile time, so the pass
> arbitrarily picks tile register classes that have not been assigned
> before and assigns them to %2 and %3. Here %2 gets vtile1x1 and %3
> gets vtile1x2. After register class allocation, the code is
> transformed as follows.
>
>   PLDTILECFG
>   %1:vtile3x4 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
>   %2:vtile1x1 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
>   %3:vtile1x2 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
>   %21:vtile1x2 = TDPBSSDV %3:vtile1x2(tied-def 0), %1:vtile3x4, %2:vtile1x1
>
> Some things I have not figured out:
>
> 1. I am not sure whether an AMX instruction’s inputs and outputs can
> fit multiple register classes (vtile1x1, …, vtile16x16); otherwise
> we need 256 pseudo instructions.
>
> 2. Whether 256 register classes are enough. There may
> be more than 256 tile registers of unknown shape.
>
> 3. In this pass we also find the proper point (common dominator)
> at which to insert ldtilecfg, but at this time the registers are not
> yet allocated, so we don’t know the shape of each physical tile
> register. So we just insert a pseudo tile config instruction.
>
> 3. All tile register classes share the same register units. We do
> register allocation with the existing framework, and the code is
> transformed as follows.
>
>   $tmm0 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
>   $tmm1 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
>   $tmm2 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
>   $tmm2 = TDPBSSDV $tmm2(tied-def 0), $tmm0, $tmm1
>
> 4. Run the config pass to collect the shape of each physical tile
> register and configure them. The code can be generated as below. Here
> is the problem: how can we know the shape of each physical tile
> register?
>
>   MOV row, col info to %stack.0 for each physical tile register ??????
>   LDTILECFG %stack.0, 1, $noreg, 0, $noreg, implicit-def $tmm0,
>     implicit-def $tmm1, implicit-def $tmm2, implicit-def $tmm3,
>     implicit-def $tmm4, implicit-def $tmm5, implicit-def $tmm6,
>     implicit-def $tmm7
>   $tmm0 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
>   $tmm1 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
>   $tmm2 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
>   $tmm2 = TDPBSSDV $tmm2(tied-def 0), $tmm0, $tmm1
>
> Thanks
>
> Yuanke
>
> ...
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory