[llvm-dev] Intel AMX programming model discussion.

Mon Aug 24 02:02:47 PDT 2020

Hi, Yuanke,

Thanks for writing this up. Let me back up a bit because the scheme I 
proposed last week doesn't work without further modification: within a 
particular "configuration region" (i.e., the code in between the 
LDTILECFG and the TILERELEASE (or next LDTILECFG)), each tile register 
can only be used with one shape, and in addition, no register can have 
its shape changed without zeroing out all of the tile registers. Thus, 
just using different register classes for the different shapes, as I had 
suggested, isn't sufficient to model the allocation requirements. That 
would not prevent the same register from essentially being assigned to 
differently-shaped virtual registers with non-overlapping live ranges 
within one configuration region.
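
To make the failure mode concrete, here is a small Python model (not LLVM 
code; the function name and the tuple representation of a shape are 
hypothetical). It flags any physical register that ends up with more than 
one shape inside a single configuration region, regardless of whether the 
live ranges overlap:

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, int]  # (rows, columns); hypothetical representation


def find_shape_conflicts(assignments: List[Tuple[str, str, Shape]]) -> List[str]:
    """Return physical registers that are used with more than one shape.

    `assignments` lists (virtual register, physical register, shape) for one
    configuration region. Live ranges are deliberately ignored: the point is
    that non-overlapping live ranges do not make the reuse legal.
    """
    first_shape: Dict[str, Shape] = {}
    conflicts: List[str] = []
    for _vreg, phys, shape in assignments:
        seen = first_shape.setdefault(phys, shape)
        if seen != shape and phys not in conflicts:
            conflicts.append(phys)
    return conflicts


# %2 (1x1) is dead before %3 (1x2) is defined, so an ordinary allocator may
# hand both of them $tmm0. That is still illegal within one region.
region = [("%2", "$tmm0", (1, 1)), ("%3", "$tmm0", (1, 2))]
```

With separate register classes only, nothing stops the allocator from 
producing an assignment like `region`, for which this check reports $tmm0.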

Also, as you point out, when multiple non-static tile shapes are in use, 
if you use one register class for each shape, you would need different 
register classes for these too. Luckily, I don't think that using the 
separate register classes actually buys us anything, so please disregard 
that suggestion of mine. Use only one register class.

Once the configuration regions are identified, you'll know how many tile 
register shapes are required. If this number is greater than eight (the 
number of tile registers, tmm0 through tmm7), then you'll need to cut the 
region (requiring all live tiles to be spilled and restored around each 
re-configuration point). After that, we'll assume that we have eight or 
fewer distinct shapes.
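
As a rough illustration of the cutting step, the sketch below walks a 
straight-line sequence of tile shapes and starts a new region whenever a 
ninth distinct shape would be needed. This is only a model under stated 
assumptions: the function name is hypothetical, the greedy cut placement is 
my choice rather than anything prescribed here, and a real pass would work 
on the CFG and would also have to spill and restore live tiles at each cut.

```python
from typing import Hashable, List, Sequence

NUM_TILE_REGS = 8  # tmm0..tmm7


def cut_regions(shapes_in_order: Sequence[Hashable],
                limit: int = NUM_TILE_REGS) -> List[List[Hashable]]:
    """Greedily split a sequence of shape uses into regions.

    Each returned region uses at most `limit` distinct shapes. A cut is
    placed just before the use that would introduce one shape too many.
    """
    regions: List[List[Hashable]] = []
    current: List[Hashable] = []
    distinct = set()
    for shape in shapes_in_order:
        if shape not in distinct and len(distinct) == limit:
            # Re-configuration point: close the region and start a new one.
            regions.append(current)
            current, distinct = [], set()
        distinct.add(shape)
        current.append(shape)
    if current:
        regions.append(current)
    return regions
```

For ten different shapes used one after another, this yields one region 
with eight shapes followed by one with two.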

Now the problem is that you need to allocate registers, satisfying all 
of the usual constraints (non-overlapping live ranges, etc.), but with 
an additional constraint: once a physical register has been used with 
some particular tile shape, it cannot be assigned to any other tile shape.

I think that the current infrastructure can support this as follows:

  1. Add an override, X86RegisterInfo::getRegAllocationHints. As 
SystemZRegisterInfo::getRegAllocationHints sometimes does, the function 
will return true when hinting the tile registers (to indicate a hard 
constraint). As registers are assigned in RegAllocGreedy, 
getRegAllocationHints is called for each virtual register. For virtual 
tile registers, look at the passed VirtRegMap, etc. for already-assigned 
tile virtual registers whose shape requirements differ from those of the 
current virtual register (you'll need to cache the shape requirements in 
X86MachineFunctionInfo for this to be efficient), and return a hints 
list consisting of all of the other non-reserved tile registers, i.e., 
those not already assigned to a differently-shaped virtual register.

  2. To support RegAllocFast, which doesn't use getRegAllocationHints, 
you would need to make the configuration regions small enough that it 
doesn't matter (and if you're doing this around every tile instruction, 
this is automatically true).

  3. To support RegAllocPBQP (which is likely a good thing to do, but 
probably not required), I believe you can add custom constraints to the 
solver (kind of like what AArch64PBQPRegAlloc.cpp does).
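
The filtering described in item 1 can be modeled as follows. This is a 
Python sketch, not the LLVM hook itself: the real getRegAllocationHints is 
C++ and receives different arguments, and the names `allowed_tile_regs`, 
`assign`, `committed`, and `busy` are hypothetical. The `committed` map 
plays the role of the cached shape information.

```python
from typing import Dict, List, Optional, Set, Tuple

Shape = Tuple[int, int]
TILE_REGS = [f"$tmm{i}" for i in range(8)]


def allowed_tile_regs(shape: Shape, committed: Dict[str, Shape],
                      reserved: frozenset = frozenset()) -> List[str]:
    """Model of the hard-constraint hint list for one virtual tile register.

    A register is allowed if it is not reserved and is either unused so far
    or already committed to the same shape.
    """
    return [r for r in TILE_REGS
            if r not in reserved and committed.get(r, shape) == shape]


def assign(shape: Shape, committed: Dict[str, Shape],
           busy: Set[str]) -> Optional[str]:
    """Pick the first allowed register that is not live right now."""
    for reg in allowed_tile_regs(shape, committed):
        if reg not in busy:
            committed[reg] = shape  # the register keeps this shape from now on
            return reg
    return None  # a real allocator would have to spill or cut the region
```

In this model a 1x2 virtual register is never given a register that 
already held a 1x1 value, even when that register is no longer live.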

Once the allocation process is complete, you'll need to go back and 
update the LDTILECFG data to reflect the chosen shape -> register mapping.
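
For illustration, the sketch below builds the 64-byte memory image that 
LDTILECFG reads, given a register -> shape mapping. It follows the tile 
configuration layout as I understand it from the published AMX 
documentation (palette in byte 0, bytes-per-row as 16-bit values starting 
at byte 16, row counts starting at byte 48); please check that against the 
specification. The function name, and the choice to express a shape as 
(rows, bytes per row), are assumptions of this sketch.

```python
import struct
from typing import Dict, Tuple


def pack_tile_config(reg_shapes: Dict[int, Tuple[int, int]],
                     palette: int = 1) -> bytes:
    """Build a 64-byte tile configuration image.

    `reg_shapes` maps a tile register number (0..7) to (rows, bytes_per_row).
    Registers that are not mentioned are left with zero rows and columns.
    """
    cfg = bytearray(64)
    cfg[0] = palette  # byte 0: palette id
    # Byte 1 (start_row) and the reserved bytes stay zero.
    for reg, (rows, colsb) in reg_shapes.items():
        if not 0 <= reg < 8:
            raise ValueError(f"no such tile register: {reg}")
        if not (1 <= rows <= 16 and 1 <= colsb <= 64):
            raise ValueError(f"shape out of range for tmm{reg}")
        struct.pack_into("<H", cfg, 16 + 2 * reg, colsb)  # bytes 16..31
        cfg[48 + reg] = rows                              # bytes 48..55
    return bytes(cfg)
```

After allocation, a pass could call something like this once per 
configuration region and store the result to the stack slot that the 
LDTILECFG instruction loads from.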

What I don't know, however, is how well the getRegAllocationHints method 
will work. The benefit is that you don't need to write a custom 
pre-allocator allocator. On the other hand, it might visit the virtual 
registers to assign in a suboptimal order because it doesn't really 
understand the constraint being imposed (generally, we just assign 
larger live ranges first). Then again, it is a greedy algorithm, 
and if you want something systematically closer to optimal, maybe you 
should be using PBQP anyway. If you do end up needing a custom allocator 
for these, I recommend looking at the PBQP solver (which, as I recall, 
is independently reusable).

Hopefully, this is more helpful advice.

  -Hal

On 8/21/20 9:54 PM, Luo, Yuanke wrote:
>
> It seems I made a mistake about sharing register units. Can we share 
> register units among tile registers that are in different tile register 
> classes (each register class has a different tile shape)? Think 
> about two virtual tile registers, %2:vtile1x1 and %3:vtile1x2. First 
> %2 is allocated to $tmm0; after that, %2 is killed and %3 is allocated 
> to $tmm0. This is not allowed, because when $tmm0 is allocated to %2, 
> its shape is configured as 1x1. If we reallocate $tmm0 to %3, then we 
> need to re-configure $tmm0 to 1x2, which causes $tmm0~$tmm7 to be clobbered.
>
> Yuanke
>
> *From:* Luo, Yuanke
> *Sent:* Friday, August 21, 2020 2:12 PM
> *To:* Hal Finkel <hfinkel at anl.gov>; Topper, Craig 
> <craig.topper at intel.com>; Kaylor, Andrew <andrew.kaylor at intel.com>; 
> Philip Reames <listmail at philipreames.com>; llvm-dev at lists.llvm.org; 
> florian_hahn at apple.com; Lu, Hongjiu <hongjiu.lu at intel.com>
> *Subject:* RE: [llvm-dev] Intel AMX programming model discussion.
>
> Hi Hal,
>
> The proposal is attractive to me, but there is something I still can’t 
> figure out. Let’s take the MIR below as an example. We assume we have 256 
> register classes (vtile1x1, vtile1x2, …, vtile16x16).
>
> 1. After instruction selection, the pseudo AMX instructions are 
> generated. The names of the pseudo instructions have a ‘P’ prefix. At this 
> point all the AMX pseudo instructions take vtile as their register class. 
> Let’s assume %13 is constant 3, %10 is constant 4 and %14 is variable.
>
>   %1:vtile = PTILELOADDV %13:gr16, %10:gr16, %17:gr64, 1, %18:gr64_nosp, 0, $noreg
>
>   %2:vtile = PTILELOADDV %10:gr16, %14:gr16, %17:gr64, 1, %18:gr64_nosp, 0, $noreg
>
>   %3:vtile = PTILELOADDV %13:gr16, %14:gr16, %17:gr64, 1, %18:gr64_nosp, 0, $noreg
>
>   %21:vtile = PTDPBSSDV %13:gr16, %10:gr16, %14:gr16, %3:vtile(tied-def 0), %1:vtile, %2:vtile
>
> 2. The configuration-placement pass looks at all of the AMX 
> pseudo-instructions and identifies regions in which the 
> pseudo-instructions use the same configuration parameters. It first 
> replaces the register class for all tile registers whose shape is 
> known at compile time. Since the shape of %1 is constant, it 
> replaces %1:vtile with %1:vtile3x4, which changes the register class and 
> morphs the pseudo instruction into a real AMX instruction. The shapes of 
> %2 and %3 are unknown at compile time, so for each of them it arbitrarily 
> picks a tile register class that has not been assigned before and assigns 
> that register class to it. After register class allocation, the code is 
> transformed as follows. The register classes vtile1x1 for %2 and 
> vtile1x2 for %3 are allocated.
>
>   PLDTILECFG
>
>   %1:vtile3x4 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
>
>   %2:vtile1x1 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
>
>   %3:vtile1x2 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
>
>   %21:vtile1x2 = TDPBSSDV %3:vtile1x2(tied-def 0), %1:vtile3x4, %2:vtile1x1
>
> Some things I have not figured out:
>
> a. I am not sure whether an AMX instruction’s inputs and outputs can fit 
> multiple register classes (vtile1x1, …, vtile16x16); otherwise we need 
> 256 pseudo instructions.
>
> b. Whether 256 register classes are enough to allocate from. There may be 
> more than 256 tile registers of unknown shape.
>
> c. In this pass we also find the proper point (common dominator) to 
> insert ldtilecfg, but at this time the registers are not yet allocated, 
> so we don’t know the shape of each physical tile register. So we just 
> insert a pseudo tile config instruction.
>
> 3. All tile register classes share the same register unit. We do register 
> allocation using the framework, and the code is transformed as follows.
>
>   $tmm0 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
>
>   $tmm1 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
>
>   $tmm2 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
>
>   $tmm2 = TDPBSSDV $tmm2(tied-def 0), $tmm0, $tmm1
>
> 4. Run the config pass to collect the shape of each physical tile register 
> and configure them. The code can be generated as below. Here is the 
> problem: how can we know the shape of each physical tile register?
>
>   MOV row, col info to %stack.0 for each physical tile register ??????
>
>   LDTILECFG %stack.0, 1, $noreg, 0, $noreg, implicit-def $tmm0, 
>   implicit-def $tmm1, implicit-def $tmm2, implicit-def $tmm3, 
>   implicit-def $tmm4, implicit-def $tmm5, implicit-def $tmm6, 
>   implicit-def $tmm7
>
>   $tmm0 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
>
>   $tmm1 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
>
>   $tmm2 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
>
>   $tmm2 = TDPBSSDV $tmm2(tied-def 0), $tmm0, $tmm1
>
> Thanks
>
> Yuanke
>
> ...

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
