[llvm-dev] Intel AMX programming model discussion.

Sat Aug 15 16:42:56 PDT 2020

Sorry, I don't have deep knowledge of the hardware design, so I'm not able to answer the question.

-----Original Message-----
From: James Courtier-Dutton <james.dutton at gmail.com> 
Sent: Saturday, August 15, 2020 5:40 PM
To: Luo, Yuanke <yuanke.luo at intel.com>
Cc: Philip Reames <listmail at philipreames.com>; llvm-dev at lists.llvm.org; florian_hahn at apple.com; Kaylor, Andrew <andrew.kaylor at intel.com>; Topper, Craig <craig.topper at intel.com>; Lu, Hongjiu <hongjiu.lu at intel.com>
Subject: Re: [llvm-dev] Intel AMX programming model discussion.

>
> On 8/14/20 4:49 PM, Luo, Yuanke wrote:
>
> [Yuanke] AMX registers are special: they must be configured before use, and the configuration instruction is expensive. To avoid unnecessary tile configuration, we collect as much tile shape information as possible and combine it into one ldtilecfg instruction. The ldtilecfg instruction should dominate any AMX instruction that accesses a tile register. On the other hand, ldtilecfg should post-dominate the instructions that define the tile shapes. When a tile register is spilled, we should avoid reconfiguring because of a different tile shape, so the spilled value should be reloaded into a register that has the same tile shape. Because tile register allocation is special and may allocate general virtual registers to configure the tile registers, we can add a separate pass that runs before the general register allocation pass. After register allocation, the tile shape information is no longer needed, so we can transform the pseudo AMX instructions into real AMX instructions by removing the row and column operands.
>
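
For readers who have not seen the tile configuration, here is a minimal sketch in C of the 64-byte memory operand that ldtilecfg reads (palette 1 layout), and of how several tile shapes could be collected into one configuration. The names tile_shape and build_tile_config are hypothetical and are not part of LLVM or any Intel API. This is not the compiler pass described above, only an illustration of the data such a pass has to produce.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 64-byte memory operand consumed by ldtilecfg (palette 1 layout). */
typedef struct {
    uint8_t  palette_id;    /* byte 0: 1 selects palette 1 */
    uint8_t  start_row;     /* byte 1: restart row for interrupted loads/stores */
    uint8_t  reserved[14];  /* bytes 2-15: must be zero */
    uint16_t colsb[16];     /* bytes 16-47: bytes per row, one entry per tile */
    uint8_t  rows[16];      /* bytes 48-63: number of rows, one entry per tile */
} tile_config;

/* Hypothetical description of one tile's shape (illustration only). */
typedef struct {
    uint8_t  rows;   /* palette 1 allows at most 16 rows */
    uint16_t colsb;  /* palette 1 allows at most 64 bytes per row */
} tile_shape;

enum { MAX_TILES = 8, MAX_ROWS = 16, MAX_COLSB = 64 };

/* Combine up to MAX_TILES shapes into a single configuration so that one
 * ldtilecfg covers all of them. Returns 0 on success, -1 on an invalid shape
 * or tile count. */
int build_tile_config(const tile_shape *shapes, int n, tile_config *cfg)
{
    assert(shapes != NULL && cfg != NULL);
    if (n < 0 || n > MAX_TILES)
        return -1;

    memset(cfg, 0, sizeof *cfg);
    cfg->palette_id = 1;
    for (int i = 0; i < n; i++) {
        if (shapes[i].rows > MAX_ROWS || shapes[i].colsb > MAX_COLSB)
            return -1;
        cfg->rows[i]  = shapes[i].rows;
        cfg->colsb[i] = shapes[i].colsb;
    }
    /* On AMX hardware the configuration would then be loaded once, e.g.
     * with _tile_loadconfig(cfg) from <immintrin.h>; omitted here so the
     * sketch runs on any machine. */
    return 0;
}
```

With this layout, filling in three shapes such as 16x64, 16x64 and 8x32 yields one configuration block, which is the "collect and combine" step in miniature.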

Has some thought gone into how to make the config instruction less expensive?
I have, for a long time, thought that we need cleverer RAM.
For example, a single read request could return 64 bytes gathered at a fixed stride: byte 1, skip 99 bytes, byte 2, skip 99 bytes, byte 3, and so on.
Or, instead of "read the next instruction", "read the next basic block (a group of instructions) in one operation".
This would massively reduce the number of transactions between the CPU and the RAM chips.
It would be the RAM chip itself that would do the operation, and not the CPU.
It could also be expanded to have the RAM chip do some simple computations, e.g. atomic loads/stores/counters/xor/not/xchg, if they were cheap to do.
Essentially, this would let the RAM chip work more efficiently, with larger chunks of data per transaction.
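
To make the strided read concrete, here is a small software model in C of what such a request might return. The function name strided_read and its parameters are my own invention for illustration; no existing RAM interface is implied, and real hardware would perform the gather inside the memory device rather than in a CPU loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of one "strided read" transaction: starting at offset `base`,
 * take one byte every `stride` bytes until `count` bytes are gathered or
 * the end of memory is reached. Returns the number of bytes gathered. */
size_t strided_read(const uint8_t *mem, size_t mem_size,
                    size_t base, size_t stride,
                    uint8_t *out, size_t count)
{
    assert(mem != NULL && out != NULL);
    assert(stride > 0);

    size_t n = 0;
    for (size_t addr = base; n < count && addr < mem_size; addr += stride)
        out[n++] = mem[addr];
    return n;
}
```

"Byte 1, skip 99 bytes, byte 2" corresponds to a stride of 100, so strided_read(mem, size, 0, 100, out, 64) models one request returning 64 bytes, where a conventional memory system would touch 64 separate cache lines.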

Kind Regards

James