[llvm-dev] Intel AMX programming model discussion.

Sat Aug 15 02:40:22 PDT 2020

>
> On 8/14/20 4:49 PM, Luo, Yuanke wrote:
>
> [Yuanke] AMX registers are special. They need to be configured before use, and the config instruction is expensive. To avoid unnecessary tile configuration, we collect as much tile shape information as possible and combine it into one ldtilecfg instruction. The ldtilecfg instruction should dominate any AMX instruction that accesses a tile register. On the other hand, the ldtilecfg should post-dominate the instructions that define the tile shape. For tile register spills, to avoid re-configuration due to a different tile shape, the spilled register should be reloaded into a register that shares the same tile shape. Since tile register allocation is special and may allocate general virtual registers to configure the tile registers, we can add a separate pass to do it before the general register allocation pass. After register allocation, the tile shape information is no longer needed, so we can transform the pseudo AMX instructions into real AMX instructions by removing the row and column operands.
>
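
To make the quoted point about combining tile shapes concrete, here is a minimal Python sketch that packs several collected shapes into one 64-byte configuration block of the kind ldtilecfg loads. It assumes the palette 1 layout as I understand it (palette id at byte 0, start row at byte 1, 16-bit bytes-per-row values from byte 16, 8-bit row counts from byte 48). The function name build_tile_config and the shape dictionary are hypothetical and are not part of LLVM or any Intel API.

```python
import struct

def build_tile_config(shapes, palette=1):
    """Pack {tile_index: (rows, bytes_per_row)} into one 64-byte config block.

    Assumed layout (palette 1): byte 0 = palette id, byte 1 = start row,
    bytes 16.. = 16-bit little-endian bytes-per-row per tile,
    bytes 48.. = 8-bit row count per tile. Unused tiles stay zero.
    """
    cfg = bytearray(64)
    cfg[0] = palette
    cfg[1] = 0  # start_row: no interrupted load/store to resume
    for tile, (rows, colsb) in shapes.items():
        if not 0 <= tile < 8:
            raise ValueError("palette 1 is assumed to have 8 tiles")
        if not (1 <= rows <= 16 and 1 <= colsb <= 64):
            raise ValueError("shape exceeds the assumed 16 rows x 64 bytes limit")
        struct.pack_into("<H", cfg, 16 + 2 * tile, colsb)
        cfg[48 + tile] = rows
    return bytes(cfg)

# Three shapes collected from different AMX instructions, configured at once.
config = build_tile_config({0: (16, 64), 1: (16, 64), 2: (8, 32)})
```

The point of the sketch is only that all shapes end up in a single block, so one configuration load can cover every tile used in the region.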

Has some thought gone into how to make the config instruction less expensive?
I have thought for a long time that we need cleverer RAM.
E.g. a single read request that would return 64 bytes, with each
byte taken from a spaced-out address: byte 1, skip 99 bytes,
byte 2, skip 99 bytes, byte 3, and so on.
Or, instead of "read the next instruction", "read the next basic block
in one operation" (a group of instructions).
This would massively reduce the number of transactions between the CPU
and the RAM chips.
It would be the RAM chip itself that performs the operation, not the CPU.
It could also be expanded to have the RAM chip do some simple
computations, e.g. atomic loads/stores/counters/xor/not/xchg, if they
were cheap to do.
Essentially, this would let the RAM chip work more efficiently,
with larger chunks of data per transaction.
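
As a rough illustration of the strided read idea, here is a small Python model. It is only a software sketch of the access pattern, not a description of any existing RAM interface. The function names and the 64-byte line size are assumptions made for the example.

```python
def strided_gather(memory, start, count, stride):
    """Return `count` bytes, one every `stride` bytes, beginning at `start`."""
    return bytes(memory[start + i * stride] for i in range(count))

def lines_touched(start, count, stride, line_size=64):
    """Count the distinct `line_size`-byte lines the same gather would touch
    if each byte had to be fetched with an ordinary line-sized read."""
    return len({(start + i * stride) // line_size for i in range(count)})

# "Byte 1, skip 99 bytes, byte 2, ..." corresponds to a stride of 100.
memory = bytes(range(256)) * 32          # 8192 bytes of sample data
result = strided_gather(memory, 0, 64, 100)
ordinary_reads = lines_touched(0, 64, 100)
```

With a stride of 100 every requested byte falls in a different 64-byte line, so in this model `ordinary_reads` is 64, where the proposed request would be a single transaction returning the 64 gathered bytes.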

Kind Regards

James