[PATCH] D99152: [AMX] Prototype for vector and amx bitcast.

Wed Mar 24 03:27:10 PDT 2021

fhahn added a comment.

In D99152#2644373 <https://reviews.llvm.org/D99152#2644373>, @LuoYuanke wrote:

>> To be honest i don't really understand why `x86_amx` type is even there.
>> It seems to me that if you just directly used `@llvm.x86.tileloadd64.internal` / `@llvm.x86.tilestored64.internal`,
>> and `s/x86_amx/<256 x i32>/`, none of these problems would be here.
>
> I explained in llvm-dev. I copy the content below.
>
> The bitcasts are introduced when the frontend calls the AMX intrinsics. We use a vector to represent a 2D AMX tile in C, but we don't want AMX tiles to mix with other vector operations, so x86_amx was introduced to isolate the AMX intrinsics from normal vector operations. The bitcast marks the point where a normal vector is passed to an AMX intrinsic. In the example below, we need to transform the bitcast into a vector store and an AMX load intrinsic. The x86_amx* pointer was not expected at first, but the middle end generates it during InstCombine.
>
> entry:
>
>   %add = add <256 x i32> %y, %x
>   %t = bitcast <256 x i32> %add to x86_amx
>   call void @llvm.x86.tilestored64.internal(i16 %r, i16 %c, i8* %buf, i64 %s, x86_amx %t)
>   ret void

IIUC you need this to transfer/convert data from a consecutive vector to an `AMX` tile. To express that, emitting an intrinsic for the conversion instead of a `bitcast` seems like the right thing to me.

IIUC Roman was saying that from that example alone it is not clear why the explicit conversion in IR is actually needed (please correct me if I am wrong). For the example, you *could* have a version of `llvm.x86.tilestored64.internal` that takes a `<256 x i32>` and does the conversion internally. Having a separate intrinsic to do the conversion gives greater composability in the IR, but I think at the moment it is hard to judge whether that is needed, because it is not easy to get an overview of all the AMX operations that need support. Is there a summary/documentation of the AMX builtins supported in Clang?

With respect to the `load` issue, it is not clear to me at the moment under which circumstances regular `load` instructions are generated and interact with AMX. If `load` is used to load `x` consecutive elements, then that's fine. But if the actual intended operation is a strided load, then `load` should not be used (this has also been discussed on llvm-dev).
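
To make the consecutive vs. strided distinction concrete, here is a minimal Python sketch. It is a toy model of memory, not LLVM or AMX code: the helper names `consecutive_load` and `strided_load` are made up for illustration, and the stride is counted in elements purely for simplicity.

```python
# Toy model (assumption: flat list as memory, stride measured in elements).

def consecutive_load(mem, base, rows, cols):
    """Read rows*cols adjacent elements, like a plain vector `load`."""
    return mem[base:base + rows * cols]


def strided_load(mem, base, rows, cols, stride):
    """Read `rows` rows of `cols` elements; row i starts at base + i*stride."""
    out = []
    for i in range(rows):
        start = base + i * stride
        out.extend(mem[start:start + cols])
    return out


mem = list(range(32))  # element k simply holds the value k

dense = consecutive_load(mem, 0, 2, 4)          # 8 adjacent elements
tight = strided_load(mem, 0, 2, 4, stride=4)    # stride == row width
padded = strided_load(mem, 0, 2, 4, stride=8)   # stride > row width

print(dense)   # [0, 1, 2, 3, 4, 5, 6, 7]
print(tight)   # [0, 1, 2, 3, 4, 5, 6, 7]
print(padded)  # [0, 1, 2, 3, 8, 9, 10, 11]
```

In this model the two operations agree only when the stride equals the row width. With any larger stride, a plain consecutive load reads different elements than the strided one, which is why a regular `load` would not be a faithful stand-in for a strided tile load.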

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D99152