[PATCH] D99152: [AMX] Prototype for vector and amx bitcast.

Florian Hahn via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Mon Mar 29 07:04:23 PDT 2021


fhahn added a comment.

In D99152#2655691 <https://reviews.llvm.org/D99152#2655691>, @LuoYuanke wrote:

>> I think that point was not really clear during the discussion. Using `load <256 x i32>` to lower `__tile_loadd() ` would indeed be incorrect. But I don't think that's happening at the moment, at least going from a simple example https://gcc.godbolt.org/z/KT5rczn8j
>
> The `load/store <256 x i32>` is generated by the front-end, because in C the tile is a vector `<256 x i32>`. The `load/store <256 x i32>` is transformed to `llvm.x86.tileloadd64.internal`/`llvm.x86.tilestored64.internal` in `lib/Target/X86/X86LowerAMXType.cpp` if the load result is an operand of an AMX intrinsic or the stored value is returned from an AMX intrinsic.

Sure, you can get rid of unnecessary conversions/loads/stores during optimizations, if you can operate on AMX tiles directly and do not need to store the intermediate results in flat vectors. You can also use strided loads/stores when the memory is contiguous (no stride between columns/rows). But the optimizations applied later have no bearing on whether the IR emitted by Clang is correct; it always needs to be correct. Whether further optimizations are correct is a separate problem, but we need a specification for the builtins, intrinsics and the type before going any further in that direction.
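
To make the "transferred through a flat vector" point concrete, here is a sketch (my illustration, not code from the patch; the tile shape and stride values are placeholders) of how an `x86_amx` value can reach a `<256 x i32>` store through memory, using the existing tile store intrinsic, instead of a direct `bitcast`:

  ; Illustrative only: shape (16 x 64) and stride (64) are placeholder values.
  declare x86_amx @llvm.x86.tileloadd64.internal(i16, i16, i8*, i64)
  declare void @llvm.x86.tilestored64.internal(i16, i16, i8*, i64, x86_amx)

  define void @to_flat_vector(i8* %src, <256 x i32>* %dst) {
    ; Load a tile with the strided-load intrinsic.
    %t = call x86_amx @llvm.x86.tileloadd64.internal(i16 16, i16 64, i8* %src, i64 64)
    ; Spill the tile to a stack slot with the tile store intrinsic ...
    %slot = alloca <256 x i32>, align 64
    %p = bitcast <256 x i32>* %slot to i8*
    call void @llvm.x86.tilestored64.internal(i16 16, i16 64, i8* %p, i64 64, x86_amx %t)
    ; ... and reload it as a flat vector, which a plain store can then consume.
    %v = load <256 x i32>, <256 x i32>* %slot, align 64
    store <256 x i32> %v, <256 x i32>* %dst, align 64
    ret void
  }

The optimizer is then free to eliminate the round trip through the stack slot when it can prove it is unnecessary.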

>>   void foo() {
>>     tilea = __builtin_ia32_tileloadd64_internal(16, 64, buf, 64);
>>   }
>>
>> is lowered to
>>
>>   define dso_local void @foo() #0 {
>>     %1 = call x86_amx @llvm.x86.tileloadd64.internal(i16 16, i16 64, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 64)
>>     %2 = bitcast x86_amx %1 to <256 x i32>
>>     store <256 x i32> %2, <256 x i32>* @tilea, align 64
>>     ret void
>>   }
>>
>> So we emit an intrinsic to do the strided load and the result is stored to contiguous memory, which is what the type `_tile1024i` requires. What's not modeled correctly is the conversion between the result of `@llvm.x86.tileloadd64.internal` and the `store`; it needs to be transferred through a flat vector.
>
> Yes.  I agree that it needs to be transferred in a flat vector.
>
>> Whether we should have `x86_amx` in the first place is a separate question I think. Having a builtin type that does not work properly with fundamental instructions like `load` or `store` seems prone for errors (what instructions actually work with `x86_amx`? Do binary operators work?). Perhaps it would be possible and sufficient to have the intrinsics use an opaque type instead of a builtin type, like
>
> We only support tileload, tilestore, tilezero, tiletdp (dot product) instructions/intrinsics for `x86_amx`. Is there any example of LLVM source code that uses an opaque type for a builtin? This example has an error: https://gcc.godbolt.org/z/ar6WhjTMz.

I think you need to set the input to `LLVM IR`: https://gcc.godbolt.org/z/WexMjsas9

You should be able to use opaque types with overloaded intrinsics. I don't think you can define an intrinsic to take a specific opaque type (because it's not known up front).
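
For reference, a minimal sketch of what using an opaque type could look like (the names here are made up for illustration, not an existing LLVM type or intrinsic). Since an opaque type has no known size, values of it can only be passed around behind a pointer, and the contents are only reachable through the functions that understand it:

  ; Hypothetical names for illustration only.
  %my.tile = type opaque

  declare %my.tile* @my.tile.load(i16, i16, i8*, i64)
  declare void @my.tile.store(i16, i16, i8*, i64, %my.tile*)

  define void @copy(i8* %src, i8* %dst) {
    ; The tile is only ever touched through the declared functions;
    ; plain load/store/binary operators cannot be applied to it.
    %t = call %my.tile* @my.tile.load(i16 16, i16 64, i8* %src, i64 64)
    call void @my.tile.store(i16 16, i16 64, i8* %dst, i64 64, %my.tile* %t)
    ret void
  }

This makes it a verifier-level property that nothing except the tile operations can observe or manipulate a tile value.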


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99152/new/

https://reviews.llvm.org/D99152
