[PATCH] D99152: [AMX] Prototype for vector and amx bitcast.

Florian Hahn via Phabricator via cfe-commits <cfe-commits at lists.llvm.org>
Mon Mar 29 04:24:37 PDT 2021


fhahn added a comment.

In D99152#2655371 <https://reviews.llvm.org/D99152#2655371>, @lebedev.ri wrote:

> In D99152#2655357 <https://reviews.llvm.org/D99152#2655357>, @fhahn wrote:
>
>> 

snip

>> I'm not sure the loads and stores are actually incorrect. `_tile1024i` is defined as `typedef int _tile1024i __attribute__((__vector_size__(1024), __aligned__(64)));`, and the loads/stores are for assignments to/reads from variables of that type, which is `<256 x i32>` in IR. So it is not obvious to me why the loads/stores would be wrong, as long as `_tile1024i` is defined as it is (if it were a different type, that would all change).
>
> Didn't @LuoYuanke state:
>
> In D99152#2643821 <https://reviews.llvm.org/D99152#2643821>, @LuoYuanke wrote:
>
>> @lebedev.ri, this patch is mainly for discussing the approach that Florian proposed, so I didn't polish my code. Nevertheless, your comments on amx_cast.c are right. `__tile_loadd()` loads a 2D tile from memory and takes an extra stride parameter. As I explained on llvm-dev, it loads each row from memory into the tile register and then advances base += stride. So the data is not contiguous in memory.
>
> Note the `So the data is not contiguous in memory.`.
> I.e. we won't actually store to the whole `i8` byte area `[ptr + 0, ptr + 4096)`.
> Isn't a plain `load`/`store` defined to operate on a contiguous chunk of memory?
> Am I completely missing the point?

I think that point was not really clear during the discussion. Using a plain `load <256 x i32>` to lower `__tile_loadd()` would indeed be incorrect. But I don't think that's happening at the moment, at least judging from a simple example (https://gcc.godbolt.org/z/KT5rczn8j):

  void foo() {
    tilea = __builtin_ia32_tileloadd64_internal(16, 64, buf, 64);
  }

is lowered to

  define dso_local void @foo() #0 {
    %1 = call x86_amx @llvm.x86.tileloadd64.internal(i16 16, i16 64, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 64)
    %2 = bitcast x86_amx %1 to <256 x i32>
    store <256 x i32> %2, <256 x i32>* @tilea, align 64
    ret void
  }

So we emit an intrinsic to do the strided load, and the result is stored to contiguous memory, which is what the type `_tile1024i` requires. What's not modeled correctly is the conversion between the result of `@llvm.x86.tileloadd64.internal` and the `store`: the tile value needs to be converted to a flat vector rather than reinterpreted via `bitcast`.
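
To make that concrete, here is a sketch of what the example above could look like with an explicit conversion step instead of the `bitcast` (the intrinsic name `@llvm.x86.cast.tile.to.vector.v256i32` is illustrative, along the lines this patch prototypes):

  define dso_local void @foo() #0 {
    ; strided, row-by-row load into a tile register
    %1 = call x86_amx @llvm.x86.tileloadd64.internal(i16 16, i16 64, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 64)
    ; explicit tile-to-vector conversion instead of a plain bitcast
    %2 = call <256 x i32> @llvm.x86.cast.tile.to.vector.v256i32(x86_amx %1)
    ; storing the resulting flat vector to contiguous memory is fine
    store <256 x i32> %2, <256 x i32>* @tilea, align 64
    ret void
  }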

Whether we should have `x86_amx` in the first place is a separate question, I think. Having a builtin type that does not work properly with fundamental instructions like `load` or `store` seems prone to errors (which instructions actually work with `x86_amx`? Do binary operators work?). Perhaps it would be possible and sufficient to have the intrinsics use an opaque type instead of a builtin type, like

  %my_x86_amx = type opaque
  
  define %my_x86_amx @foo(%my_x86_amx %x) {
    ret %my_x86_amx %x
  }

But I think we should address those two issues separately and fix the biggest problem (misuse of `bitcast`) first, perhaps followed by verifier rules rejecting `x86_amx` in unsuitable instructions, and go from there.
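
For illustration, verifier rules along those lines could reject IR like the following, where an `x86_amx` value is moved through a plain `load`/`store` (a hypothetical example of the kind of use that should be diagnosed rather than silently accepted):

  define void @bad(x86_amx* %p, x86_amx* %q) {
    %t = load x86_amx, x86_amx* %p   ; plain load of x86_amx: should be rejected
    store x86_amx %t, x86_amx* %q    ; plain store of x86_amx: should be rejected
    ret void
  }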


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99152/new/

https://reviews.llvm.org/D99152


