[PATCH] D99152: [AMX] Prototype for vector and amx bitcast.

Mon Mar 29 03:44:06 PDT 2021

fhahn added a comment.

In D99152#2655304 <https://reviews.llvm.org/D99152#2655304>, @lebedev.ri wrote:

> In D99152#2655274 <https://reviews.llvm.org/D99152#2655274>, @fhahn wrote:
>
>> In D99152#2649520 <https://reviews.llvm.org/D99152#2649520>, @LuoYuanke wrote:
>>
>>> In D99152#2647681 <https://reviews.llvm.org/D99152#2647681>, @fhahn wrote:
>>>
>>>> I can't see any `load <256 x i32>` in the linked example, just a store. Could you check the example?
>>>
>>> I created another example at https://gcc.godbolt.org/z/v6od5ceEz. In the bar() function, you can see the `load <256 x i32>*` in the IR. The bar() function is actually buggy code, because tilec is not initialized by AMX intrinsics. We want the user to call AMX intrinsics to load/store tiles explicitly. Ideally the front end can detect the issue and report an error.
>>
>> Thanks. AFAIK, in those cases the conversion intrinsic makes sense to use, because you effectively need to convert between two types in a non-trivial way. @lebedev.ri WDYT?
>
> I'm not sure. I think first and foremost the `load`/`store` miscompile should be addressed.
> I think the rest is confusing because it seems to me that the only reason that `bitcast`
> is needed is not correctness, but to act as an opaque optimization barrier.

I'm not sure the loads and stores are actually incorrect. `_tile1024i` is defined as `typedef int _tile1024i __attribute__((__vector_size__(1024), __aligned__(64)));` and the loads/stores come from assignments to and reads from variables of that type, which is `<256 x i32>` in IR. So it is not obvious to me why the loads/stores would be wrong, as long as `_tile1024i` is defined as it is (if it were a different type, that would all change).
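
To make that concrete, here is a minimal sketch in plain C using the GCC/Clang vector extension. The names `tile1024i_sketch`, `copy_tile` and `copy_roundtrip_ok` are made up for illustration; only the typedef attributes are taken from the definition quoted above, and a 4-byte `int` is assumed. It is not meant as the actual header code, just a way to see where a `<256 x i32>` load/store pair would come from:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name; same attributes as the `_tile1024i` typedef quoted above. */
typedef int tile1024i_sketch
    __attribute__((__vector_size__(1024), __aligned__(64)));

/* Assuming a 4-byte int, 1024 bytes is 256 lanes, hence `<256 x i32>` in IR. */
_Static_assert(sizeof(tile1024i_sketch) == 256 * sizeof(int),
               "expected 256 int lanes");

/* An ordinary assignment between two objects of the vector type.
   Clang is expected to emit a vector load and a vector store for it;
   nothing AMX-specific is involved at this level. */
void copy_tile(tile1024i_sketch *dst, const tile1024i_sketch *src) {
    assert(dst != NULL && src != NULL);
    *dst = *src;
}

/* Returns 1 if every lane survives the copy unchanged, 0 otherwise. */
int copy_roundtrip_ok(void) {
    static tile1024i_sketch a, b;
    for (int i = 0; i < 256; ++i)
        a[i] = i;               /* lane-wise write via vector subscripting */
    copy_tile(&b, &a);
    for (int i = 0; i < 256; ++i)
        if (b[i] != i)
            return 0;
    return 1;
}
```

Compiling something like this with `clang -O0 -emit-llvm -S` should show the copy as a load and store of the 256-lane vector type, which is the pattern under discussion.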

As a consequence, `__builtin_ia32_tilezero_internal` and the other builtins need to be defined as returning `_tile1024i` / `<256 x i32>`. I don't think there's any other way to specify this, unless you have a dedicated AMX type in the frontend. IIUC the current lowering calls an intrinsic that returns `x86_amx`, and then a `bitcast` is used to convert to the result type `<256 x i32>`, with the (incorrect) assumption that the `bitcast` performs a complex conversion between types. Another consequence of the builtins returning `_tile1024i` / `<256 x i32>` is that the conversion from the intrinsic result to `<256 x i32>` should happen directly where Clang emits the call to the intrinsic, rather than being patched up later as is done now.
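
For readers less familiar with the distinction: an IR `bitcast` is a no-op cast that leaves the bits unchanged, which is different from an operation that actually converts data. The sketch below is only a loose analogy in plain C, since C has no `x86_amx` type. It contrasts a bit-preserving reinterpretation (what `bitcast` means) with a real conversion, using an int-to-float conversion as a stand-in. All names are hypothetical, and 4-byte `int` and IEEE-754 `float` are assumed:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical 1024-byte vector types, 256 lanes each. */
typedef int   v256i32_sketch __attribute__((__vector_size__(1024)));
typedef float v256f32_sketch __attribute__((__vector_size__(1024)));

/* Analogue of a bitcast: same bits, different type. */
void reinterpret_bits(v256f32_sketch *dst, const v256i32_sketch *src) {
    memcpy(dst, src, sizeof *dst);
}

/* Analogue of a real conversion: each lane's value is converted,
   so the stored bits generally change. */
void convert_values(v256f32_sketch *dst, const v256i32_sketch *src) {
    for (int i = 0; i < 256; ++i)
        (*dst)[i] = (float)(*src)[i];
}

/* Returns 1 if the two objects hold identical bytes. */
int same_bits(const v256f32_sketch *a, const v256i32_sketch *b) {
    return memcmp(a, b, sizeof *a) == 0;
}

/* Returns 1 if reinterpretation leaves every byte unchanged. */
int demo_reinterpret_keeps_bits(void) {
    static v256i32_sketch in;
    static v256f32_sketch out;
    for (int i = 0; i < 256; ++i)
        in[i] = i + 1;
    reinterpret_bits(&out, &in);
    return same_bits(&out, &in);
}

/* Returns 1 if conversion keeps the values but changes the bytes. */
int demo_convert_changes_bits(void) {
    static v256i32_sketch in;
    static v256f32_sketch out;
    for (int i = 0; i < 256; ++i)
        in[i] = i + 1;
    convert_values(&out, &in);
    return !same_bits(&out, &in) && out[0] == 1.0f;
}
```

If going between `<256 x i32>` and `x86_amx` needs more than a reinterpretation of bits, it falls in the second category, and `bitcast` is the wrong tool to express it.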

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99152/new/

https://reviews.llvm.org/D99152