[PATCH] D99152: [AMX] Prototype for vector and amx bitcast.

Wed Mar 24 05:09:31 PDT 2021

LuoYuanke added a comment.

> IIUC you need this to transfer/convert data from a consecutive vector to an `AMX` tile. To express that, emitting an intrinsic for the conversion instead a `bit cast` seems the right thing to me.

Yes. We need to transfer/convert data from a consecutive vector to an `AMX` tile. Because in the C language interface the tile defined as vector. `typedef int _tile1024i __attribute__((__vector_size__(1024), __aligned__(64)));`  Take below code (https://gcc.godbolt.org/z/noaWEWd6n) as an example.

  #include <immintrin.h>

  char buf[1024];
  void foo() {
    _tile1024i tile;
    tile = __builtin_ia32_tileloadd64_internal(16, 64, buf, 64);
  }

Compile it with "clang -S -emit-llvm simple_amx.c -mamx-int8" we got below IR.

  define dso_local void @foo() #0 !dbg !15 {
    %1 = alloca <256 x i32>, align 64
    call void @llvm.dbg.declare(metadata <256 x i32>* %1, metadata !18, metadata !DIExpression()), !dbg !25
    %2 = call x86_amx @llvm.x86.tileloadd64.internal(i16 16, i16 64, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 64), !dbg !26
    %3 = bitcast x86_amx %2 to <256 x i32>, !dbg !26
    store <256 x i32> %3, <256 x i32>* %1, align 64, !dbg !27
    ret void, !dbg !28
  }

Front-end alloca <256 x i32> for the local variable tile. When the return value of __builtin_ia32_tileloadd64_internal is assigned to tile. Front-end bitcast x86_amx to <256 x i32>. The x86_amx is the type returned from __builtin_ia32_tileloadd64_internal.

> IIUC Roman was saying that from that example alone it is not clear why the explicit conversion in IR is actually needed (please correct me if I am wrong). For the example, you *could* have a version of `llvm.x86.tilestored64.internal` that takes an `<256 x i32>` and does the conversion internally. Having a separate intrinsic to do the conversion gives greater composability in the IR, but I think at the moment it is hard to judge if that is needed, because it is not easy to get an overview of all AMX operations that need support. Is there a summary/documentation of the AMX builtins supported in Clang?

I plan to add AMX operation to Clang doc when the AMX support in LLVM is stable. There are only load/store, zero, dotproduct operations for AMX. We don't have full ISA support to matrix operation.

  __builtin_ia32_tileloadd64_internal
  __builtin_ia32_tdpbssd_internal
  __builtin_ia32_tilestored64_internal
  __builtin_ia32_tilezero_internal

> With respect to the `load` issue, it is not clear to me at the moment under which circumstances regular `load` instructions are generated & interact with AMX. If `load` is used to load `x` consecutive elements, than that's fine. But if the actual intended operation is a strided load, then `load` should not be used (this has also been discussed on llvm-dev).

The `load` instructions are generated because it is a vector in C language. See https://gcc.godbolt.org/z/qv5jnjK48. If we use -O0, there is load instruction generated. If we use -O2, the load instruction is eliminated. The -O2 version is what we want. There is no <256 x i32> in the generated code.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99152/new/

https://reviews.llvm.org/D99152