[PATCH] D107544: [X86] [AMX] Replace bitcast with specific AMX intrinsics with X86 specific cast.

LuoYuanke via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Aug 16 23:19:52 PDT 2021


LuoYuanke added inline comments.


================
Comment at: llvm/test/CodeGen/X86/AMX/lat-combine-amx-bitcast.ll:106
+for.body.i.lr.ph.i:                               ; preds = %wrapper_entry
+  %1 = call x86_amx @llvm.x86.cast.vector.to.tile.v110i32(<110 x i32> undef)
+  %2 = call x86_amx @llvm.x86.cast.vector.to.tile.v616i8(<616 x i8> undef)
----------------
We can optimize an undef or zero vector cast to tilezero. We may do it in another patch. A rough sketch is below.
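Something along these lines (the %row/%col shape operands are made up for illustration; the real ones would come from the surrounding shape information):

  ; current form
  %1 = call x86_amx @llvm.x86.cast.vector.to.tile.v110i32(<110 x i32> undef)
  ; possible replacement
  %1 = call x86_amx @llvm.x86.tilezero.internal(i16 %row, i16 %col)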


================
Comment at: llvm/test/CodeGen/X86/AMX/lat-combine-amx-bitcast.ll:152
+; CHECK:       for.cond.cleanup.i.i:
+; CHECK-NEXT:    [[GOODPHI:%.*]] = phi <110 x i32> [ [[TMP8]], [[WRAPPER_ENTRY:%.*]] ], [ [[TMP17]], [[FOR_BODY_I_LR_PH_I]] ]
+; CHECK-NEXT:    [[TMP18:%.*]] = bitcast <110 x i32>* [[TMP0]] to i8*
----------------
This may be optimized to a phi of x86_amx, with a cast back to vector for its user %evilphi2. But we can do that in another patch. A rough sketch is below.
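Roughly something like this (the incoming value names are illustrative), with %evilphi2 then consuming the cast-back vector:

  for.cond.cleanup.i.i:
    %goodphi = phi x86_amx [ %amx.a, %wrapper_entry ], [ %amx.b, %for.body.i.lr.ph.i ]
    %vec = call <110 x i32> @llvm.x86.cast.tile.to.vector.v110i32(x86_amx %goodphi)
    ; %evilphi2 uses %vec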


================
Comment at: llvm/test/CodeGen/X86/AMX/lat-combine-amx-bitcast.ll:399
+for.cond.cleanup.i.i:                             ; preds = %for.body.i.lr.ph.i, %wrapper_entry
+  %sub_c.sroa.0.0.i.lcssa.i = phi <110 x i32> [ %tmp, %wrapper_entry ], [ %7, %for.body.i.lr.ph.i ]
+  %8 = call x86_amx @llvm.x86.cast.vector.to.tile.v110i32(<110 x i32> %sub_c.sroa.0.0.i.lcssa.i)
----------------
Could you simplify the name?


================
Comment at: llvm/test/CodeGen/X86/AMX/lat-transform-amx-bitcast.ll:141
+; CHECK-NEXT:    call void @llvm.x86.tilestored64.internal(i16 [[TMP6]], i16 [[TMP8]], i8* [[TMP12]], i64 [[TMP13]], x86_amx [[TMP11]])
+; CHECK-NEXT:    [[TMP14:%.*]] = load <256 x i32>, <256 x i32>* [[TMP4]], align 1024
+; CHECK-NEXT:    [[TMP15:%.*]] = getelementptr inbounds [[STRUCT___TILE_STR]], %struct.__tile_str* [[TMP0]], i64 0, i32 2
----------------
We can forward llvm.x86.tilestored64.internal to the final store (line 143). We can optimize it in another patch. A sketch is below.
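That is, instead of storing the tile to a temporary buffer, loading it back as <256 x i32>, and storing that to the destination, we could tilestored directly to the destination. Roughly (the shape and stride operands here are illustrative):

  ; current form
  call void @llvm.x86.tilestored64.internal(i16 %r, i16 %c, i8* %buf, i64 %stride, x86_amx %t)
  %v = load <256 x i32>, <256 x i32>* %bufptr, align 1024
  store <256 x i32> %v, <256 x i32>* %dst, align 64
  ; possible form after forwarding the tile store
  %dst.i8 = bitcast <256 x i32>* %dst to i8*
  call void @llvm.x86.tilestored64.internal(i16 %r, i16 %c, i8* %dst.i8, i64 64, x86_amx %t)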


================
Comment at: llvm/test/CodeGen/X86/AMX/lat-transform-amx-bitcast.ll:291
+;
+  %t0 = load <256 x i32>, <256 x i32>* %pa, align 64
+  %t1 = call x86_amx @llvm.x86.cast.vector.to.tile.v256i32(<256 x i32> %t0)
----------------
We can combine the load and cast into @llvm.x86.tileloadd64.internal. A sketch is below.
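A sketch of the combined form (the %row/%col shape operands and the stride of 64 are illustrative):

  ; current form
  %t0 = load <256 x i32>, <256 x i32>* %pa, align 64
  %t1 = call x86_amx @llvm.x86.cast.vector.to.tile.v256i32(<256 x i32> %t0)
  ; combined form
  %pa.i8 = bitcast <256 x i32>* %pa to i8*
  %t1 = call x86_amx @llvm.x86.tileloadd64.internal(i16 %row, i16 %col, i8* %pa.i8, i64 64)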


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D107544/new/

https://reviews.llvm.org/D107544


