[PATCH] D91927: [X86] Add x86_amx type for intel AMX.
Pengfei Wang via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Dec 21 05:46:04 PST 2020
pengfei added inline comments.
================
Comment at: llvm/lib/IR/ConstantFold.cpp:539
+ if (V->isNullValue() && !DestTy->isX86_MMXTy() && !DestTy->isX86_AMXTy()
+ && opc != Instruction::AddrSpaceCast)
return Constant::getNullValue(DestTy);
----------------
The operator should be at the end of the line.
================
Comment at: llvm/lib/Target/X86/X86LowerAMXType.cpp:55
+ switch (II->getIntrinsicID()) {
+ default: {
+ Row = II->getArgOperand(0);
----------------
I think we'd better check for the unexpected cases explicitly, e.g.
```
default:
  llvm_unreachable("Unexpected intrinsic!");
case Intrinsic::x86_tileloadd64_internal:
case Intrinsic::x86_tdpbssd_internal:
case Intrinsic::x86_tilestored64_internal:
Row = II->getArgOperand(0);
Col = II->getArgOperand(1);
break;
```
================
Comment at: llvm/lib/Target/X86/X86LowerAMXType.cpp:114
+ Value *Row = nullptr, *Col = nullptr;
+ Use &U = *(Bitcast->use_begin());
+ unsigned OpNo = U.getOperandNo();
----------------
Why not check for an empty use list, like on line 157?
================
Comment at: llvm/lib/Target/X86/X86LowerAMXType.cpp:193
+ // %2 = load <256 x i32>, <256 x i32>* %addr, align 1024
+ auto *II = cast<IntrinsicInst>(Src);
+ Value *Row = II->getOperand(0);
----------------
Is it possible that the x86_amx operand doesn't come from an AMX intrinsic? E.g.
```
%src = bitcast <256 x i32> %xxx to x86_amx
%2 = bitcast x86_amx %src to <256 x i32>
```
================
Comment at: llvm/lib/Target/X86/X86LowerAMXType.cpp:233
bool X86LowerAMXType::visit() {
bool C;
+ SmallVector<Instruction *, 8> DeadInsts;
----------------
Better to move it to line 310.
================
Comment at: llvm/lib/Target/X86/X86LowerAMXType.cpp:238
for (Instruction &Inst : BB) {
- LoadInst *LD = dyn_cast<LoadInst>(&Inst);
- // Check load instruction.
- // %3 = load <256 x i32>, <256 x i32>* %1, align 64
- if (LD) {
- FixedVectorType *VTy = dyn_cast<FixedVectorType>(Inst.getType());
- if (!IsAMXType(VTy))
- continue;
- LDSet.insert(&Inst);
+ if (!dyn_cast<BitCastInst>(&Inst))
continue;
----------------
Better to reuse the cast result, e.g.
```
BitCastInst *BInst = dyn_cast<BitCastInst>(&Inst);
if (!BInst)
  continue;
```
That saves several `cast<BitCastInst>(&Inst)` calls below.
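The pattern being suggested can be sketched generically as follows. This is a hypothetical stand-in hierarchy using plain `dynamic_cast`; the real code uses `llvm::Instruction`, `llvm::BitCastInst`, and LLVM's own `dyn_cast`, but the point is the same: cast once, test the result, then reuse the typed pointer instead of re-casting.

```cpp
// Hypothetical stand-in types; the real code uses llvm::Instruction,
// llvm::BitCastInst and llvm::dyn_cast, which this sketch only mimics.
struct Instruction {
  virtual ~Instruction() = default;
};
struct BitCastInst : Instruction {
  int OperandNo = 0;
};

// Cast once, test the result, and reuse it for the rest of the loop
// body -- no repeated cast<BitCastInst>(&Inst) later on.
int visitInst(Instruction &Inst) {
  auto *BInst = dynamic_cast<BitCastInst *>(&Inst);
  if (!BInst)
    return -1; // corresponds to `continue` in the original loop
  return BInst->OperandNo; // reuse BInst directly
}
```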
================
Comment at: llvm/lib/Target/X86/X86LowerAMXType.cpp:265
+ // If the dst type is <256 x i32>*, it is valid intruction.
+ // %0 = bitcast x86_amx* %tile to <256 x i32>*
+ // %1 = load <256 x i32>, <256 x i32>* %0, align 64
----------------
Where does `x86_amx* %tile` come from? Shouldn't it have been transformed to `x86_amx` before this bitcast, if it exists?
================
Comment at: llvm/lib/Target/X86/X86LowerAMXType.cpp:271
+ LoadInst *LD = dyn_cast<LoadInst>(Src);
+ if (!LD || !LD->hasOneUser()) {
+ transformBitcast(cast<BitCastInst>(&Inst));
----------------
Maybe it's better to keep a duplicated `load` than to call `transformBitcast`. The same for line 285.
================
Comment at: llvm/lib/Target/X86/X86LowerAMXType.cpp:286
+ if (!Inst.hasOneUser()) {
+ transformBitcast(cast<BitCastInst>(&Inst));
+ DeadInsts.push_back(&Inst);
----------------
Why do we need to consider the case where the <256 x i32> has more than one use?
================
Comment at: llvm/test/CodeGen/X86/AMX/amx-across-func.ll:89-91
attributes #2 = { nounwind uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "frame-pointer"="none" "less-precise-fpmad"="false" "min-legal-vector-width"="8192" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+amx-int8,+amx-tile,+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #3 = { "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "frame-pointer"="none" "less-precise-fpmad"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+amx-int8,+amx-tile,+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #4 = { nounwind }
----------------
Better to remove these unused attributes. The same for the other tests.
================
Comment at: llvm/test/CodeGen/X86/AMX/amx-type.ll:67
+
+define dso_local void @test_src_add(<256 x i32> %x, <256 x i32> %y, i16 %r, i16 %c, i8* %buf, i64 %s) #2 {
+; CHECK-LABEL: @test_src_add(
----------------
For this and the next test, we have a chance to optimize them to a memcpy if we can make sure %s is the constant 64.
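For intuition: when the stride equals the row width, the rows of the tile are contiguous in memory, so the row-by-row copy degenerates into one large memcpy. A minimal sketch, assuming the constant shape implied by <256 x i32> (16 rows of 64 bytes, stride 64 -- these sizes are illustrative, not taken from the patch):

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical sizes: a 16-row x 64-byte tile copied with a 64-byte
// stride covers one contiguous 1024-byte range.
constexpr std::size_t Rows = 16, RowBytes = 64, Stride = 64;

// Row-by-row copy, as the strided tile load/store would do it.
void tileCopy(char *Dst, const char *Src) {
  for (std::size_t R = 0; R < Rows; ++R)
    std::memcpy(Dst + R * Stride, Src + R * Stride, RowBytes);
}

// With Stride == RowBytes, the loop above is equivalent to a single
// contiguous memcpy of Rows * RowBytes bytes.
bool sameAsMemcpy(const char *Src) {
  char A[Rows * Stride], B[Rows * Stride];
  tileCopy(A, Src);
  std::memcpy(B, Src, Rows * Stride); // single contiguous copy
  return std::memcmp(A, B, Rows * Stride) == 0;
}
```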
================
Comment at: llvm/test/CodeGen/X86/AMX/amx-type.ll:103
+
define dso_local void @test_load(i8* %in, i8* %out) local_unnamed_addr #2 {
; CHECK-LABEL: @test_load(
----------------
We don't need to check this case now, right?
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D91927/new/
https://reviews.llvm.org/D91927