[PATCH] D98081: [AMDGPU] Improve Codegen for build_vector

Wed Mar 24 04:23:01 PDT 2021

foad added a comment.

Please fix the test first, and then we can iterate on the code changes required to select v_pack_b32_f16.

As a general comment I was hoping that you could still implement this using patterns in SIInstructions.td, but using a PatFrag like HasOneUseUnaryOp to call into some C++ code just to check the isCanonicalized condition. The advantage of this is that it should work for GlobalISel as well as SelectionDAG (as long as you set GISelPredicateCode in the PatFrag).

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp:2690-2691
+  Match this pattern when it's safe to do so
+  (v2f16 (build_vector (f16 (bitconvert (i16 (trunc VGPR_32:$src0)))),
+                       (f16 (bitconvert (i16 (trunc VGPR_32:$src1)))))),
+  (V_PACK_B32_F16_e64 SRCMODS.NONE, VGPR_32:$src0, SRCMODS.NONE, VGPR_32:$src1)
----------------
"bitcast" not "bitconvert".

================
Comment at: llvm/lib/Target/AMDGPU/SIISelLowering.cpp:9633
       }
-    }
+    } else if (Op.getValueType() == MVT::i32 && Src.getValueType() == MVT::f32)
+      return isCanonicalized(DAG, Src, MaxDepth - 1);
----------------
If we're going to define isCanonicalized on integer types (which seems reasonable to me) then how about changing this to just:
```
  case ISD::BITCAST:
    return isCanonicalized(DAG, Src, MaxDepth - 1);
```
... and moving the "hack round the mess" code into `case ISD::TRUNCATE`?

================
Comment at: llvm/test/CodeGen/AMDGPU/v_pack.ll:2
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=gfx1010 -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefix=GCN %s
+
----------------
Maybe add a second RUN line with "llc -global-isel ..."?

================
Comment at: llvm/test/CodeGen/AMDGPU/v_pack.ll:30-35
+  %val0 = bitcast float %v0.mul to i32
+  %val1 = bitcast float %v1.add to i32
+  %lo.i = trunc i32 %val0 to i16
+  %hi.i = trunc i32 %val1 to i16
+  %lo = bitcast i16 %lo.i to half
+  %hi = bitcast i16 %hi.i to half
----------------
I think you are testing the wrong thing here. The result of "fadd float" is canonicalized, but if you extract the low 16 bits and bitcast it to half, then you get a meaningless value that is not canonicalized.

Instead you should use two "fadd half" instructions and insert the results into a <2 x half>, with no bitcasting.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D98081/new/

https://reviews.llvm.org/D98081