[PATCH] D105263: [X86] AVX512FP16 instructions enabling 1/6

Pengfei Wang via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Aug 6 09:12:15 PDT 2021


pengfei marked 7 inline comments as done.
pengfei added a comment.

Thanks Yuanke.



================
Comment at: clang/lib/Headers/avx512fp16intrin.h:292
+
+  return (__m128h)__builtin_ia32_loadsh128_mask((__v8hf *)__A, src, __U & 1);
+}
----------------
LuoYuanke wrote:
> Just curious: why not use __W directly?
First, this simply mimics `_mm_mask_load_ss`.
I think the reason is that the intrinsic's specification requires `dst[MAX:16] := 0`, while the builtin returns with `src[MAX:16]` preserved.
So we need to explicitly clear the upper bits of the source first.
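
For reference, the `src` above the quoted return is built roughly like this (a sketch mirroring `_mm_mask_load_ss`; the function attributes are elided):

  static __inline__ __m128h
  _mm_mask_load_sh(__m128h __W, __mmask8 __U, const void *__A) {
    // Zero elements 1..7 of the merge source; only element 0 may be
    // taken from __W when the mask bit is clear, so dst[MAX:16] := 0
    // holds for the final result.
    __m128h src = (__m128h)__builtin_shufflevector(
        (__v8hf)__W, (__v8hf)_mm_setzero_ph(), 0, 8, 8, 8, 8, 8, 8, 8);
    return (__m128h)__builtin_ia32_loadsh128_mask((__v8hf *)__A, src, __U & 1);
  }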


================
Comment at: clang/lib/Headers/avx512fp16intrin.h:319
+    __m512h_u __v;
+  } __attribute__((__packed__, __may_alias__));
+  return ((const struct __loadu_ph *)__p)->__v;
----------------
LuoYuanke wrote:
> What is __may_alias__ used for?
This is used to exempt the access from type-based alias analysis (TBAA). See:
https://gcc.gnu.org/onlinedocs/gcc/Common-Type-Attributes.html#Common-Type-Attributes

"In the context of section 6.5 paragraph 7 of the C99 standard, an lvalue expression dereferencing such a pointer is treated like having a character type."
"This extension exists to support some vector APIs, in which pointers to one vector type are permitted to alias pointers to a different vector type."


================
Comment at: clang/lib/Headers/avx512fp16intrin.h:350
+                                                               __m128h __A) {
+  __builtin_ia32_storesh128_mask((__v8hf *)__W, __A, __U & 1);
+}
----------------
LuoYuanke wrote:
> I see that in _mm_mask_load_sh() we create a __m128h with the upper bits zeroed; not sure whether we also need that in the store intrinsic.
Both the load and store intrinsics access only 16 bits of memory; the difference is that the load intrinsic needs to set the high bits of the XMM register (because it returns a 128-bit result). We don't need to do that for a store.
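
A short usage contrast (assuming the load/store intrinsics from this patch; both touch only 2 bytes of memory):

  #include <immintrin.h>

  void roundtrip(_Float16 *p) {
    // Load: produces a whole XMM register, so elements 1..7 must be
    // defined -- the intrinsic zeroes them.
    __m128h v = _mm_maskz_load_sh(1, p);
    // Store: writes only the low 16 bits when the mask bit is set;
    // there is no register-width result, hence nothing to zero.
    _mm_mask_store_sh(p, 1, v);
  }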


================
Comment at: clang/lib/Headers/avx512fp16intrin.h:419
+static __inline__ short __DEFAULT_FN_ATTRS128 _mm_cvtsi128_si16(__m128i __a) {
+  __v8hi __b = (__v8hi)__a;
+  return __b[0];
----------------
LuoYuanke wrote:
> Why not return __a[0] directly?
Because `__m128i` is defined as `<2 x i64>`, `__a[0]` would be the low 64-bit element, not a 16-bit one. We have to reinterpret it as `__v8hi` first.
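
A small contrast, using Clang's vector-subscript extension:

  #include <immintrin.h>

  long long low64(__m128i a) {
    return a[0];           // __m128i is <2 x i64>: element 0 is 64 bits
  }

  short low16(__m128i a) {
    __v8hi b = (__v8hi)a;  // reinterpret the same bits as <8 x i16>
    return b[0];           // now element 0 is the low 16 bits
  }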


================
Comment at: clang/test/CodeGen/X86/avx512fp16-abi.c:89
+  _Float16 a;
+  float b;
+};
----------------
LuoYuanke wrote:
> Is there any negative test case that has padding between a and b?
This is the one with padding: `_Float16` is 2-byte aligned while `float` is 4-byte aligned, so there are 2 bytes of padding between `a` and `b`.
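
A quick way to see the layout on x86-64 (illustrative, outside the test itself):

  #include <stddef.h>

  struct S {
    _Float16 a; // size 2, align 2 -> offset 0
    float b;    // size 4, align 4 -> offset 4
  };

  _Static_assert(offsetof(struct S, b) == 4, "2 padding bytes after a");
  _Static_assert(sizeof(struct S) == 8, "2 + 2 (pad) + 4");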


================
Comment at: llvm/include/llvm/IR/Intrinsics.td:315
 def llvm_v8f16_ty      : LLVMType<v8f16>;    //  8 x half (__fp16)
+def llvm_v16f16_ty     : LLVMType<v16f16>;   // 16 x half (__fp16)
+def llvm_v32f16_ty     : LLVMType<v32f16>;   // 32 x half (__fp16)
----------------
LuoYuanke wrote:
> Not sure about the legacy comments; should it be _Float16 now?
LLVM IR serves more than one frontend type: `__fp16` is still usable in Clang, the OpenCL half type also lowers to `half` in IR, and there may be other frontend types as well. So I'd like to keep the comment as is unless we have a better way to cover all the frontend types.


================
Comment at: llvm/include/llvm/Target/TargetSelectionDAG.td:1054
+def extloadvf16 : PatFrag<(ops node:$ptr), (extload node:$ptr)> {
+  let IsLoad = 1;
+  let ScalarMemoryVT = f16;
----------------
LuoYuanke wrote:
> I notice it is `true` for other extloads. Is `1` the same as `true`?
Good catch. I noticed it too, but forgot to change it.


================
Comment at: llvm/lib/Target/X86/Disassembler/X86Disassembler.cpp:341
     if ((insn->mode == MODE_64BIT || (byte1 & 0xc0) == 0xc0) &&
-        ((~byte1 & 0xc) == 0xc) && ((byte2 & 0x4) == 0x4)) {
+        ((~byte1 & 0x8) == 0x8) && ((byte2 & 0x4) == 0x4)) {
       insn->vectorExtensionType = TYPE_EVEX;
----------------
LuoYuanke wrote:
> This is the same as ((byte1 & 0x8) == 0x0)?
Yes, but I'm not sure whether the original form is intentional. Maybe it keeps the `(value & X) == X` shape of the neighboring checks?
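
A quick exhaustive check over one byte confirms the two forms agree:

  #include <assert.h>

  int main(void) {
    for (int byte1 = 0; byte1 < 256; ++byte1) {
      // (~byte1 & 0x8) == 0x8 tests that bit 3 of byte1 is clear,
      // exactly like (byte1 & 0x8) == 0.
      assert(((~byte1 & 0x8) == 0x8) == ((byte1 & 0x8) == 0));
    }
    return 0;
  }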


================
Comment at: llvm/lib/Target/X86/X86.td:189
+// guarded under condition hasVLX. So we imply it in FeatureFP16 currently.
+// FIXME: FP16 conversion between f16 and i64 customise type v8i64, which is
+// supposed to be guarded under condition hasDQI. So we imply it in FeatureFP16
----------------
LuoYuanke wrote:
> customize?
`customise` seems correct too; it's the British spelling. Anyway, I can change it.


================
Comment at: llvm/lib/Target/X86/X86ISelLowering.cpp:10549
 
-      if (EltVT == MVT::i32 || EltVT == MVT::f32 || EltVT == MVT::f64 ||
-          (EltVT == MVT::i64 && Subtarget.is64Bit())) {
+      if (EltVT == MVT::i32 || EltVT == MVT::f16 || EltVT == MVT::f32 ||
+          EltVT == MVT::f64 || (EltVT == MVT::i64 && Subtarget.is64Bit()) ||
----------------
LuoYuanke wrote:
> Do we need to check Subtarget.hasFP16()?
No. `f16` is only a legal type when the feature is enabled, so legality here already implies `hasFP16()`.


================
Comment at: llvm/lib/Target/X86/X86ISelLowering.cpp:10551
+          EltVT == MVT::f64 || (EltVT == MVT::i64 && Subtarget.is64Bit()) ||
+          (EltVT == MVT::i16 && Subtarget.hasFP16())) {
         assert((VT.is128BitVector() || VT.is256BitVector() ||
----------------
LuoYuanke wrote:
> Why handle i16? Isn't it handled by movw?
No, there is no `movw` that moves a GPR into an XMM register; AVX512FP16 adds `vmovw` for that, which is why this is guarded by `hasFP16()`.


================
Comment at: llvm/lib/Target/X86/X86ISelLowering.cpp:10744
   // For SSE 4.1, use insertps to put the high elements into the low element.
-  if (Subtarget.hasSSE41()) {
+  if (Subtarget.hasSSE41() && EltVT != MVT::f16) {
     SDValue Result;
----------------
LuoYuanke wrote:
> Why exclude f16? Is there a better choice for fp16?
We prefer lowering through a shuffle rather than insert_vector_elt here, because there is no insert instruction (like `insertps`) for the half type.



================
Comment at: llvm/lib/Target/X86/X86InstrAVX512.td:3878
+}
+let Predicates = [HasFP16, HasVLX] in {
+  def : Pat<(v16f16 (vselect VK16WM:$mask, (v16f16 VR256X:$src1), (v16f16 VR256X:$src0))),
----------------
LuoYuanke wrote:
> Not sure whether this can be merged with the 512-bit version's load/store patterns via a multiclass that abstracts the type info.
I think it is probably feasible. We could add a codegen-only opcode to reuse the VMOVDQU instruction definition.
But that may need careful tuning, so I think we can do it as a follow-up.


================
Comment at: llvm/lib/Target/X86/X86InstrAVX512.td:4159
+defm VMOVSHZ : avx512_move_scalar<"vmovsh", X86Movsh, X86vzload16, f16x_info,
+                                  [HasFP16]>,
+                                  VEX_LIG, T_MAP5XS, EVEX_CD8<16, CD8VT1>;
----------------
LuoYuanke wrote:
> Why is there no OptForSize for vmovsh?
Good catch. I think we should add it here.


================
Comment at: llvm/lib/Target/X86/X86InstrAVX512.td:4478
+  let Predicates = [HasFP16] in {
+    def VMOVSHZrr_REV: AVX512<0x11, MRMDestReg, (outs VR128X:$dst),
+        (ins VR128X:$src1, VR128X:$src2),
----------------
LuoYuanke wrote:
> Sorry, I forgot what REV stands for. Do you know?
> Is this just an encoding difference for the register operands compared with VMOVSHZrr? What is it used for?
I think REV is short for "reversed": it provides the alternative encoding in which the operand order is reversed (here opcode 0x11 with the destination-register form, MRMDestReg).
Yes, it exists purely to allow that different encoding.


================
Comment at: llvm/lib/Target/X86/X86RegisterInfo.td:570
 def VR64: RegisterClass<"X86", [x86mmx], 64, (sequence "MM%u", 0, 7)>;
-def VR128 : RegisterClass<"X86", [v4f32, v2f64, v16i8, v8i16, v4i32, v2i64, f128],
+def VR128 : RegisterClass<"X86", [v4f32, v2f64, v8f16, v16i8, v8i16, v4i32, v2i64, f128],
                           128, (add FR32)>;
----------------
LuoYuanke wrote:
> Given there are only EVEX instructions for fp16, is it necessary to add the f16 type here?
I think so. For example, v8f16 values may pass through i16 instructions that are, or may eventually be turned into, AVX2 ones. Having the type in VR128 helps there, since VR128 is a subset of VR128X.


================
Comment at: llvm/test/CodeGen/X86/vector-reduce-fmax-nnan.ll:374
+; SSE-NEXT:    movl %edi, %ebp
+; SSE-NEXT:    movzwl %bx, %edi
 ; SSE-NEXT:    callq __gnu_h2f_ieee at PLT
----------------
LuoYuanke wrote:
> Why does this test case change? Shall we add a run with -mattr=+avx512fp16?
Because we allowed a combine after X86ISelLowering.cpp:41180 without checking the feature.
Although the code generated here seems correct and even better, I'll add the feature check to avoid any confusion.
We do have tests for avx512fp16 in D105264.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D105263/new/

https://reviews.llvm.org/D105263


