[PATCH] D107082: [X86][RFC] Enable `_Float16` type support on X86 following the psABI

Thu Jun 9 09:04:04 PDT 2022

pengfei added inline comments.

================
Comment at: llvm/test/Analysis/CostModel/X86/fptoi_sat.ll:852
+; SSE2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %f16u1 = call i1 @llvm.fptoui.sat.i1.f16(half undef)
+; SSE2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %f16s8 = call i8 @llvm.fptosi.sat.i8.f16(half undef)
+; SSE2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %f16u8 = call i8 @llvm.fptoui.sat.i8.f16(half undef)
----------------
LuoYuanke wrote:
> It seems the cost is reduced in general. Is it because we pass/return f16 by xmm register?
No. It's because we don't have cost model for `f16`. I added some in D127386 to address this.

================
Comment at: llvm/test/CodeGen/MIR/X86/inline-asm-registers.mir:31
   ; CHECK-LABEL: name: test
-  ; CHECK: INLINEASM &foo, 0 /* attdialect */, 4390922 /* regdef:GR64 */, def $rsi, 4390922 /* regdef:GR64 */, def dead $rdi,
-    INLINEASM &foo, 0, 4390922, def $rsi, 4390922, def dead $rdi, 2147549193, killed $rdi, 2147483657, killed $rsi, 12, implicit-def dead early-clobber $eflags
+  ; CHECK: INLINEASM &foo, 0 /* attdialect */, 4456458 /* regdef:GR64 */, def $rsi, 4456458 /* regdef:GR64 */, def dead $rdi,
+    INLINEASM &foo, 0, 4456458, def $rsi, 4456458, def dead $rdi, 2147549193, killed $rdi, 2147483657, killed $rsi, 12, implicit-def dead early-clobber $eflags
----------------
LuoYuanke wrote:
> Why f16 patch affect this test case? There is no fp instruction in this test case.
I this it's newly added `FR16` that affects all number the other number register class. We met the problem when enabling FP16 too.

================
Comment at: llvm/test/CodeGen/X86/atomic-non-integer.ll:253
+; X64-SSE-NEXT:    movzwl (%rdi), %eax
+; X64-SSE-NEXT:    pinsrw $0, %eax, %xmm0
+; X64-SSE-NEXT:    retq
----------------
LuoYuanke wrote:
> I notice X86-SSE1 return by GPR. Should we also return by GPR for X64-SSE?
No. The result in X86-SSE in UB. We support the emulation on SSE2 and later.

================
Comment at: llvm/test/CodeGen/X86/avx512-insert-extract.ll:2307
+; SKX-NEXT:    vmovd %ecx, %xmm0
+; SKX-NEXT:    vcvtph2ps %xmm0, %xmm0
+; SKX-NEXT:    vmovss %xmm0, %xmm0, %xmm0 {%k2} {z}
----------------
LuoYuanke wrote:
> Is code less efficient than previous code? Why previous code still works without convert half to float?
Yes. The previous code using `i16` for FP16. Improved, thanks!

================
Comment at: llvm/test/CodeGen/X86/callbr-asm-bb-exports.ll:20
 ; CHECK-NEXT: t22: ch,glue = CopyToReg t17, Register:i32 %5, t8
-; CHECK-NEXT: t30: ch,glue = inlineasm_br t22, TargetExternalSymbol:i64'xorl $0, $0; jmp ${1:l}', MDNode:ch<null>, TargetConstant:i64<8>, TargetConstant:i32<2293769>, Register:i32 %5, TargetConstant:i64<13>, TargetBlockAddress:i64<@test, %fail> 0, TargetConstant:i32<12>, Register:i32 $df, TargetConstant:i32<12>, Register:i16 $fpsw, TargetConstant:i32<12>, Register:i32 $eflags, t22:1
+; CHECK-NEXT: t30: ch,glue = inlineasm_br t22, TargetExternalSymbol:i64'xorl $0, $0; jmp ${1:l}', MDNode:ch<null>, TargetConstant:i64<8>, TargetConstant:i32<2359305>, Register:i32 %5, TargetConstant:i64<13>, TargetBlockAddress:i64<@test, %fail> 0, TargetConstant:i32<12>, Register:i32 $df, TargetConstant:i32<12>, Register:i16 $fpsw, TargetConstant:i32<12>, Register:i32 $eflags, t22:1

----------------
LuoYuanke wrote:
> Why this test is affacted? Is it caused by calling convention change?
No. It's caused by newly added `FR16` register class.

================
Comment at: llvm/test/CodeGen/X86/fmf-flags.ll:115
-; X64-NEXT:    movzwl %di, %edi
-; X64-NEXT:    callq __gnu_h2f_ieee at PLT
 ; X64-NEXT:    mulss {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
----------------
LuoYuanke wrote:
> Does __gnu_h2f_ieee retrun from xmm?
There does not exist a `__gnu_h2f_ieee` on X86 before. It's ARM/AArch64 specific.

================
Comment at: llvm/test/CodeGen/X86/fpclamptosat.ll:569
 ; CHECK-NEXT:    cvttss2si %xmm0, %rax
 ; CHECK-NEXT:    ucomiss {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; CHECK-NEXT:    movabsq $-9223372036854775808, %rcx # imm = 0x8000000000000000
----------------
LuoYuanke wrote:
> I'm curious why there is 1 more compare in this patch. 
It's an optimization implemented by D111976. We don't meet the requirment that `isOperationLegalOrCustom`. It's not easy to solve because we need to check the promoted type instead. I'll leave it as is.

================
Comment at: llvm/test/CodeGen/X86/fpclamptosat.ll:776
+; CHECK-NEXT:    cmovael %eax, %ecx
+; CHECK-NEXT:    ucomiss {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; CHECK-NEXT:    movl $2147483647, %edx # imm = 0x7FFFFFFF
----------------
LuoYuanke wrote:
> Ditto.
The same as above.

================
Comment at: llvm/test/CodeGen/X86/fpclamptosat_vec.ll:605
+; CHECK-NEXT:    .cfi_def_cfa_offset 80
+; CHECK-NEXT:    movss %xmm2, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
+; CHECK-NEXT:    movss %xmm1, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
----------------
LuoYuanke wrote:
> Is the vector <4 x half> split to 4 scalar and pass by xmm? What's the ABI for vector half? Is there any case that test the scenario that run out of register and pass parameter through stack?
Good question! Previously, I discussed with GCC folks we won't support vector in emulation. I expected the FE with pass whole vector through stack. So a vector in IR is illegal to ABI and can be splited.
But seems GCC passes it by vector register. https://godbolt.org/z/a67rMhTW6
I'll double confirm with GCC folks.

================
Comment at: llvm/test/CodeGen/X86/fptosi-sat-scalar.ll:2138
+; X64-NEXT:    ucomiss {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; X64-NEXT:    movl $255, %eax
+; X64-NEXT:    cmovael %ecx, %eax
----------------
LuoYuanke wrote:
> It seems less efficient than previous code on NAN, zero handling, but we can improve later.
Yes. Added FIXMEs.

================
Comment at: llvm/test/CodeGen/X86/half.ll:946
+; CHECK-I686-NEXT:    calll __extendhfsf2
+; CHECK-I686-NEXT:    fstps {{[0-9]+}}(%esp)
+; CHECK-I686-NEXT:    movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
----------------
LuoYuanke wrote:
> Why the x87 instruction is generated?
On 32 bit, float and double are passed by x87 register.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D107082/new/

https://reviews.llvm.org/D107082