[llvm-dev] Is it legal to pass a half by value on x86_64?

Fri Mar 5 13:44:18 PST 2021

I think that because the argument was passed in memory and immediately
stored to a local variable, it triggered some copy elision code, and
something went wrong there. In the basic blocks that load from the
%c2_in2_ alloca, the assembly shows 16-bit loads from the same location
that %xmm0 is loaded from in the entry block. So those loads are accessing
the argument directly instead of a local copy. I guess the copy elision
either lost the knowledge that the value needed to be converted, or the
argument shouldn't have been eligible for copy elision at all.
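
As a rough illustration of that failure mode (a sketch, not taken from the
attached IR): the Python snippet below uses the standard struct module as a
stand-in for __gnu_f2h_ieee and compares a proper float-to-half conversion
with a 16-bit load from a slot that still holds a 32-bit float. The helper
names are made up for this example.

```python
import struct


def f2h_bits(value: float) -> int:
    """Intended path: round a float to IEEE half and return its 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def load16_from_float_slot(value: float) -> int:
    """Suspected buggy path: 16-bit load from a slot that still holds a float."""
    slot = struct.pack("<f", value)          # little-endian, as on x86_64
    return struct.unpack("<H", slot[:2])[0]  # only the low two bytes are read


def half_from_bits(bits: int) -> float:
    """Interpret a 16-bit pattern as an IEEE half."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


arg = 2.0
good = f2h_bits(arg)
bad = load16_from_float_slot(arg)
print(hex(good), half_from_bits(good))
print(hex(bad), half_from_bits(bad))
```

For 2.0 the float pattern is 0x40000000, so its low 16 bits are zero and the
value reads back as 0.0, which would match the "always truncated to 0.0"
symptom reported further down the thread.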

~Craig

On Fri, Mar 5, 2021 at 12:57 PM Jason Hafer <jhafer at mathworks.com> wrote:

> Hi Craig,
>
> I am sorry for my poor example, probably better to take me out of the
> middle.
> I have attached the complete IR for the example on which I am working.
> c2_foo() is where we break down.
>
> Cheers.
>
> JP
>
>
> ------------------------------
> *From:* Craig Topper <craig.topper at gmail.com>
> *Sent:* Friday, March 5, 2021 1:23 PM
> *To:* Wang, Pengfei <pengfei.wang at intel.com>
> *Cc:* Jason Hafer <jhafer at mathworks.com>; llvm-dev <
> llvm-dev at lists.llvm.org>
> *Subject:* Re: Is it legal to pass a half by value on x86_64?
>
> For this code the half store from the IR appears to have been removed
> because it is a local variable that was never read from. The store that
> says "4-byte Spill" is a different store and seems to be some -O0 artifact.
> With -O2 the whole thing becomes just a ret.
>
> define void @foo(i8, i8, i8, i8, half) {
> ; CHECK-I686:    callq __gnu_f2h_ieee
>
>   %6 = alloca half
>   store half %4, half* %6, align 1
>   ret void
> }
>
> x86_64-pc-windows gives:
> push rax
> .seh_stackalloc 8
> .seh_endprologue
> movss xmm0, dword ptr [rsp + 48] # xmm0 = mem[0],zero,zero,zero
> movss dword ptr [rsp + 4], xmm0 # 4-byte Spill
> pop rax
> ret
> .seh_handlerdata
> .text
> .seh_endproc
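
For what it's worth, the [rsp + 48] offset is what one would expect for a
fifth argument under the Windows x64 convention: 8 bytes of return address,
32 bytes of home (shadow) space for the four register positions, plus the
8 bytes pushed by the prologue. A small Python sketch of that arithmetic
follows; the function name and its parameters are hypothetical.

```python
SLOT = 8                # every Win64 argument slot is 8 bytes wide
RETURN_ADDRESS = 8      # pushed by the call instruction
REGISTER_POSITIONS = 4  # positions 0..3 use registers but still own home space


def win64_stack_arg_offset(position: int, prologue_adjust: int) -> int:
    """Offset from rsp (after the prologue) of the stack argument at `position`.

    `position` is 0-based; `prologue_adjust` is how many bytes the prologue
    subtracted from rsp (pushes plus any `sub rsp, N`).
    """
    if position < REGISTER_POSITIONS:
        raise ValueError("positions 0..3 are passed in registers")
    return prologue_adjust + RETURN_ADDRESS + SLOT * position


# Fifth argument (position 4) after a single `push rax`:
print(win64_stack_arg_offset(4, prologue_adjust=8))  # 48
```

The same formula puts consecutive stack arguments 8 bytes apart, which is
consistent with the [rsp + 416] and [rsp + 424] loads in the c2_foo listing.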
>
>
> As an experiment, I tried this, which does produce a call to __gnu_f2h_ieee
> on Windows with LLVM 8.0 and LLVM 10.0:
>
> define void @foo(half*, i8, i8, half) {
> store half %3, half* %0, align 1
> ret void
> }
>
>
> For this assembly you provided, I don't see any reads from xmm0, or any
> word stores. So it's hard for me to determine what might be going wrong.
> Can you provide the assembly where xmm0 is eventually used?
>
> mov rax, qword ptr [rsp + 424]
>  movss xmm0, dword ptr [rsp + 416] # xmm0 = mem[0],zero,zero,zero  # <-- moves the data like it wants to convert but never does
>  mov qword ptr [rsp + 344], rcx
>  mov qword ptr [rsp + 336], rdx
>  mov qword ptr [rsp + 328], r8
>  mov qword ptr [rsp + 320], r9
>  mov qword ptr [rsp + 304], 0
>  mov qword ptr [rsp + 296], 0
>  mov qword ptr [rsp + 288], 0
>  mov qword ptr [rsp + 280], 0
>  mov rcx, qword ptr [rsp + 328]
>  mov qword ptr [rsp + 272], rcx
>  mov rcx, qword ptr [rsp + 328]
>  mov rcx, qword ptr [rcx + 8]
>  mov qword ptr [rsp + 264], rcx
>  mov rcx, qword ptr [rsp + 336]
>  mov rcx, qword ptr [rcx + 56]
>  mov qword ptr [rsp + 256], rcx
>  mov dword ptr [rsp + 312], 0
>  mov qword ptr [rsp + 248], rax # 8-byte Spill
>  movss dword ptr
>
>
> ~Craig
>
>
> On Fri, Mar 5, 2021 at 6:46 AM Wang, Pengfei <pengfei.wang at intel.com>
> wrote:
>
> Hi Jason,
>
>
>
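> The different behavior between Linux and Windows comes from the difference
> in calling conventions. Windows passes the first 4 arguments in registers,
> while Linux uses 6 integer registers (and separate XMM registers for
> floating point), so on Windows the fifth argument is passed on the stack.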
>
>
> https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=msvc-160#parameter-passing
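
To make the difference concrete, here is a simplified Python model of where
each scalar argument lands under the two conventions. It assumes the half is
classified like a float (as the thread reports for these LLVM versions) and
ignores aggregates, varargs, and everything else the real ABIs cover; the
function names are invented for this sketch.

```python
WIN64_INT = ["rcx", "rdx", "r8", "r9"]
WIN64_FP = ["xmm0", "xmm1", "xmm2", "xmm3"]
SYSV_INT = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]
SYSV_FP = [f"xmm{i}" for i in range(8)]


def win64_locations(kinds):
    """Win64: registers are tied to argument position; position 4+ is on the stack."""
    locs = []
    for pos, kind in enumerate(kinds):
        if pos >= 4:
            locs.append("stack")
        elif kind == "fp":
            locs.append(WIN64_FP[pos])
        else:
            locs.append(WIN64_INT[pos])
    return locs


def sysv_locations(kinds):
    """SysV x86_64: integer and FP registers are handed out independently."""
    locs = []
    free_int, free_fp = list(SYSV_INT), list(SYSV_FP)
    for kind in kinds:
        pool = free_fp if kind == "fp" else free_int
        locs.append(pool.pop(0) if pool else "stack")
    return locs


foo5 = ["int", "int", "int", "int", "fp"]  # foo(i8, i8, i8, i8, half)
foo4 = ["int", "int", "int", "fp"]         # foo(i8, i8, i8, half)
print("win64 foo5:", win64_locations(foo5))
print("win64 foo4:", win64_locations(foo4))
print("sysv  foo5:", sysv_locations(foo5))
```

Under this model the half in foo(i8, i8, i8, i8, half) is a stack argument on
Windows but arrives in xmm0 on Linux, and dropping one i8 moves it into xmm3
on Windows.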
>
>
>
> Thanks
>
> Pengfei
>
>
>
> *From:* llvm-dev <llvm-dev-bounces at lists.llvm.org> *On Behalf Of *Jason
> Hafer via llvm-dev
> *Sent:* Friday, March 5, 2021 10:21 PM
> *To:* Craig Topper <craig.topper at gmail.com>
> *Cc:* llvm-dev at lists.llvm.org
> *Subject:* Re: [llvm-dev] Is it legal to pass a half by value on x86_64?
>
>
>
> Hi All,
>
>
>
> Thank you very much for all the great information.  This is awesome!
>
>
>
> To circle back on Craig's questions.
>
> I did notice LLVM 11 behaves very differently.
>
>
>
> ** Per: What does "incorrect math operations" mean?
>
> The half is passed to the function as a float.  The function does
> operations with other half numbers.  On Windows, when we don't get the
> float to half conversion, the input is always truncated to 0.0.
>
>
>
> ** Per: "Do you have a more complete IR file for Windows that I can take a
> look at?"
>
> I can get you our IR if you want, but I think it is more convoluted than
> required.  I was working on a unit test and I think all one needs to see
> the anomaly is:
>
> define void @foo(i8, i8, i8, i8, half) {
> ; CHECK-I686:    callq __gnu_f2h_ieee
>
>   %6 = alloca half
>   store half %4, half* %6, align 1
>   ret void
> }
>
> x86_64-pc-windows gives:
> push rax
> .seh_stackalloc 8
> .seh_endprologue
> movss xmm0, dword ptr [rsp + 48] # xmm0 = mem[0],zero,zero,zero
> movss dword ptr [rsp + 4], xmm0 # 4-byte Spill
> pop rax
> ret
> .seh_handlerdata
> .text
> .seh_endproc
>
>
>
> What I find extremely interesting is that the behavior seems to have
> something to do with the stack.  If I drop one of the inputs, then even
> Windows will generate the conversion.
>
>
>
> define void @foo(i8, i8, i8, half) {
> ; CHECK-I686:    callq __gnu_f2h_ieee
>
>   %5 = alloca half
>   store half %3, half* %5, align 1
>   ret void
> }
>
> x86_64-pc-windows gives:
> sub rsp, 40
> .seh_stackalloc 40
> .seh_endprologue
> movabs rax, offset __gnu_f2h_ieee
> movaps xmm0, xmm3
> call rax
> mov word ptr [rsp + 38], ax
> add rsp, 40
> ret
> .seh_handlerdata
> .text
> .seh_endproc
>
>
>
>
>
> ** If interested, here is a dissection of our real asm.
> For both Windows and Linux, our IR calls c2_foo() with a half value of 2.0 (0xH4000):
>
> ...
>
> call void @c2_foo(i8* %S_6, [21 x i8*]* %ptr_gvar_instance_7, %emlrtStack*
> %c2_b_st_, [18 x float]* @15, half 0xH4000, [18 x i8]* %t10)
>
>
>
> They both register this in c2_foo as:
>
> ...
>
>   %c2_in2_ = alloca half
>
>   store half %c2_in2, half* %c2_in2_, align 1
>
>
>
> When we compile them, they both pass 0x40000000 (2.0 as a single) to c2_foo.
>
> The Linux c2_foo() asm addresses this with a float2half conversion:
>
> ...
>
>  mov qword ptr [rsp + 448], rdi
>  mov qword ptr [rsp + 440], rsi
>  mov qword ptr [rsp + 432], rdx
>  mov qword ptr [rsp + 424], rcx
>  movabs rcx, offset __gnu_f2h_ieee     # <---Convert Here
>  mov qword ptr [rsp + 336], r8 # 8-byte Spill
>  call rcx
>  mov word ptr [rsp + 422], ax
>  mov rcx, qword ptr [rsp + 336] # 8-byte Reload
>  mov qword ptr [rsp + 408], rcx
>  mov qword ptr [rsp + 392], 0
>  mov qword ptr [rsp + 384], 0
>  mov qword ptr [rsp + 376], 0
>  mov qword ptr [rsp + 368], 0
>  mov rdx, qword ptr [rsp + 432]
>  mov qword ptr [rsp + 360], rdx
>  mov rdx, qword ptr [rsp + 432]
>  mov rdx, qword ptr [rdx + 8]
>  mov qword ptr [rsp + 352], rdx
>  mov rdx, qword ptr [rsp + 440]
>  mov rdx, qword ptr [rdx + 56]
>  mov qword ptr [rsp + 344], rdx
>  mov dword ptr [rsp + 400], 0
>  jmp .LBB9_9
>
>
>
> The Windows c2_foo() asm is missing this conversion but treats the value
> as if it has been converted.
>
> ...
>
>  mov rax, qword ptr [rsp + 424]
>  movss xmm0, dword ptr [rsp + 416] # xmm0 = mem[0],zero,zero,zero  # <-- moves the data like it wants to convert but never does
>  mov qword ptr [rsp + 344], rcx
>  mov qword ptr [rsp + 336], rdx
>  mov qword ptr [rsp + 328], r8
>  mov qword ptr [rsp + 320], r9
>  mov qword ptr [rsp + 304], 0
>  mov qword ptr [rsp + 296], 0
>  mov qword ptr [rsp + 288], 0
>  mov qword ptr [rsp + 280], 0
>  mov rcx, qword ptr [rsp + 328]
>  mov qword ptr [rsp + 272], rcx
>  mov rcx, qword ptr [rsp + 328]
>  mov rcx, qword ptr [rcx + 8]
>  mov qword ptr [rsp + 264], rcx
>  mov rcx, qword ptr [rsp + 336]
>  mov rcx, qword ptr [rcx + 56]
>  mov qword ptr [rsp + 256], rcx
>  mov dword ptr [rsp + 312], 0
>  mov qword ptr [rsp + 248], rax # 8-byte Spill
>  movss dword ptr
>
>
>
>
>
>
> ------------------------------
>
> *From:* Wang, Pengfei <pengfei.wang at intel.com>
> *Sent:* Friday, March 5, 2021 7:30 AM
> *To:* Sjoerd Meijer <Sjoerd.Meijer at arm.com>; Jason Hafer <
> jhafer at mathworks.com>
> *Cc:* llvm-dev <llvm-dev at lists.llvm.org>
> *Subject:* RE: Is it legal to pass a half by value on x86_64?
>
>
>
> I guess it’s designed for language portability. You can use this type
> across different platforms. Nevertheless, I’m not a FE expert, so I cannot
> think of other intentions.
>
> The _Float16 is a primitive type in the latest x86 ABI, but there’s no X86
> target that supports it yet, so you cannot use it on X86 for now. I think
> that’s the difference from __fp16 and why one should use it.
>
> We also have some discussion here. https://reviews.llvm.org/D97318
>
>
>
> Thanks
>
> Pengfei
>
>
>
> *From:* Sjoerd Meijer <Sjoerd.Meijer at arm.com>
> *Sent:* Friday, March 5, 2021 5:49 PM
> *To:* Jason Hafer <jhafer at mathworks.com>; Wang, Pengfei <
> pengfei.wang at intel.com>
> *Cc:* llvm-dev <llvm-dev at lists.llvm.org>
> *Subject:* Re: Is it legal to pass a half by value on x86_64?
>
>
>
> __fp16 is a pure storage format. You cannot pass it by value, because only
> types permitted by the ABI <https://gitlab.com/x86-psABIs/x86-64-ABI> can be
> passed by value, and __fp16 is not one of them.
>
> Yep. Any specific reason to use a pure storage format? The native type is
> _Float16 and would give some benefits, but this is not yet supported on
> x86, see also:
>
>
>
>
> https://clang.llvm.org/docs/LanguageExtensions.html#half-precision-floating-point
>
>
>
> Cheers,
> Sjoerd.
> ------------------------------
>
> *From:* llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Wang,
> Pengfei via llvm-dev <llvm-dev at lists.llvm.org>
> *Sent:* 05 March 2021 06:28
> *To:* Jason Hafer <jhafer at mathworks.com>
> *Cc:* llvm-dev <llvm-dev at lists.llvm.org>
> *Subject:* Re: [llvm-dev] Is it legal to pass a half by value on x86_64?
>
>
>
> Hi Jason,
>
>
>
> __fp16 is a pure storage format. You cannot pass it by value, because only
> types permitted by the ABI <https://gitlab.com/x86-psABIs/x86-64-ABI> can be
> passed by value, and __fp16 is not one of them.
>
>
>
>    - if "define void @foo(i8, i8, i8, i8, half) " is even legal to use
>
> half, as a target-independent type, is legal in LLVM IR. It’s not a legal
> type for targets without native support, like X86, so the behavior depends
> on how we lower it. But I don’t know why there are differences between
> Linux and Windows. Maybe because “__gnu_f2h_ieee” is a Linux-only function?
>
>
>
> Thanks
>
> Pengfei
>
>
>
> *From:* llvm-dev <llvm-dev-bounces at lists.llvm.org> *On Behalf Of *Jason
> Hafer via llvm-dev
> *Sent:* Friday, March 5, 2021 10:46 AM
> *To:* llvm-dev at lists.llvm.org
> *Cc:* Jason Hafer <jhafer at mathworks.com>
> *Subject:* [llvm-dev] Is it legal to pass a half by value on x86_64?
>
>
>
> Hello,
>
>
>
> I am attempting to understand an anomaly I am seeing when dealing with
> half on Windows and could use some help.
>
>
>
> Using LLVM 8 or 10, if I have IR of the flavor below:
> define void @foo(i8, i8, i8, i8, half) {
>   %6 = alloca half
>   store half %4, half* %6, align 1
>   ...
>   ret void
> }
>
>
>
> Using x86_64-pc-linux, we convert the float passed in with __gnu_f2h_ieee.
>
> Using x86_64-pc-windows I do not get the conversion, so we end up with
> incorrect math operations.
>
>
>
> While investigating I noticed clang gave me the error below:
>
> error: parameters cannot have __fp16 type; did you forget * ?
> void foo(int dc1, int dc2,int dc3,int dc4, __fp16 in)
>
>
>
> So, this got me wondering if "define void @foo(i8, i8, i8, i8, half) " is
> even legal to use or if I should rather pass by ref?  I have yet to find
> documentation to convince me one way or the other.  Thus, I was hoping
> someone here might be able to shed some light on the issue.
>
>
>
> Thank you in advance!
>
>
>
> Cheers,
>
>
>
> JP
>
>