[llvm] r175043 - Prevent insertion of

Gao, Yunzhong yunzhong_gao at playstation.sony.com
Tue Sep 3 12:16:36 PDT 2013


Hi Elena,
Thank you for getting back to me on this issue.
Hmm, maybe my question was not very clear. Here is the output I get by running
the first RUN line inside llvm/trunk/test/CodeGen/X86/avx-intel-ocl.ll:

# llc < %s -mtriple=i686-apple-darwin -mcpu=corei7-avx -mattr=+avx
#
  _test_float4:
    subl    $156, %esp
    vmovups %ymm2, 64(%esp)         ## 32-byte Folded Spill
    vmovups %ymm1, 32(%esp)         ## 32-byte Folded Spill
    vmovups %ymm0, (%esp)           ## 32-byte Folded Spill
                                    ## kill: XMM0<def> XMM0<kill> YMM0<kill>
                                    ## kill: XMM1<def> XMM1<kill> YMM1<kill>
                                    ## kill: XMM2<def> XMM2<kill> YMM2<kill>
    vzeroupper
    calll   L_func_float4$stub
                                    ## kill: XMM0<def> XMM0<kill> YMM0<def>

It seems to me that the vzeroupper instruction is redundant here, because the three
vmovups instructions have already cleared the upper 128 bits of ymm0-ymm2.
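
For what it's worth, here is a tiny standalone model of the upper-state
tracking that I understand the VZeroUpper pass to perform. The names and
types are invented for illustration; this is not LLVM code:

#include <vector>

enum class Kind { WritesYmm, Call, CallPreservingYmms, VZeroUpper, Other };
struct Inst { Kind kind; };

// Count the vzeroupper instructions such a pass would insert into a block.
int countInsertedVZeroUppers(const std::vector<Inst> &Block) {
  bool UpperDirty = false; // have bits 255:128 of some YMM been written?
  int Inserted = 0;
  for (const Inst &I : Block) {
    switch (I.kind) {
    case Kind::WritesYmm:          // a 256-bit write dirties the state
      UpperDirty = true;
      break;
    case Kind::Call:               // ordinary call: clean up first to
      if (UpperDirty) {            // avoid the AVX/SSE transition penalty
        ++Inserted;
        UpperDirty = false;
      }
      break;
    case Kind::CallPreservingYmms: // r175043: YMM values stay live across
      break;                       // the call, so no vzeroupper here
    case Kind::VZeroUpper:
      UpperDirty = false;
      break;
    case Kind::Other:
      break;
    }
  }
  return Inserted;
}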

- Gao.


________________________________________
From: Demikhovsky, Elena [elena.demikhovsky at intel.com]
Sent: Saturday, August 31, 2013 10:45 PM
To: Gao, Yunzhong
Cc: llvm-commits at cs.uiuc.edu
Subject: RE: [llvm] r175043 - Prevent insertion of

Hi Gao,

Intel_ocl_bi is a special calling convention for the math library used by Intel OpenCL (it is not generic).
According to this convention, the YMM registers are preserved by the callee on X64 and Win64 (the set of preserved YMMs differs between the two platforms). On X32/Win32 the YMMs are not preserved at all, since there are only 8 SIMD registers on a 32-bit platform. We worked a lot on performance tuning and found this convention optimal for us.
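
In backend terms, that means selecting a different callee-saved register
list per platform. A hedged sketch of the idea (the subsets below are
illustrative examples only, not the exact sets; I believe the real lists
live in LLVM's X86CallingConv.td):

#include <string>
#include <vector>

enum class Platform { Win64, X64, Win32OrX32 };

// Which YMM registers the intel_ocl_bi callee must preserve.
std::vector<std::string> calleeSavedYmms(Platform P) {
  switch (P) {
  case Platform::Win64: // one preserved subset on Win64...
    return {"ymm6", "ymm7", "ymm8", "ymm9", "ymm10",
            "ymm11", "ymm12", "ymm13", "ymm14", "ymm15"};
  case Platform::X64:   // ...a different subset on X64...
    return {"ymm8", "ymm9", "ymm10", "ymm11",
            "ymm12", "ymm13", "ymm14", "ymm15"};
  case Platform::Win32OrX32:
    return {};          // ...and nothing on 32-bit: only ymm0-ymm7
                        // exist there, and none are preserved.
  }
  return {};
}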

- Elena


-----Original Message-----
From: Gao, Yunzhong [mailto:yunzhong_gao at playstation.sony.com]
Sent: Thursday, August 29, 2013 22:22
To: Demikhovsky, Elena
Cc: llvm-commits at cs.uiuc.edu
Subject: RE: [llvm] r175043 - Prevent insertion of

ping.

________________________________________
From: llvm-commits-bounces at cs.uiuc.edu [llvm-commits-bounces at cs.uiuc.edu] on behalf of Gao, Yunzhong
Sent: Friday, August 23, 2013 1:48 PM
To: llvm-commits at cs.uiuc.edu
Subject: Re: [llvm] r175043 - Prevent insertion of

Elena Demikhovsky <elena.demikhovsky at ...> writes:

> Author: delena
> Date: Wed Feb 13 02:02:04 2013
> New Revision: 175043
>
> URL: http://llvm.org/viewvc/llvm-project?rev=175043&view=rev
> Log:
> Prevent insertion of "vzeroupper" before call that preserves YMM
> registers, since a caller uses preserved registers across the call.
>
> Modified:
>     llvm/trunk/lib/Target/X86/X86VZeroUpper.cpp
>     llvm/trunk/test/CodeGen/X86/avx-intel-ocl.ll
>
> Modified: llvm/trunk/lib/Target/X86/X86VZeroUpper.cpp
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86VZeroUpper.cpp?rev=175043&r1=175042&r2=175043&view=diff
> ==============================================================================
> --- llvm/trunk/lib/Target/X86/X86VZeroUpper.cpp (original)
> +++ llvm/trunk/lib/Target/X86/X86VZeroUpper.cpp Wed Feb 13 02:02:04 2013
> @@ -120,9 +120,19 @@ static bool checkFnHasLiveInYmm(MachineR
>    return false;
>  }
>
> +static bool clobbersAllYmmRegs(const MachineOperand &MO) {
> +  for (unsigned reg = X86::YMM0; reg < X86::YMM15; ++reg) {
> +    if (!MO.clobbersPhysReg(reg))
> +      return false;
> +  }
> +  return true;
> +}
> +
>  static bool hasYmmReg(MachineInstr *MI) {
>    for (unsigned i = 0, e = MI->getNumOperands(); i != e; ++i) {
>      const MachineOperand &MO = MI->getOperand(i);
> +    if (MI->isCall() && MO.isRegMask() && !clobbersAllYmmRegs(MO))
> +      return true;
>      if (!MO.isReg())
>        continue;
>      if (MO.isDebug())
>
> Modified: llvm/trunk/test/CodeGen/X86/avx-intel-ocl.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx-intel-ocl.ll?rev=175043&r1=175042&r2=175043&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/avx-intel-ocl.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/avx-intel-ocl.ll Wed Feb 13 02:02:04 2013
> @@ -127,3 +127,43 @@ define i32 @test_int(i32 %a, i32 %b) nou
>      %c = add i32 %c2, %b
>       ret i32 %c
>  }
> +
> +; WIN64: test_float4
> +; WIN64-NOT: vzeroupper
> +; WIN64: call
> +; WIN64-NOT: vzeroupper
> +; WIN64: call
> +; WIN64: ret
> +
> +; X64: test_float4
> +; X64-NOT: vzeroupper
> +; X64: call
> +; X64-NOT: vzeroupper
> +; X64: call
> +; X64: ret
> +
> +; X32: test_float4
> +; X32: vzeroupper
> +; X32: call
> +; X32: vzeroupper
> +; X32: call
> +; X32: ret
> +
> +declare <4 x float> @func_float4(<4 x float>, <4 x float>, <4 x float>)
> +
> +define <8 x float> @test_float4(<8 x float> %a, <8 x float> %b, <8 x float> %c) nounwind readnone {
> +entry:
> +  %0 = shufflevector <8 x float> %a, <8 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
> +  %1 = shufflevector <8 x float> %b, <8 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
> +  %2 = shufflevector <8 x float> %c, <8 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
> +  %call.i = tail call intel_ocl_bicc <4 x float> @func_float4(<4 x float> %0, <4 x float> %1, <4 x float> %2) nounwind
> +  %3 = shufflevector <4 x float> %call.i, <4 x float> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
> +  %4 = shufflevector <8 x float> %a, <8 x float> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
> +  %5 = shufflevector <8 x float> %b, <8 x float> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
> +  %6 = shufflevector <8 x float> %c, <8 x float> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
> +  %call.i2 = tail call intel_ocl_bicc <4 x float> @func_float4(<4 x float> %4, <4 x float> %5, <4 x float> %6) nounwind
> +  %7 = shufflevector <4 x float> %call.i2, <4 x float> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
> +  %8 = shufflevector <8 x float> %3, <8 x float> %7, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 10, i32 11>
> +  ret <8 x float> %8
> +}
> +
>

Hi Elena,
I would like to discuss your commit r175043 with you.

In X86VZeroUpper.cpp, how does clobbersAllYmmRegs() check that YMM registers are preserved? It seems that this function is only checking that at least one YMM register is not written/clobbered.
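
To make the distinction concrete, here is a small self-contained model of
the two predicates (not LLVM code; a set bit below means "this YMM register
is clobbered by the call", playing the role of the call's register-mask
operand):

#include <bitset>

using YmmMask = std::bitset<16>; // one bit per ymm0..ymm15

// What the committed helper computes: every YMM is clobbered.
bool clobbersAllYmmRegs(const YmmMask &Clobbered) {
  return Clobbered.all();
}

// What "the YMM registers are preserved by the callee" would mean.
bool preservesAllYmmRegs(const YmmMask &Clobbered) {
  return Clobbered.none();
}

// The committed condition fires on !clobbersAllYmmRegs(M), i.e. as soon
// as at least ONE YMM register survives the call. Example: a call that
// clobbers ymm0-ymm14 but preserves only ymm15:
//   YmmMask M("0111111111111111");
//   clobbersAllYmmRegs(M)  == false  // so the new condition is taken,
//   preservesAllYmmRegs(M) == false  // yet the YMMs are not all preserved.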

In avx-intel-ocl.ll, why does X32 have a different check sequence than X64 and WIN64? It seems that test_float4() is trying to create undef values in the upper 128 bits of a YMM register with the following IR instructions:

+  %0 = shufflevector <8 x float> %a, <8 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %1 = shufflevector <8 x float> %b, <8 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %2 = shufflevector <8 x float> %c, <8 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>

But the X86 backend generates AVX vmovups instructions for both the X32 and X64 cases, and those clear the upper 128 bits instead of leaving undef values there. So it seems that vzeroupper is not needed in either case, right?
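
To spell out the architectural rule I am relying on, here is a toy model
with invented types (the rule itself is that VEX-encoded 128-bit
instructions zero bits 255:128 of the destination register, while legacy
SSE encodings leave them unmodified):

#include <cstdint>

// A YMM register modeled as two 128-bit halves; "undef" is an explicit
// flag here because C++ has no direct way to express it.
struct Ymm {
  uint64_t lo[2];  // bits 127:0, the XMM part
  uint64_t hi[2];  // bits 255:128
  bool hiDefined;  // does hi hold an architecturally defined value?
};

// VEX-encoded 128-bit write (e.g. AVX vmovups to an xmm register):
// the upper half becomes zero, a known value rather than undef.
void vexWrite128(Ymm &R, uint64_t a, uint64_t b) {
  R.lo[0] = a; R.lo[1] = b;
  R.hi[0] = 0; R.hi[1] = 0;
  R.hiDefined = true;
}

// Legacy SSE 128-bit write (non-VEX movups): bits 255:128 are left
// untouched, which is the mixed state vzeroupper exists to clean up.
void sseWrite128(Ymm &R, uint64_t a, uint64_t b) {
  R.lo[0] = a; R.lo[1] = b;
  // R.hi and R.hiDefined intentionally unchanged.
}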

Thanks,
- Gao.


_______________________________________________
llvm-commits mailing list
llvm-commits at cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
