[LLVMdev] Proposal to improve vzeroupper optimization strategy

Fri Sep 20 14:52:47 PDT 2013

Hi Manny,
Thanks! You said that you reported this problem before. Do you know whether
there is an existing LLVM Bugzilla ticket for it?
- Gao.


> -----Original Message-----
> From: Manny Ko [mailto:Manny.Ko at imgtec.com]
> Sent: Thursday, September 19, 2013 12:05 PM
> To: Gao, Yunzhong; llvmdev at cs.uiuc.edu
> Subject: RE: Proposal to improve vzeroupper optimization strategy
> 
> Great idea.  I reported this problem before and am glad to see someone
> trying to tackle it.
> 
> cheers.
> 
> ________________________________________
> From: llvmdev-bounces at cs.uiuc.edu [llvmdev-bounces at cs.uiuc.edu] on
> behalf of Gao, Yunzhong [yunzhong_gao at playstation.sony.com]
> Sent: Thursday, September 19, 2013 11:53 AM
> To: llvmdev at cs.uiuc.edu
> Subject: [LLVMdev] Proposal to improve vzeroupper optimization strategy
> 
> Hi all,
> 
> I would like to make a proposal about changing the optimization strategy
> regarding when to insert a vzeroupper instruction in the x86 backend.
> 
> Current implementation:
> vzeroupper is inserted into any function that uses AVX instructions. The
> insertion points are:
> 1) before a call instruction;
> 2) before a return instruction;
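
The current strategy described above can be sketched as a toy model. This is only an illustration under stated assumptions: OpKind and insert_vzeroupper_current are hypothetical names, the instruction stream is a flat array rather than LLVM machine instructions, and the real backend pass may be more selective than this description.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical instruction kinds for a toy model; not LLVM's own types. */
typedef enum {
    OP_AVX, OP_LEGACY_SSE, OP_SCALAR, OP_CALL, OP_RET, OP_VZEROUPPER
} OpKind;

/* Returns true if any instruction in the function has kind k. */
static bool uses_kind(const OpKind *in, size_t n, OpKind k) {
    for (size_t i = 0; i < n; ++i)
        if (in[i] == k)
            return true;
    return false;
}

/* Current strategy as described in the post: if the function uses any AVX
 * instruction, emit vzeroupper before every call and every return.
 * out must have room for at least 2 * n entries; returns the new length. */
size_t insert_vzeroupper_current(const OpKind *in, size_t n, OpKind *out) {
    bool uses_avx = uses_kind(in, n, OP_AVX);
    size_t m = 0;
    for (size_t i = 0; i < n; ++i) {
        if (uses_avx && (in[i] == OP_CALL || in[i] == OP_RET))
            out[m++] = OP_VZEROUPPER;
        out[m++] = in[i];
    }
    return m;
}
```

In this model a function with no AVX instructions is copied through unchanged, while an AVX function pays for one vzeroupper per call and per return.
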
> 
> Rationale:
> vzeroupper is an AVX instruction; it is inserted to avoid the performance
> penalty incurred when switching between x86 AVX mode and SSE mode, e.g.,
> when an AVX function calls an SSE function.
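
To make the rationale concrete, here is a deliberately simplified model of the transition penalty. It assumes a single "upper halves dirty" flag and charges one penalty when a legacy SSE instruction runs while that flag is set. Event and count_transition_penalties are hypothetical names, and real hardware behavior differs in detail between microarchitectures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event kinds: a 256-bit AVX instruction, a legacy
 * (non-VEX) SSE instruction, or a vzeroupper. */
typedef enum { EV_AVX256, EV_LEGACY_SSE, EV_VZEROUPPER } Event;

/* Counts how many times a legacy SSE instruction executes while the upper
 * halves of the YMM registers are dirty. Simplifying assumption: after one
 * penalty the hardware has saved the upper halves, so consecutive SSE
 * instructions are not charged again. The cost of switching back to AVX
 * is ignored. */
size_t count_transition_penalties(const Event *ev, size_t n) {
    bool upper_dirty = false;
    size_t penalties = 0;
    for (size_t i = 0; i < n; ++i) {
        switch (ev[i]) {
        case EV_AVX256:
            upper_dirty = true;
            break;
        case EV_VZEROUPPER:
            upper_dirty = false;
            break;
        case EV_LEGACY_SSE:
            if (upper_dirty) {
                ++penalties;
                upper_dirty = false;
            }
            break;
        }
    }
    return penalties;
}
```

Placing EV_VZEROUPPER between the AVX and SSE events brings the count to zero, which is the effect the insertion is meant to achieve.
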
> 
> My proposal:
> By default, do not insert a vzeroupper instruction unless the function uses
> legacy SSE instructions. By a legacy SSE instruction, I mean any vector
> instruction that does not have a v- prefix and that writes an XMM register
> but not a YMM register. If a legacy SSE instruction is spotted, then insert
> a vzeroupper instruction:
> 1) before a call instruction;
> 2) before a return instruction;
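
The proposed rule can be sketched in the same toy style. Everything here is an assumption made for illustration: the Inst record, its writes_xmm / writes_ymm flags, and the function names are hypothetical, and the legacy-SSE test is a literal reading of the definition above (no v- prefix, writes XMM but not YMM).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical instruction record for a toy model; not LLVM's MachineInstr. */
typedef enum { K_OTHER, K_VECTOR, K_CALL, K_RET, K_VZEROUPPER } Kind;

typedef struct {
    Kind kind;
    const char *mnemonic;
    bool writes_xmm;
    bool writes_ymm;
} Inst;

/* Legacy SSE per the post: a vector instruction without a v- prefix that
 * writes an XMM register but not a YMM register. */
bool is_legacy_sse(const Inst *i) {
    return i->kind == K_VECTOR && i->mnemonic[0] != 'v' &&
           i->writes_xmm && !i->writes_ymm;
}

/* Proposed strategy: emit vzeroupper before calls and returns only if the
 * function contains at least one legacy SSE instruction.
 * out must have room for at least 2 * n entries; returns the new length. */
size_t insert_vzeroupper_proposed(const Inst *in, size_t n, Inst *out) {
    const Inst vzu = { K_VZEROUPPER, "vzeroupper", false, false };
    bool has_legacy = false;
    for (size_t i = 0; i < n; ++i)
        if (is_legacy_sse(&in[i]))
            has_legacy = true;

    size_t m = 0;
    for (size_t i = 0; i < n; ++i) {
        if (has_legacy && (in[i].kind == K_CALL || in[i].kind == K_RET))
            out[m++] = vzu;
        out[m++] = in[i];
    }
    return m;
}
```

A function containing only v-prefixed instructions passes through unchanged in this model, which is the common case the proposal is optimizing for.
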
> 
> Explanation:
> If all applications and libraries are compiled with the same toolchain, then
> with this proposal, a function can assume that incoming AVX registers have
> their top 128 bits either specified or zeroed. Assuming that legacy SSE
> instructions will seldom be generated, it should rarely be necessary to emit
> vzeroupper, which is itself a slow instruction.
> 
> Possible problem:
> This proposal is biased towards the situation where all applications and
> libraries are compiled with the same toolchain. If it is common to mix and
> match applications built with different toolchains, this approach might
> lead to a vzeroupper instruction being missing when an LLVM-compiled AVX
> function calls a foreign-compiled SSE function, and hence to a transition
> penalty. One possible way around this issue is to add a function attribute
> that specifies whether the caller and callee have the same architecture,
> e.g., extern int foo(void) __attribute__((nolegacy)); would declare an
> external function that does not use legacy SSE instructions.
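
One way the attribute could feed into the call-site decision is sketched below. This is one possible reading, not a specification: nolegacy is only a proposed attribute and no compiler implements it, so the NOLEGACY macro expands to nothing here. CalleeInfo and needs_vzeroupper_before_call are hypothetical names, and treating an unmarked callee conservatively is an assumption of this sketch.

```c
#include <stdbool.h>

/* Hypothetical: under the proposal this might expand to
 * __attribute__((nolegacy)). It is empty here so the sketch compiles. */
#define NOLEGACY

/* Declaration of an external function promised not to use legacy SSE. */
extern int foo(void) NOLEGACY;

/* What the compiler would know about a callee in this toy model. */
typedef struct {
    const char *name;
    bool nolegacy; /* true if declared with the hypothetical attribute */
} CalleeInfo;

/* Assumed decision rule:
 *  - if the caller itself uses legacy SSE, the proposal's base rule applies
 *    and vzeroupper is inserted before the call;
 *  - otherwise, insert only when the caller uses AVX and the callee has
 *    not promised to be free of legacy SSE. */
bool needs_vzeroupper_before_call(bool caller_uses_avx,
                                  bool caller_uses_legacy_sse,
                                  const CalleeInfo *callee) {
    if (caller_uses_legacy_sse)
        return true;
    return caller_uses_avx && !callee->nolegacy;
}
```

With this rule, calls between functions built by the same toolchain and marked accordingly need no vzeroupper, while calls into unmarked foreign code keep the conservative behavior.
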
> 
> Any thoughts?
> - Gao.
> 