[LLVMdev] Proposal to improve vzeroupper optimization strategy

Thu Sep 19 12:04:57 PDT 2013

Great idea.  I reported this problem before and am glad to see someone trying to tackle it.

cheers.

________________________________________
From: llvmdev-bounces at cs.uiuc.edu [llvmdev-bounces at cs.uiuc.edu] on behalf of Gao, Yunzhong [yunzhong_gao at playstation.sony.com]
Sent: Thursday, September 19, 2013 11:53 AM
To: llvmdev at cs.uiuc.edu
Subject: [LLVMdev] Proposal to improve vzeroupper optimization strategy

Hi all,

I would like to make a proposal about changing the optimization strategy
regarding when to insert a vzeroupper instruction in the x86 backend.

Current implementation:
vzeroupper is inserted into any function that uses AVX instructions. The
insertion points are:
1) before a call instruction;
2) before a return instruction;

Rationale:
vzeroupper is an AVX instruction; it is inserted to avoid the performance
penalty of switching between x86 AVX mode and SSE mode, e.g., when an AVX
function calls an SSE function.

My proposal:
By default, do not insert a vzeroupper instruction unless the function uses
legacy SSE instructions. By a legacy SSE instruction, I mean any vector
instruction that has no v- prefix and writes an XMM register but not a YMM
register. If a legacy SSE instruction is spotted, then insert a vzeroupper
instruction:
1) before a call instruction;
2) before a return instruction;
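
To make the difference between the two strategies concrete, here is a minimal
sketch in C. It only models the rules as described in this email, not the
actual x86 backend pass: the enum insn_kind classification and both counting
functions are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical instruction classes for illustration only;
 * these are not LLVM's real opcodes or data structures. */
enum insn_kind {
    INSN_OTHER,      /* anything that is not a vector instruction */
    INSN_AVX,        /* v-prefixed vector instruction */
    INSN_LEGACY_SSE, /* vector instruction without a v- prefix, writes XMM only */
    INSN_CALL,
    INSN_RET
};

/* Returns true if the function body contains an instruction of the given kind. */
static bool uses_kind(const enum insn_kind *body, size_t n, enum insn_kind kind)
{
    for (size_t i = 0; i < n; i++)
        if (body[i] == kind)
            return true;
    return false;
}

/* Counts calls and returns, i.e. the candidate insertion points. */
static size_t count_exit_points(const enum insn_kind *body, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (body[i] == INSN_CALL || body[i] == INSN_RET)
            count++;
    return count;
}

/* Current strategy as described in this email: any AVX use triggers insertion. */
size_t vzeroupper_count_current(const enum insn_kind *body, size_t n)
{
    assert(body != NULL || n == 0);
    return uses_kind(body, n, INSN_AVX) ? count_exit_points(body, n) : 0;
}

/* Proposed strategy: only a legacy SSE instruction triggers insertion. */
size_t vzeroupper_count_proposed(const enum insn_kind *body, size_t n)
{
    assert(body != NULL || n == 0);
    return uses_kind(body, n, INSN_LEGACY_SSE) ? count_exit_points(body, n) : 0;
}
```

For a body consisting of an AVX instruction, a call, and a return, the first
function counts two vzeroupper instructions and the second counts none; the
second only reports insertions once a legacy SSE instruction appears in the
body.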

Explanation:
If all applications and libraries are compiled with the same toolchain, then
with this proposal, a function can assume that incoming AVX registers have
their top 128 bits either specified or zeroed. Assuming that legacy SSE
instructions are seldom generated, it should rarely be necessary to emit
vzeroupper instructions, which are themselves slow.

Possible problem:
This proposal is biased towards the situation where all applications and
libraries are compiled with the same toolchain. If it is common to mix and
match applications built with different toolchains, this approach might lead to
situations where a vzeroupper instruction is missing when calling from an
LLVM-compiled AVX function to a foreign-compiled SSE function, hence a
transition penalty. One possible way around this issue is to add a
function attribute which specifies whether the caller and callee have the
same architecture, e.g.,
extern int foo(void) __attribute__((nolegacy));
would declare an external function that does not use legacy SSE instructions.
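
One possible reading of how such an attribute could feed into the decision at
a call site is sketched below. This is an assumption about how the idea might
be applied, not a description of any existing compiler behaviour: nolegacy is
only the attribute suggested in this email, and struct call_site and
needs_vzeroupper_before_call are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of what the compiler knows at one call site. */
struct call_site {
    bool caller_upper_ymm_dirty; /* caller may have written the upper 128 bits */
    bool callee_is_nolegacy;     /* callee carries the suggested attribute */
};

/* Returns true if a vzeroupper would be emitted before the call.
 * Assumption: an unmarked callee is treated conservatively as possibly
 * containing legacy SSE instructions. */
bool needs_vzeroupper_before_call(const struct call_site *cs)
{
    assert(cs != NULL);
    if (!cs->caller_upper_ymm_dirty)
        return false; /* nothing to clear, so no transition penalty to avoid */
    return !cs->callee_is_nolegacy;
}
```

Under this reading, a call from AVX code into an unmarked foreign function
still gets a vzeroupper, while a call into a function declared nolegacy does
not.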

Any thoughts?
- Gao.
