<div dir="ltr">Great! Glad to see you are working on this.</div><div class="gmail_extra"><br><br><div class="gmail_quote">On Thu, Sep 19, 2013 at 3:04 PM, Manny Ko <span dir="ltr">&lt;<a href="mailto:Manny.Ko@imgtec.com" target="_blank">Manny.Ko@imgtec.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Great idea.  I reported this problem before and am glad to see someone trying to tackle it.<br>

<br>

cheers.<br>

<br>

________________________________________<br>

From: <a href="mailto:llvmdev-bounces@cs.uiuc.edu">llvmdev-bounces@cs.uiuc.edu</a> [<a href="mailto:llvmdev-bounces@cs.uiuc.edu">llvmdev-bounces@cs.uiuc.edu</a>] on behalf of Gao, Yunzhong [<a href="mailto:yunzhong_gao@playstation.sony.com">yunzhong_gao@playstation.sony.com</a>]<br>


Sent: Thursday, September 19, 2013 11:53 AM<br>

To: <a href="mailto:llvmdev@cs.uiuc.edu">llvmdev@cs.uiuc.edu</a><br>

Subject: [LLVMdev] Proposal to improve vzeroupper optimization strategy<br>

<div class="HOEnZb"><div class="h5"><br>

Hi all,<br>

<br>

I would like to make a proposal about changing the optimization strategy<br>

regarding when to insert a vzeroupper instruction in the x86 backend.<br>

<br>

Current implementation:<br>

vzeroupper is inserted into any function that uses AVX instructions. The<br>

insertion points are:<br>

1) before a call instruction;<br>

2) before a return instruction;<br>
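
<br>
As a rough illustration, the current strategy described above can be modelled in C.<br>
This is a toy sketch, not LLVM's actual pass: the op_kind enum and the function<br>
name are made up for the example, and a function body is reduced to a flat list<br>
of instruction kinds.<br>

```c
/* Toy instruction kinds (hypothetical; not LLVM's real opcodes). */
enum op_kind { OP_OTHER, OP_AVX, OP_LEGACY_SSE, OP_CALL, OP_RET };

/* Number of vzeroupper instructions the current strategy, as described
   above, would insert into a function whose body is ops[0..n-1]. */
int count_vzeroupper_current(const enum op_kind *ops, int n)
{
    int uses_avx = 0;    /* does the function contain any AVX instruction? */
    int exit_points = 0; /* number of calls and returns in the body */
    int i;

    for (i = 0; i < n; ++i) {
        if (ops[i] == OP_AVX)
            uses_avx = 1;
        if (ops[i] == OP_CALL || ops[i] == OP_RET)
            ++exit_points;
    }
    /* One vzeroupper before every call and return, but only in AVX functions. */
    return uses_avx ? exit_points : 0;
}
```

In this model, a body with AVX code, one call and one return gets two vzeroupper<br>
instructions, while a body with no AVX instructions gets none.<br>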

<br>

Rationale:<br>

vzeroupper is an AVX instruction; it is inserted to avoid the performance penalty<br>

of switching between x86 AVX mode and SSE mode, e.g., when an AVX function<br>

calls an SSE function.<br>

<br>

My proposal:<br>

By default, do not insert a vzeroupper instruction unless a function uses legacy<br>

SSE instructions. By a legacy SSE instruction, I mean any vector instruction<br>

that has no v- prefix and writes an XMM register but not a YMM register. If a<br>

legacy SSE instruction is spotted, then insert a vzeroupper instruction:<br>

1) before a call instruction;<br>

2) before a return instruction;<br>
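
<br>
A minimal C sketch of the proposed rule, under the assumption that a function body<br>
can be reduced to a flat list of instruction kinds. The op_kind enum and the<br>
function name are hypothetical and exist only for this example; they are not part<br>
of LLVM.<br>

```c
/* Toy instruction kinds (hypothetical; not LLVM's real opcodes). */
enum op_kind { OP_OTHER, OP_AVX, OP_LEGACY_SSE, OP_CALL, OP_RET };

/* Number of vzeroupper instructions the proposed strategy would insert
   into a function whose body is ops[0..n-1]. */
int count_vzeroupper_proposed(const enum op_kind *ops, int n)
{
    int uses_legacy_sse = 0; /* any non-VEX instruction writing XMM only? */
    int exit_points = 0;     /* number of calls and returns in the body */
    int i;

    for (i = 0; i < n; ++i) {
        if (ops[i] == OP_LEGACY_SSE)
            uses_legacy_sse = 1;
        if (ops[i] == OP_CALL || ops[i] == OP_RET)
            ++exit_points;
    }
    /* Pure AVX functions get nothing; functions with legacy SSE get one
       vzeroupper before every call and return. */
    return uses_legacy_sse ? exit_points : 0;
}
```

In this model, a pure AVX body with one call and one return needs no vzeroupper,<br>
whereas the same body with one legacy SSE instruction added needs two.<br>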

<br>

Explanation:<br>

If all applications and libraries are compiled with the same toolchain, then<br>

with this proposal, a function can assume that incoming AVX registers have<br>

their top 128 bits either specified or zeroed. Assuming that legacy SSE<br>

instructions are seldom generated, it should rarely be necessary to emit<br>

a vzeroupper instruction, which is itself a slow instruction.<br>

<br>

Possible problem:<br>

This proposal is biased towards the situation where all applications and<br>

libraries are compiled with the same toolchain. If it is common to mix and<br>

match applications built with different toolchains, this approach might lead to<br>

situations where a vzeroupper instruction is missing when calling from an<br>

LLVM-compiled AVX function to a foreign-compiled SSE function, hence a<br>

transition penalty. One possible way around this issue is to add a<br>

function attribute which specifies whether the caller and callee have the<br>

same architecture, e.g.,<br>

extern int foo(void) __attribute__((nolegacy));<br>

would declare an external function that does not use legacy SSE instructions.<br>
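
<br>
One way such an attribute might drive the decision at a call site is sketched below.<br>
Everything here is an assumption made for illustration: the nolegacy attribute does<br>
not exist in any compiler I know of, so the NOLEGACY macro expands to nothing, and<br>
callee_info and needs_vzeroupper_before_call are made-up names.<br>

```c
/* Hypothetical: no compiler is known to implement a "nolegacy" attribute,
   so this macro expands to nothing and only documents intent. */
#define NOLEGACY

/* The callee promises not to use legacy SSE instructions. */
extern int foo(void) NOLEGACY;

/* What the compiler would know about a callee (assumed model). */
struct callee_info {
    int nolegacy; /* nonzero if declared with the hypothetical attribute */
};

/* Assumed rule: a vzeroupper is needed before the call only when the
   caller uses AVX (so upper YMM halves may be dirty) and the callee
   makes no promise about legacy SSE instructions. */
int needs_vzeroupper_before_call(int caller_uses_avx, struct callee_info callee)
{
    return caller_uses_avx && !callee.nolegacy;
}
```

Under this rule, a call from an AVX function to an unannotated foreign function<br>
still gets a vzeroupper, while a call to a function marked nolegacy does not.<br>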

<br>

Any thoughts?<br>

- Gao.<br>

<br>

_______________________________________________<br>

LLVM Developers mailing list<br>

<a href="mailto:LLVMdev@cs.uiuc.edu">LLVMdev@cs.uiuc.edu</a>         <a href="http://llvm.cs.uiuc.edu" target="_blank">http://llvm.cs.uiuc.edu</a><br>

<a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a><br>

<br>

</div></div></blockquote></div><br></div>