<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Feb 3, 2016 at 8:24 AM, Andrea Di Biagio <span dir="ltr"><<a href="mailto:Andrea_DiBiagio@sn.scee.net" target="_blank">Andrea_DiBiagio@sn.scee.net</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">andreadb added a subscriber: andreadb.<br>
andreadb added a comment.<br>
<span class=""><br>
In <a href="http://reviews.llvm.org/D16837#343006" rel="noreferrer" target="_blank">http://reviews.llvm.org/D16837#343006</a>, @probinson wrote:<br>
<br>
> As long as the consequence of running such code on a non-btver2 CPU is merely performance, not correctness.<br>
> I seem to remember that being a concern in the first attempt at turning off vzeroupper, years ago. Something about the consistency of behavior of code in a library, IIRC, when caller and callee were compiled for different CPUs and did not have the same concept of whether the upper parts had been zeroed. Sorry I don't remember the specifics better than that, and I certainly don't know enough about the microarchitectural details to say one way or the other.<br>
<br>
<br>
</span>My understanding is that this should only affect performance.<br>
<br>
The problem is when you mix legacy SSE instructions with AVX instructions. Legacy SSE instructions do not affect the upper 128 bits of the YMM registers. This may cause false dependencies due to partial register writes.<br>
<br>
So, if a library is built for a non-AVX CPU (or if the library cannot avoid using legacy SSE code), the absence of vzeroupper in the code has the potential of causing stalls due to false dependencies (when there is an AVX-SSE transition).<br></blockquote><div><br></div><div><div>It isn't about false dependencies per se. It is about a state-transition penalty specific to certain Intel microarchitectures. You can read more about it here: <a href="https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties">https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties</a></div><div><br></div><div>Basically, the expensive part is that the chip saves and restores the upper halves of the ymm registers when transitioning between 256b and 128b modes. vzeroupper (well, and I assume vzeroall too) is the only way to communicate to the processor "don't bother to save that state" (see the article). According to the article, even `XOR reg,reg` doesn't communicate this.</div><div><br></div><div>So the relevant Intel microarchitectures take a performance penalty much greater than simply a loss of ILP due to false dependencies.</div><div><br></div><div>I assume that this is done for reasons similar to why Jaguar has a side cache for storing x87 registers (it frees up space in the PRF or other resources).<br></div><div><br></div></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<br>
On AMD Fam 15h processors (and Btver2) there is no penalty for AVX-SSE transitions. This is an important difference with respect to Intel processors where, for each SSE-AVX transition, the hardware saves and restores the upper 128 bits of the YMM registers. I think that is the reason why on Intel, vzeroupper is very fast, while on btver2 vzeroupper is microcoded (and extremely slow!).<br>
Also, (since Fam 15) AMD processors implement an XMM register merge optimization; the hardware keeps track of XMM registers whose upper portions have been cleared to zeros.<br></blockquote><div><br></div><div>It is somewhat misleading to say "since Fam 15" (I assume you mean 15h, which is Bulldozer; decimal 15 is K8, which is way old and doesn't even have 256b vectors), since the AMD microarchitectures don't have a linear history. For example, Jaguar (16h) is a successor of Bobcat (14h) and is pretty much completely different from Bulldozer (15h), which is the successor of K10 (10h).</div><div><br class="">-- Sean Silva<br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<br>
<br>
<a href="http://reviews.llvm.org/D16837" rel="noreferrer" target="_blank">http://reviews.llvm.org/D16837</a><br>
<br>
<br>
<br>
</blockquote></div><br></div></div>