[PATCH] D16837: Disable the vzeroupper insertion pass on PS4

Wed Feb 3 08:24:45 PST 2016

andreadb added a subscriber: andreadb.
andreadb added a comment.

In http://reviews.llvm.org/D16837#343006, @probinson wrote:

> As long as the consequence of running such code on a non-btver2 CPU is merely performance, not correctness.
>  I seem to remember that being a concern in the first attempt at turning off vzeroupper, years ago.  Something about the consistency of behavior of code in a library, IIRC, when caller and callee were compiled for different CPUs and did not have the same concept of whether the upper parts had been zeroed.  Sorry I don't remember the specifics better than that, and I certainly don't know enough about the microarchitectural details to say one way or the other.

My understanding is that this should only affect performance.

The problem arises when you mix legacy SSE instructions with AVX instructions. Legacy SSE instructions do not modify the upper 128 bits of the YMM registers, so a legacy SSE write is a partial register write, and this may cause false dependencies.

So, if a library is built for a non-AVX CPU (or if the library cannot avoid using legacy SSE code), the absence of vzeroupper in the code has the potential to cause stalls due to false dependencies (when there is an AVX-SSE transition).
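To make the AVX-SSE transition concrete, here is a minimal sketch, assuming GCC or Clang. The names `sum8_avx` and `sum8` are hypothetical and are not part of the patch. The explicit `_mm256_zeroupper()` call only marks the spot where a compiler would normally emit vzeroupper on its own before returning to code that may use legacy SSE encodings.

```cpp
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define SKETCH_HAVE_X86 1
#endif

#ifdef SKETCH_HAVE_X86
// Hypothetical helper: only this function is compiled with AVX enabled,
// so it uses VEX-encoded instructions and dirties the upper half of a YMM register.
__attribute__((target("avx")))
static float sum8_avx(const float *p) {
    __m256 v  = _mm256_loadu_ps(p);            // writes all 256 bits of a YMM register
    __m128 lo = _mm256_castps256_ps128(v);     // lower 128 bits
    __m128 hi = _mm256_extractf128_ps(v, 1);   // upper 128 bits
    __m128 s  = _mm_add_ps(lo, hi);            // four partial sums
    float out[4];
    _mm_storeu_ps(out, s);
    // Clear the upper 128 bits of all YMM registers before returning to
    // code that may be built with legacy SSE encodings.
    _mm256_zeroupper();
    return out[0] + out[1] + out[2] + out[3];
}
#endif

// Hypothetical caller: built without AVX, so on x86-64 its floating-point
// code uses legacy SSE encodings.
float sum8(const float *p) {
#ifdef SKETCH_HAVE_X86
    if (__builtin_cpu_supports("avx"))
        return sum8_avx(p);
#endif
    // Scalar fallback for CPUs without AVX and for non-x86 targets.
    float s = 0.0f;
    for (int i = 0; i < 8; ++i)
        s += p[i];
    return s;
}
```

Dropping the `_mm256_zeroupper()` call does not change the value that `sum8` returns. It only changes whether the SSE code that runs afterwards sees dirty upper halves, which is why the concern here is performance rather than correctness.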

On AMD Fam 15h processors (and btver2) there is no penalty for AVX-SSE transitions. This is an important difference with respect to Intel processors where, for each SSE-AVX transition, the hardware saves and restores the upper 128 bits of the YMM registers. I think that is the reason why vzeroupper is very fast on Intel, while on btver2 it is microcoded (and extremely slow!).
Also, since Fam 15h, AMD processors implement an XMM register merge optimization: the hardware keeps track of XMM registers whose upper portions have been cleared to zero.

http://reviews.llvm.org/D16837