[PATCH] D16837: Disable the vzeroupper insertion pass on PS4

Wed Feb 3 11:49:18 PST 2016

On Wed, Feb 3, 2016 at 8:24 AM, Andrea Di Biagio <
Andrea_DiBiagio at sn.scee.net> wrote:

> andreadb added a subscriber: andreadb.
> andreadb added a comment.
>
> In http://reviews.llvm.org/D16837#343006, @probinson wrote:
>
> > As long as the consequence of running such code on a non-btver2 CPU is
> > merely performance, not correctness.
> > I seem to remember that being a concern in the first attempt at turning
> > off vzeroupper, years ago.  Something about the consistency of behavior
> > of code in a library, IIRC, when caller and callee were compiled for
> > different CPUs and did not have the same concept of whether the upper
> > parts had been zeroed.  Sorry I don't remember the specifics better than
> > that, and I certainly don't know enough about the microarchitectural
> > details to say one way or the other.
>
>
> My understanding is that this should only affect performance.
>
> The problem is when you mix legacy SSE instructions with AVX instructions.
> Legacy SSE instructions do not affect the upper 128 bits of the YMM
> registers. This may cause false dependencies due to partial register writes.
>
> So, if a library is built for a non-AVX CPU (or if the library cannot
> avoid using legacy SSE code), the absence of vzeroupper in the code has the
> potential of causing stalls due to false dependencies (when there is an
> AVX-SSE transition).
>

It isn't about false dependencies per se. It is about a specific behavior
of certain Intel microarchitectures. You can read more about it here:
https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties

Basically, the expensive part of it is that the chip saves and restores
the upper halves of the ymm registers when transitioning between 256-bit
AVX code and legacy (non-VEX) SSE code. vzeroupper (well, and I assume
vzeroall too) is the only way to communicate to the processor "don't bother
to save that state" (see the article). Even `XOR reg,reg` doesn't work to
communicate this according to the article.

So basically the relevant Intel microarchitectures take a performance
penalty much greater than simply a loss of ILP due to false dependencies.
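
To make the save/restore behavior concrete, here is a rough Python sketch of
the state machine as I read it from the article. It is a simplified model,
not a measurement: the state names, the instruction kinds, and counting each
save or restore as one "penalty" are all my own assumptions for illustration.

```python
# Simplified model of the AVX/SSE transition states described in the Intel
# article linked above. State names and the unit "penalty" per save/restore
# are illustrative assumptions, not hardware-accurate costs.

CLEAN = "clean"   # upper halves of the ymm registers known to be zero
DIRTY = "dirty"   # upper halves may hold live data (after 256-bit AVX)
SAVED = "saved"   # upper halves were saved aside by a legacy SSE instruction


def count_penalties(instrs):
    """Count upper-half save/restore events for a sequence of instruction kinds.

    Each element of `instrs` is one of (hypothetical labels):
    "sse", "avx128", "avx256", "vzeroupper".
    """
    state = CLEAN
    penalties = 0
    for kind in instrs:
        if kind == "vzeroupper":
            # Tells the processor there is no upper state worth keeping.
            state = CLEAN
        elif kind == "avx256":
            if state == SAVED:
                penalties += 1  # restore the saved upper halves
            state = DIRTY
        elif kind == "avx128":
            if state == SAVED:
                penalties += 1  # any VEX instruction forces a restore
                state = DIRTY
            # In CLEAN or DIRTY the state is unchanged.
        elif kind == "sse":
            if state == DIRTY:
                penalties += 1  # save the upper halves
                state = SAVED
            # In CLEAN or SAVED the state is unchanged.
        else:
            raise ValueError(f"unknown instruction kind: {kind!r}")
    return penalties


if __name__ == "__main__":
    mixed = ["avx256", "sse", "avx256"]
    fenced = ["avx256", "vzeroupper", "sse", "avx256"]
    print(count_penalties(mixed))   # one save plus one restore
    print(count_penalties(fenced))  # vzeroupper avoids both
```

In this model a clean-to-SSE step costs nothing, which is why inserting
vzeroupper before the legacy SSE code removes the penalty entirely.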

I assume that this is done for similar reasons to how Jaguar has a side
cache for storing x87 registers (frees up space in the PRF or other
resources).

>
> On AMD Fam 15h processors (and Btver2) there is no penalty for AVX-SSE
> transitions. This is an important difference with respect to Intel
> processors where, for each SSE-AVX transition, the hardware saves and
> restores the upper 128 bits of the YMM registers. I think that is the
> reason why on Intel, vzeroupper is very fast, while on btver2 vzeroupper is
> microcoded (and extremely slow!).
> Also, (since Fam 15) AMD processors implement an XMM register merge
> optimization; the hardware keeps track of XMM registers whose upper
> portions have been cleared to zeros.
>

It is somewhat misleading to say "since Fam 15" (I assume you mean 15h,
which is Bulldozer; decimal 15 is K8, which is way old and doesn't even have
256-bit vectors), since the AMD microarchitectures don't have a linear
history. For example, Jaguar (16h) is a successor of Bobcat (14h). Jaguar is
pretty much completely different from Bulldozer (15h). Bulldozer (15h) is
the successor of K10 (10h).
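
A tiny Python sketch of the numbering point, using only the families named
in this mail. The table and the helper names are hypothetical, just for
illustration; it is nowhere near a complete list of AMD families.

```python
# Family numbers are hexadecimal, so "15" and "15h" are different families.
# Only the families mentioned in this mail are listed (illustrative table).
AMD_FAMILIES = {
    0x0F: "K8",
    0x10: "K10",
    0x14: "Bobcat",
    0x15: "Bulldozer",
    0x16: "Jaguar",
}

# Predecessor relationships as stated above; note they are not ordered by
# family number, which is why "since family N" is ambiguous.
PREDECESSOR = {
    "Jaguar": "Bobcat",
    "Bulldozer": "K10",
}


def family_name(family):
    """Return the name for a numeric family value, or "unknown"."""
    return AMD_FAMILIES.get(family, "unknown")


def lineage(name):
    """Walk the predecessor table from `name` back as far as it goes."""
    chain = [name]
    while chain[-1] in PREDECESSOR:
        chain.append(PREDECESSOR[chain[-1]])
    return chain


if __name__ == "__main__":
    print(family_name(15))    # decimal 15 is 0Fh
    print(family_name(0x15))  # 15h is decimal 21
    print(lineage("Jaguar"))
```

The lineage of Jaguar never passes through Bulldozer, even though 15h sits
numerically between 14h and 16h.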

-- Sean Silva

>
>
> http://reviews.llvm.org/D16837
>
>
>
>