[PATCH] D41723: Introduce the "retpoline" x86 mitigation technique for variant #2 of the speculative execution vulnerabilities disclosed today, specifically identified by CVE-2017-5715, "Branch Target Injection", which is one of the two halves of Spectre.

Paul Robinson via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Jan 8 12:00:53 PST 2018


probinson added a comment.

In https://reviews.llvm.org/D41723#969167, @chandlerc wrote:

> But retpoline doesn't make every indirection more expensive any more or less than zapping the predictor... You only build the code running in the privileged domain with retpoline, not all of the code, and they both accomplish very similar things.


I do understand that it applies only to privileged code.
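
For reference (and mostly to check my own understanding), the transformation is roughly this: each indirect branch through a register becomes a direct call into a small thunk that parks any speculation in a harmless loop and then redirects the architectural return to the real target.  A minimal sketch, written as C top-level asm; the thunk and label names are illustrative, not necessarily what the patch emits:

    /* Illustrative retpoline thunk for an indirect call through %r11.
     * Names are made up for the example; see the patch for the real thunks. */
    __asm__(
        ".text\n"
        ".globl example_retpoline_r11\n"
        "example_retpoline_r11:\n"
        "  call 1f\n"           /* pushes the address of the capture loop as the return address */
        "0:\n"
        "  pause\n"             /* any speculation of the ret spins harmlessly here */
        "  lfence\n"
        "  jmp 0b\n"
        "1:\n"
        "  mov %r11, (%rsp)\n"  /* overwrite the pushed return address with the real target */
        "  ret\n"               /* architecturally branches to *%r11; the RSB predicts the loop */
    );

The extra architectural work per branch is just the call/mov/ret, which is the "very, very few cycles" part; the real cost is that the final ret is effectively always mispredicted, i.e. cost #1 in the list quoted below.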

> The performance difference we see between something like retpoline and disabling the predictor on context switches is very significant (retpoline is much, much cheaper).

I expect you are measuring this in a normal timesharing environment, which is not what we have (more below).

> A good way to think about the cost of these things is this. The cost of retpoline we have observed on the kernel:
> 
> 1. the cost of executing the system call with "broken" indirect branch prediction (IE, reliably mispredicted), plus
> 2. the cost of few extra instructions (very, very few cycles)
> 
>   Both of these are very effectively mitigated by efforts to remove hot indirect branches from the system call code in the kernel. Because of the nature of most kernels, this tends to be pretty easy and desirable for performance anyways.

Right.
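
On the point about removing hot indirect branches: I assume this means the usual likely-target specialization, along the lines of the following hypothetical example (not taken from the kernel or from this patch):

    /* Hypothetical hot-path specialization: test for the common target and
     * call it directly, so only the rare case pays the retpoline penalty.  */
    typedef long (*handler_fn)(long arg);

    static long common_handler(long arg) { return arg + 1; }  /* stand-in for the hot target */

    static long dispatch(handler_fn fn, long arg)
    {
        if (fn == common_handler)
            return common_handler(arg);  /* direct call: predicted normally, no thunk */
        return fn(arg);                  /* rare case: indirect call via the retpoline thunk */
    }

With retpoline enabled only the indirect call is rewritten; the direct call is untouched, which is why this kind of rewrite recovers most of the cost on hot paths.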

> By comparison, the cost of toggling off the predictor is:
> 
> 1. the exact same cost as #1 above, plus
> 2. the cost of toggling the MSR on every context switch
> 
>   This second cost, very notably, cannot be meaningfully mitigated by PGO, or hand-implemented hot-path specializations without an indirect branch. And our measurements on Intel hardware at least show that this cost of toggling is actually the dominant cost by a very large margin.
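
(For concreteness, my understanding of the toggle being described, going by Intel's public IA32_SPEC_CTRL/IBRS documentation, is a serializing wrmsr on each transition, roughly as sketched below.  The MSR number and bit are assumptions on my part, and our AMD parts expose different controls.)

    #include <stdint.h>

    /* Assumed values from Intel's public SPEC_CTRL documentation; AMD's
     * controls are different, so check the vendor guidance.             */
    #define MSR_IA32_SPEC_CTRL  0x48u
    #define SPEC_CTRL_IBRS      (1ull << 0)

    /* Ring 0 only: wrmsr takes the MSR index in ECX and the value in EDX:EAX. */
    static inline void wrmsr(uint32_t msr, uint64_t val)
    {
        __asm__ volatile("wrmsr"
                         :
                         : "c"(msr), "a"((uint32_t)val), "d"((uint32_t)(val >> 32))
                         : "memory");
    }

    /* The per-transition cost being discussed: set IBRS when entering the
     * protected domain, clear it on the way out; wrmsr itself is slow.    */
    static inline void ibrs_enter(void) { wrmsr(MSR_IA32_SPEC_CTRL, SPEC_CTRL_IBRS); }
    static inline void ibrs_exit(void)  { wrmsr(MSR_IA32_SPEC_CTRL, 0); }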

As I said above, I expect you are measuring this on a normal timesharing system, and a game console is not a normal timesharing environment.  We reserve a core for the system and the game gets the rest of them.  My understanding (which could be very flawed) is that the game is a single process and essentially never gets process-context-switched, although it probably does get thread-context-switched, which should still be cheap since there is no protection-domain transition involved.  If there are basically no process context switches while a game is running (which is when performance matters most), then moving even a small in-process execution cost to a larger context-switch cost is a good thing, not a bad thing.  We also have hard-real-time constraints to worry about.

> So, you should absolutely measure the impact of the AMD solutions you have on your AMD hardware as it may be very significantly different. But I wanted to set the expectation correctly based on what limited experience we have so far (sadly only on Intel hardware).

I appreciate all the work you've done on this, and your sharing of your findings.  Unfortunately we are playing catch-up on our side, as we were unaware of the problem before the public announcements and the publication of this patch.  We're definitely doing our own research and measurements to figure out the best way forward.

(If it wasn't obvious, I'm not standing in the way of the patch; I'm just noting the AMD angle, which can clearly be handled as a follow-up if it turns out to be the better choice there.)


https://reviews.llvm.org/D41723