[PATCH] D41723: Introduce the "retpoline" x86 mitigation technique for variant #2 of the speculative execution vulnerabilities disclosed today, specifically identified by CVE-2017-5715, "Branch Target Injection", and is one of the two halves to Spectre.

Fri Jan 5 21:06:22 PST 2018

chandlerc added a comment.

In https://reviews.llvm.org/D41723#969143, @probinson wrote:

> In https://reviews.llvm.org/D41723#968977, @chandlerc wrote:
>
> > In https://reviews.llvm.org/D41723#968248, @ddibyend wrote:
> >
> > > For AMD processors we may be able to handle indirect jumps via a simpler lfence mechanism. Indirect calls may still require retpoline. If this turns out to be the right solution for AMD processors we may need to put some code in to support this.
> >
> >
> > Yeah, if it ends up that we want non-retpoline mitigations for AMD we can and should add them. One hope I have is that this patch is at least generically *sufficient* (when paired with correct RSB filling) even if it is suboptimal in some cases and we end up adding more precise tools later.
>
>
> Just to say that at Sony we're still doing our investigation and might be interested in lfence.  But who knows, we might just zap the predictor on syscalls and context switches; for environments that have mostly a few long-running processes with comparatively few syscalls it might be net cheaper than making every indirection more expensive.

But retpoline doesn't make every indirection more expensive, any more than zapping the predictor does. You only build the code running in the privileged domain with retpoline, not all of the code, and the two approaches accomplish very similar things.

The performance difference we see between something like retpoline and disabling the predictor on context switches is very significant (retpoline is much, much cheaper).

A good way to think about the cost of these things is this. The cost of retpoline that we have observed on the kernel is:

1. the cost of executing the system call with "broken" indirect branch prediction (i.e., reliably mispredicted), plus
2. the cost of a few extra instructions (very, very few cycles)

Both of these are very effectively mitigated by efforts to remove hot indirect branches from the system call code in the kernel. Because of the nature of most kernels, this tends to be pretty easy and desirable for performance anyways.
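To make "removing a hot indirect branch" concrete, here is a minimal sketch in C. The handler type and the two handlers are hypothetical names invented for illustration; they are not from the patch or from any kernel. The idea is only that the common target is checked for explicitly and called directly, so the indirect call (and the thunk it goes through in a retpoline build) is left for the uncommon case.

```c
/* Hypothetical handler type and handlers, for illustration only. */
typedef long (*handler_fn)(long arg);

static long hot_handler(long arg)  { return arg + 1; }
static long cold_handler(long arg) { return arg * 2; }

/*
 * Dispatch with the hot target specialized by hand. The common case is
 * a pointer compare followed by a direct call; only the uncommon case
 * reaches the indirect call.
 */
static long dispatch(handler_fn fn, long arg)
{
    if (fn == hot_handler)
        return hot_handler(arg);  /* direct call: no indirect branch */
    return fn(arg);               /* indirect call: routed through the
                                     thunk in a retpoline build */
}
```

Both `dispatch(hot_handler, x)` and `dispatch(cold_handler, x)` return the same results as calling the handlers directly; only the kind of branch taken differs. PGO-driven indirect call promotion can produce a similar shape automatically.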

By comparison, the cost of toggling off the predictor is:

1. the exact same cost as #1 above, plus
2. the cost of toggling the MSR on every context switch

This second cost, very notably, cannot be meaningfully mitigated by PGO or by hand-implemented hot-path specializations that avoid an indirect branch. And our measurements, on Intel hardware at least, show that this cost of toggling is actually the dominant cost by a very large margin.
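The two cost breakdowns above can be written down as a toy model. This is only a sketch under stated assumptions: the struct, its field names, and the idea of counting costs per entry into the privileged domain are mine, and every value fed into it is a placeholder, not a measurement.

```c
/* Toy cost model; all fields are hypothetical and measured in cycles. */
struct path_costs {
    unsigned long indirect_branches;  /* indirect branches executed on the path */
    unsigned long mispredict_cost;    /* per-branch cost of a reliable mispredict */
    unsigned long thunk_cost;         /* per-branch cost of the extra retpoline instructions */
    unsigned long msr_toggle_cost;    /* fixed cost of toggling the MSR */
};

/* Retpoline: both terms scale with the number of indirect branches. */
static unsigned long retpoline_cost(const struct path_costs *c)
{
    return c->indirect_branches * (c->mispredict_cost + c->thunk_cost);
}

/* Predictor toggling: same mispredict term, plus a fixed toggle cost. */
static unsigned long toggle_cost(const struct path_costs *c)
{
    return c->indirect_branches * c->mispredict_cost + c->msr_toggle_cost;
}
```

In this model, driving `indirect_branches` toward zero drives `retpoline_cost` toward zero, while `toggle_cost` never drops below `msr_toggle_cost`. That is the asymmetry described above; which approach wins on a given machine still depends on real measurements.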

So, you should absolutely measure the impact of the AMD solutions you have on your AMD hardware, as it may differ very significantly. But I wanted to set the expectation correctly based on what limited experience we have so far (sadly, only on Intel hardware).

https://reviews.llvm.org/D41723