[PATCH] D41723: Introduce the "retpoline" x86 mitigation technique for variant #2 of the speculative execution vulnerabilities disclosed today, specifically identified by CVE-2017-5715, "Branch Target Injection", and is one of the two halves to Spectre.

Thu Jan 4 10:21:58 PST 2018

rnk added inline comments.

================
Comment at: lld/ELF/Arch/X86.cpp:491
+      0x83, 0xc4, 0x04,             // next: add $4, %esp
+      0x87, 0x04, 0x24,             //   xchg %eax, (%esp)
+      0xc3,                         //   ret
----------------
chandlerc wrote:
> Does it make sense to use something like the `pushl` sequence Reid came up with here? In the non-PLT case it looks like:
> 
> ```
>   addl $4, %esp
>   pushl 4(%esp)
>   pushl 4(%esp)
>   popl 8(%esp)
>   popl (%esp)
> ```
> 
> So it would potentially need to be done a touch differently to work here, but maybe something like that rather than `xchg`?
> 
> Even if the alternative is a lot more instructions, the `xchg` instruction is a locked instruction on x86 and so this will actually create synchronization overhead on the cache line of the top of the stack.
This is a real concern. I checked the Intel manual, and it says:

> Exchanges the contents of the destination (first) and source (second) operands. The operands can be two general purpose
> registers or a register and a memory location. If a memory operand is referenced, the processor’s locking
> protocol is automatically implemented for the duration of the exchange operation, regardless of the presence or
> absence of the LOCK prefix or of the value of the IOPL.

One question is whether we want to avoid `PUSH/POP MEM` instructions. LLVM has x86 subtarget features that suggest these instructions are slow on some CPU models.

To avoid them completely, we could use this code sequence:
```
movl %ecx, (%esp) # save ECX over useless RA
movl 4(%esp), %ecx # load original EAX to ECX
movl %eax, 4(%esp) # store callee over saved EAX
movl %ecx, %eax # restore EAX
popl %ecx # restore ECX
retl # return to callee
```

When it comes down to it, this just implements `xchg` with a scratch register.
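
As a sanity check on that equivalence, here is a toy Python model of the two thunk tails. It is only a sketch, not lld code: the function names `xchg_tail` and `mov_tail` are made up for illustration, and the entry state (useless RA at `(%esp)`, saved EAX at `4(%esp)`, callee address in EAX) is an assumption inferred from the comments in the sequence above.

```python
# Toy model of the 32-bit retpoline thunk tail. Not lld code.
# Assumed entry state: (%esp) = useless RA, 4(%esp) = saved EAX,
# %eax = callee address, %ecx = live caller value.
# The stack is a dict keyed by byte offset from the initial %esp.

def xchg_tail(eax, ecx, ra, saved_eax):
    """Models: addl $4, %esp; xchg %eax, (%esp); ret"""
    stack = {0: ra, 4: saved_eax}
    esp = 0
    esp += 4                                  # addl $4, %esp
    eax, stack[esp] = stack[esp], eax         # xchg %eax, (%esp)
    target = stack[esp]                       # ret: pop the return target
    esp += 4
    return {"eax": eax, "ecx": ecx, "esp": esp, "target": target}

def mov_tail(eax, ecx, ra, saved_eax):
    """Models the scratch-register sequence; also counts the memory
    operations issued before the final retl."""
    stack = {0: ra, 4: saved_eax}
    esp = 0
    mem_ops = 0
    stack[esp] = ecx                          # movl %ecx, (%esp)
    mem_ops += 1
    ecx = stack[esp + 4]                      # movl 4(%esp), %ecx
    mem_ops += 1
    stack[esp + 4] = eax                      # movl %eax, 4(%esp)
    mem_ops += 1
    eax = ecx                                 # movl %ecx, %eax
    ecx = stack[esp]                          # popl %ecx
    esp += 4
    mem_ops += 1
    target = stack[esp]                       # retl: pop the return target
    esp += 4
    state = {"eax": eax, "ecx": ecx, "esp": esp, "target": target}
    return state, mem_ops
```

Under those assumptions, both tails leave the original EAX and ECX restored, pop the same amount of stack, and return to the callee address that was in EAX on entry.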

On reflection, I think the code sequence above is the best we can do. The PUSH sequence you describe performs 8 memory operations (each `pushl`/`popl` with a memory operand does one load and one store), versus 4 if we use ECX as scratch.

https://reviews.llvm.org/D41723