[PATCH] D41723: Introduce the "retpoline" x86 mitigation technique for variant #2 of the speculative execution vulnerabilities disclosed today, specifically identified by CVE-2017-5715, "Branch Target Injection", and is one of the two halves to Spectre.
Reid Kleckner via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Thu Jan 4 10:21:58 PST 2018
rnk added inline comments.
Comment at: lld/ELF/Arch/X86.cpp:491
+ 0x83, 0xc4, 0x04, // next: add $4, %esp
+ 0x87, 0x04, 0x24, // xchg %eax, (%esp)
+ 0xc3, // ret
> Does it make sense to use something like the `pushl` sequence Reid came up with here? In the non-PLT case it looks like:
> addl $4, %esp
> pushl 4(%esp)
> pushl 4(%esp)
> popl 8(%esp)
> popl (%esp)
> So it would potentially need to be done a touch differently to work here, but maybe something like that rather than `xchg`?
> Even if the alternative is a lot more instructions, the `xchg` instruction is a locked instruction on x86 and so this will actually create synchronization overhead on the cache line of the top of the stack.
This is a real concern. I checked the Intel manual, and it says:
> Exchanges the contents of the destination (first) and source (second) operands. The operands can be two general purpose
> registers or a register and a memory location. If a memory operand is referenced, the processor’s locking
> protocol is automatically implemented for the duration of the exchange operation, regardless of the presence or
> absence of the LOCK prefix or of the value of the IOPL.
One question is whether we want to avoid `PUSH/POP MEM` instructions. LLVM has x86 subtarget features that suggest these instructions are slow on some CPU models.
To avoid them completely, we could use this code sequence:
movl %ecx, (%esp) # save ECX over useless RA
movl 4(%esp), %ecx # load original EAX to ECX
movl %eax, 4(%esp) # store callee over saved EAX
movl %ecx, %eax # restore EAX
popl %ecx # restore ECX
retl # return to callee
When it comes down to it, this just implements `xchg` with a scratch register.
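As a sanity check on that claim, here is a minimal Python sketch that models the stack as a dictionary and runs both tails from the same starting state. The function names, addresses, and register values are illustrative assumptions, not anything from the patch. The starting layout (callee in EAX, saved EAX just below the useless return address) is inferred from the comments in the sequence above.

```python
# Hypothetical values chosen only so each one is easy to recognize.
CALLEE, ORIG_EAX, ORIG_ECX, JUNK_RA = 0xCA11EE, 0xAAAA, 0xCCCC, 0xDEAD


def initial_state():
    """State at `next:`: EAX = callee, (%esp) = useless RA, 4(%esp) = saved EAX."""
    regs = {"eax": CALLEE, "ecx": ORIG_ECX, "esp": 0x1000}
    mem = {0x1000: JUNK_RA, 0x1004: ORIG_EAX}
    return regs, mem


def _ret(regs, mem):
    """ret: pop the branch target off the stack."""
    target = mem[regs["esp"]]
    regs["esp"] += 4
    return target


def run_xchg_tail(regs, mem):
    """add $4, %esp ; xchg %eax, (%esp) ; ret"""
    regs, mem = dict(regs), dict(mem)
    regs["esp"] += 4                          # add $4, %esp (drop useless RA)
    top = regs["esp"]
    regs["eax"], mem[top] = mem[top], regs["eax"]  # xchg %eax, (%esp)
    return _ret(regs, mem), regs


def run_mov_tail(regs, mem):
    """The scratch-register sequence that uses ECX instead of xchg."""
    regs, mem = dict(regs), dict(mem)
    sp = regs["esp"]
    mem[sp] = regs["ecx"]                     # movl %ecx, (%esp)
    regs["ecx"] = mem[sp + 4]                 # movl 4(%esp), %ecx
    mem[sp + 4] = regs["eax"]                 # movl %eax, 4(%esp)
    regs["eax"] = regs["ecx"]                 # movl %ecx, %eax
    regs["ecx"] = mem[sp]                     # popl %ecx
    regs["esp"] += 4
    return _ret(regs, mem), regs              # retl


if __name__ == "__main__":
    regs, mem = initial_state()
    print(run_xchg_tail(regs, mem) == run_mov_tail(regs, mem))
```

In this toy model both tails branch to the callee with EAX, ECX, and ESP in the same final state. The only difference is which dead stack slot gets overwritten along the way.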
On reflection, I think the code sequence above is the best we can do. The PUSH sequence you describe takes 8 memory operations, versus 4 if we use ECX as scratch.
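The 8 versus 4 figure can be reproduced with a rough tally. The sketch below assumes a simple cost model that is mine rather than the patch's: a `pushl` or `popl` with a memory operand counts as one load plus one store, a `movl` to or from memory counts as one access, and the final `ret` is left out because both variants end with it. The table names are hypothetical.

```python
# Each entry is (instruction, loads, stores) under the assumed cost model.
PUSH_SEQUENCE = [
    ("addl $4, %esp", 0, 0),    # register-only arithmetic
    ("pushl 4(%esp)", 1, 1),    # load the operand, store it to the stack
    ("pushl 4(%esp)", 1, 1),
    ("popl 8(%esp)", 1, 1),     # load from the stack, store to the operand
    ("popl (%esp)", 1, 1),
]

MOV_SEQUENCE = [
    ("movl %ecx, (%esp)", 0, 1),
    ("movl 4(%esp), %ecx", 1, 0),
    ("movl %eax, 4(%esp)", 0, 1),
    ("movl %ecx, %eax", 0, 0),  # register to register, no memory access
    ("popl %ecx", 1, 0),        # load from the stack into a register
]


def memory_ops(sequence):
    """Total loads plus stores for a sequence under the model above."""
    return sum(loads + stores for _, loads, stores in sequence)


if __name__ == "__main__":
    print("push/pop sequence:", memory_ops(PUSH_SEQUENCE))
    print("mov sequence:", memory_ops(MOV_SEQUENCE))
```

This only counts accesses. It says nothing about latency or how a given CPU model handles `PUSH/POP MEM`.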