[PATCH] D41723: Introduce the "retpoline" x86 mitigation technique for variant #2 of the speculative execution vulnerabilities disclosed today, specifically identified by CVE-2017-5715, "Branch Target Injection", and is one of the two halves to Spectre.

Thu Jan 4 13:02:10 PST 2018

ruiu added inline comments.

================
Comment at: lld/ELF/Arch/X86.cpp:491
+      0x83, 0xc4, 0x04,             // next: add $4, %esp
+      0x87, 0x04, 0x24,             //   xchg %eax, (%esp)
+      0xc3,                         //   ret
----------------
rnk wrote:
> chandlerc wrote:
> > Does it make sense to use something like the `pushl` sequence Reid came up with here? In the non-PLT case it looks like:
> > 
> > ```
> >   addl $4, %esp
> >   pushl 4(%esp)
> >   pushl 4(%esp)
> >   popl 8(%esp)
> >   popl (%esp)
> > ```
> > 
> > So it would potentially need to be done a touch differently to work here, but maybe something like that rather than `xchg`?
> > 
> > Even if the alternative is a lot more instructions, the `xchg` instruction is a locked instruction on x86 and so this will actually create synchronization overhead on the cache line of the top of the stack.
> This is a real concern, I checked the Intel manual, and it says:
> 
> > Exchanges the contents of the destination (first) and source (second) operands. The operands can be two general purpose
> > registers or a register and a memory location. If a memory operand is referenced, the processor’s locking
> > protocol is automatically implemented for the duration of the exchange operation, regardless of the presence or
> > absence of the LOCK prefix or of the value of the IOPL.
> 
> One question is, do we want to try to avoid `PUSH/POP MEM` instructions? LLVM has x86 subtarget features that suggest that these instructions are slow on some CPU models.
> 
> To avoid them completely, we could use this code sequence:
> ```
> movl %ecx, (%esp) # save ECX over useless RA
> movl 4(%esp), %ecx # load original EAX to ECX
> movl %eax, 4(%esp) # store callee over saved EAX
> movl %ecx, %eax # restore EAX
> popl %ecx # restore ECX
> retl # return to callee
> ```
> 
> When it comes down to it, this just implements `xchg` with a scratch register.
> 
> On reflection, I think the code sequence above is the best we can do. The PUSH sequence you describe is 8 memory operations vs 4 if we use ECX as scratch.
I didn't know that `xchg` was so slow. There's no reason not to use push/pop instructions to swap the word at the top of the stack with a register. Since this is a PLT header (and not a PLT entry), the size of the instructions doesn't really matter. Do you want me to update my patch?
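
To double-check that the scratch-register sequence quoted above behaves like `add`/`xchg`/`ret`, here is a minimal sketch in Python. It is a toy model, not lld code: the symbolic values, the 0x1000 stack address, and the function names are all made up for illustration. It assumes that on entry to `next` the stack holds the unused return address on top of the saved EAX, and that EAX holds the callee address, as the comments in the quoted sequence describe.

```python
# Toy model of the 32-bit PLT-header tail. Hypothetical names and values;
# registers and stack slots hold symbolic strings instead of real words.

def make_state():
    # State on entry to `next`: the call pushed a return address (RA) that is
    # never used, and the slot above it holds the caller's saved EAX.
    mem = {0x1000: "RA", 0x1004: "orig_eax"}
    regs = {"eax": "callee", "ecx": "orig_ecx", "esp": 0x1000}
    return regs, mem

def ret(regs, mem):
    # ret: pop the word at the top of the stack and jump to it.
    target = mem[regs["esp"]]
    regs["esp"] += 4
    return target

def xchg_version(regs, mem):
    regs["esp"] += 4                      # add $4, %esp
    esp = regs["esp"]
    regs["eax"], mem[esp] = mem[esp], regs["eax"]  # xchg %eax, (%esp)
    return ret(regs, mem)                 # ret

def scratch_version(regs, mem):
    esp = regs["esp"]
    mem[esp] = regs["ecx"]                # movl %ecx, (%esp)
    regs["ecx"] = mem[esp + 4]            # movl 4(%esp), %ecx
    mem[esp + 4] = regs["eax"]            # movl %eax, 4(%esp)
    regs["eax"] = regs["ecx"]             # movl %ecx, %eax
    regs["ecx"] = mem[esp]                # popl %ecx
    regs["esp"] += 4
    return ret(regs, mem)                 # retl

def run(version):
    # Run one version on a fresh state; return (jump target, final registers).
    regs, mem = make_state()
    target = version(regs, mem)
    return target, regs

print(run(xchg_version) == run(scratch_version))
```

In this model both versions return to the callee with EAX and ECX restored and ESP advanced by 8. They differ only in the dead stack slot below the final ESP. The model says nothing about locking or timing, which is the actual reason to prefer one over the other.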

================
Comment at: lld/ELF/Arch/X86_64.cpp:517-525
+      0x4c, 0x8b, 0x1d, 0, 0, 0, 0, //   mov foo@GOTPLT(%rip), %r11
+      0xe8, 0x04, 0x00, 0x00, 0x00, //   callq next
+      0xf3, 0x90,                   // loop: pause
+      0xeb, 0xfc,                   //   jmp loop; .align 16
+      0x4c, 0x89, 0x1c, 0x24,       // next: mov %r11, (%rsp)
+      0xc3,                         //   ret
+      0x68, 0, 0, 0, 0,             //   pushq <relocation index>
----------------
Chandler,

I also noticed that we can improve the instructions here. With the following sequence, the jump target for lazily resolving the PLT entry is aligned to 16 bytes. I can make that change now if you want.

    mov foo@GOTPLT(%rip), %r11
    callq next
  loop: pause
    jmp plt+32; .align 16
    pushq <relocation index> // lazy-resolve a PLT entry
    jmpq plt[0]
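
To make the alignment point concrete, here is a small Python sketch that computes byte offsets within one PLT entry. The sizes for the current entry are read off the byte listing quoted above. The helper names are hypothetical, and the size of `jmp plt+32` in the proposed sequence is an assumption (2 bytes for a short jump, 5 for a near jump), so both cases are tried.

```python
# Instruction sizes for the current entry, taken from the quoted byte listing.
CURRENT = [
    ("mov", 7),    # 4c 8b 1d + 4-byte displacement
    ("callq", 5),  # e8 + rel32
    ("pause", 2),  # f3 90 (label `loop`)
    ("jmp", 2),    # eb fc
    ("next", 4),   # 4c 89 1c 24 (mov %r11, (%rsp))
    ("ret", 1),    # c3
    ("pushq", 5),  # 68 + 4-byte relocation index
]

def offsets(entry):
    """Map each instruction name to its byte offset from the entry start."""
    result, pos = {}, 0
    for name, size in entry:
        result[name] = pos
        pos += size
    return result

def align_up(pos, alignment):
    """Round pos up to the next multiple of alignment (models `.align`)."""
    return (pos + alignment - 1) // alignment * alignment

current = offsets(CURRENT)

# Branch displacements are relative to the end of the branch instruction.
call_target = current["callq"] + 5 + 0x04         # rel32 = 4
jmp_target = current["jmp"] + 2 + (0xFC - 0x100)  # rel8 = 0xfc, i.e. -4

def proposed_pushq_offset(jmp_size):
    # Proposed order: mov, callq, pause, jmp, then `.align 16`, then pushq.
    # jmp_size is an assumption; the proposal does not give the encoding.
    return align_up(7 + 5 + 2 + jmp_size, 16)

print("current pushq offset:", current["pushq"])
for size in (2, 5):
    print("proposed pushq offset with", size, "byte jmp:",
          proposed_pushq_offset(size))
```

With the sizes from the listing, `pushq` currently sits at offset 21, which is not a multiple of 16. With the `.align 16` in the proposed sequence it lands on a 16-byte boundary under either assumed jump encoding.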

https://reviews.llvm.org/D41723