[PATCH] D41723: Introduce the "retpoline" x86 mitigation technique for variant #2 of the speculative execution vulnerabilities disclosed today, specifically identified by CVE-2017-5715, "Branch Target Injection", and is one of the two halves to Spectre..
Rui Ueyama via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Thu Jan 4 13:02:10 PST 2018
ruiu added inline comments.
================
Comment at: lld/ELF/Arch/X86.cpp:491
+ 0x83, 0xc4, 0x04, // next: add $4, %esp
+ 0x87, 0x04, 0x24, // xchg %eax, (%esp)
+ 0xc3, // ret
----------------
rnk wrote:
> chandlerc wrote:
> > Does it make sense to use something like the `pushl` sequence Reid came up with here? In the non-PLT case it looks like:
> >
> > ```
> > addl $4, %esp
> > pushl 4(%esp)
> > pushl 4(%esp)
> > popl 8(%esp)
> > popl (%esp)
> > ```
> >
> > So it would potentially need to be done a touch differently to work here, but maybe something like that rather than `xchg`?
> >
> > Even if the alternative is a lot more instructions, the `xchg` instruction is a locked instruction on x86 and so this will actually create synchronization overhead on the cache line of the top of the stack.
> This is a real concern, I checked the Intel manual, and it says:
>
> > Exchanges the contents of the destination (first) and source (second) operands. The operands can be two general purpose
> > registers or a register and a memory location. If a memory operand is referenced, the processor’s locking
> > protocol is automatically implemented for the duration of the exchange operation, regardless of the presence or
> > absence of the LOCK prefix or of the value of the IOPL.
>
> One question is, do we want to try to avoid `PUSH/POP MEM` instructions? LLVM has x86 subtarget features that suggest that these instructions are slow on some CPU models.
>
> To avoid them completely, we could use this code sequence:
> ```
> movl %ecx, (%esp) # save ECX over useless RA
> movl 4(%esp), %ecx # load original EAX to ECX
> movl %eax, 4(%esp) # store callee over saved EAX
> movl %ecx, %eax # restore EAX
> popl %ecx # restore ECX
> retl # return to callee
> ```
>
> When it comes down to it, this just implements `xchg` with a scratch register.
>
> On reflection, I think the code sequence above is the best we can do. The PUSH sequence you describe is 8 memory operations vs 4 if we use ECX as scratch.
I didn't know that xchg is so slow. There's no reason not to use push/pop instructions to swap a word at the stack top and a register. Since this is a PLT header (and not a PLT entry), the size of the instrcutions doesn't really matter. Do you want me to update my patch?
================
Comment at: lld/ELF/Arch/X86_64.cpp:517-525
+ 0x4c, 0x8b, 0x1d, 0, 0, 0, 0, // mov foo at GOTPLT(%rip), %r11
+ 0xe8, 0x04, 0x00, 0x00, 0x00, // callq next
+ 0xf3, 0x90, // loop: pause
+ 0xeb, 0xfc, // jmp loop; .align 16
+ 0x4c, 0x89, 0x1c, 0x24, // next: mov %r11, (%rsp)
+ 0xc3, // ret
+ 0x68, 0, 0, 0, 0, // pushq <relocation index>
----------------
Chander,
I also noticed we can improve instructions here. We can use the following instructions instead so that the jump target to lazy-resolve PLT is aligned to 16 byte. I can make a change now if you want.
mov foo at GOTPLT(%rip), %r11
callq next
loop: pause
jmp plt+32; .align 16
pushq <relocation index> // lazy-resolve a PLT entry
jmpq plt[0]
https://reviews.llvm.org/D41723
More information about the llvm-commits
mailing list