[PATCH] D41723: Introduce the "retpoline" x86 mitigation technique for variant #2 of the speculative execution vulnerabilities disclosed today, specifically identified by CVE-2017-5715, "Branch Target Injection", one of the two halves of Spectre.

Rafael Avila de Espindola via llvm-commits llvm-commits@lists.llvm.org
Thu Jan 4 18:54:22 PST 2018


Chandler Carruth via Phabricator via llvm-commits
<llvm-commits@lists.llvm.org> writes:

> The goal is simple: avoid generating code which contains an indirect
> branch that could have its prediction poisoned by an attacker. In many
> cases, the compiler can simply use directed conditional branches and
> a small search tree. LLVM already has support for lowering switches in
> this way and the first step of this patch is to disable jump-table
> lowering of switches and introduce a pass to rewrite explicit indirectbr
> sequences into a switch over integers.

Amusingly, that is what clang did before r86256.

> There is one other important source of indirect branches in x86 ELF
> binaries: the PLT. These patches also include support for LLD to
> generate PLT entries that perform a retpoline-style indirection.

I see that we now generate similar code for -fno-plt. We should have a
test for that (nonlazybind at the LLVM IR level).
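
Something along these lines, perhaps (assuming the +retpoline target
feature spelling from the patch; the exact CHECK lines would depend on
how the call ends up being lowered):

  ; RUN: llc -mtriple=x86_64-unknown-linux -mattr=+retpoline < %s | FileCheck %s

  ; -fno-plt marks external calls nonlazybind; the property to check is
  ; that the lowered call does not become a plain indirect call.
  declare void @foo() nonlazybind

  define void @bar() {
    call void @foo()
    ret void
  }
  ; CHECK-NOT: callq *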

> The only other indirect branches remaining that we are aware of are from
> precompiled runtimes (such as crt0.o and similar). The ones we have
> found are not really attackable, and so we have not focused on them
> here, but eventually these runtimes should also be replicated for
> retpoline-ed configurations for completeness.
>
> For kernels or other freestanding or fully static executables, the
> compiler switch `-mretpoline` is sufficient to fully mitigate this
> particular attack. For dynamic executables, you must compile *all*
> libraries with `-mretpoline` and additionally link the dynamic
> executable and all shared libraries with LLD and pass `-z retpolineplt`
> (or use similar functionality from some other linker). We strongly
> recommend also using `-z now` as non-lazy binding allows the
> retpoline-mitigated PLT to be substantially smaller.

Since we have to rebuild everything anyway and lazy binding is more
expensive, wouldn't -fno-plt be the best mitigation?
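
For concreteness, the two configurations would look something like this
(flag spellings taken from the patch description):

  # Fully mitigated dynamic executable, as described above:
  clang -mretpoline -fuse-ld=lld -Wl,-z,retpolineplt -Wl,-z,now main.c

  # Alternative: skip the PLT entirely, so no linker support is needed:
  clang -mretpoline -fno-plt main.c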

> +Function *X86RetpolineThunks::createThunkFunction(Module &M, StringRef Name) {
> +  LLVMContext &Ctx = M.getContext();
> +  auto Type = FunctionType::get(Type::getVoidTy(Ctx), false);
> +  Function *F =
> +      Function::Create(Type, GlobalValue::LinkOnceODRLinkage, Name, &M);
> +  F->setVisibility(GlobalValue::HiddenVisibility);

We should probably put this in a comdat so that the linker keeps only
one copy.
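
Something like this, presumably:

  // Put the thunk in a comdat keyed on its name so that the linker keeps
  // a single copy even on targets where linkonce_odr alone is not enough.
  Comdat *C = M.getOrInsertComdat(Name);
  C->setSelectionKind(Comdat::Any);
  F->setComdat(C);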

> Index: llvm/lib/CodeGen/IndirectBrExpandPass.cpp
> ===================================================================
> --- /dev/null
> +++ llvm/lib/CodeGen/IndirectBrExpandPass.cpp
> @@ -0,0 +1,187 @@
> +//===- IndirectBrExpandPass.cpp - Expand indirectbr to switch -------------===//
> +//
> +//                     The LLVM Compiler Infrastructure
> +//
> +// This file is distributed under the University of Illinois Open Source
> +// License. See LICENSE.TXT for details.
> +//
> +//===----------------------------------------------------------------------===//
> +/// \file
> +///
> +/// Implements an expansion pass to turn `indirectbr` instructions in the IR
> +/// into `switch` instructions. This works by enumerating the basic blocks in
> +/// a dense range of integers, replacing each `blockaddress` constant with the
> +/// corresponding integer constant, and then building a switch that maps from
> +/// the integers to the actual blocks.
> +///
> +/// While this is generically useful if a target is unable to codegen
> +/// `indirectbr` natively, it is primarily useful when there is some desire to
> +/// get the builtin non-jump-table lowering of a switch even when the input
> +/// source contained an explicit indirect branch construct.

This pass converts each indirectbr to its own switch. One feature the
old clang expansion of computed gotos had is that each function ended up
with a single switch. That avoids code explosion in direct-threaded
interpreters, where every opcode ends with an indirect jump that can go
to any other opcode.
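
For reference, a minimal sketch of that pattern (GNU computed goto;
opcode names illustrative). With one switch per indirectbr, every opcode
below gets its own copy of the whole dispatch switch, so code size grows
quadratically in the number of opcodes:

  void run(const unsigned char *code) {
    static void *const targets[] = {&&op_add, &&op_sub, &&op_halt};
    const unsigned char *pc = code;
    goto *targets[*pc++]; // each of these goto* becomes an indirectbr
  op_add:
    /* ... */
    goto *targets[*pc++];
  op_sub:
    /* ... */
    goto *targets[*pc++];
  op_halt:
    return;
  }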

> Index: lld/test/ELF/x86-64-retpoline.s
> ===================================================================
> --- /dev/null
> +++ lld/test/ELF/x86-64-retpoline.s
> @@ -0,0 +1,51 @@
> +// REQUIRES: x86
> +// RUN: llvm-mc -filetype=obj -triple=x86_64-unknown-linux %s -o %t1.o
> +// RUN: llvm-mc -filetype=obj -triple=x86_64-unknown-linux %p/Inputs/shared.s -o %t2.o
> +// RUN: ld.lld -shared %t2.o -o %t2.so
> +
> +// RUN: ld.lld -shared %t1.o %t2.so -o %t.exe -z retpolineplt
> +// RUN: llvm-objdump -d %t.exe | FileCheck %s

Please add a CHECK line before the following ones to show which section
is being disassembled.

You also want to check the address of the .got.plt to show where some of
the magic numbers in the CHECK lines come from.
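
For example, something like (addresses and check prefix illustrative):

  // RUN: llvm-readobj --sections %t.exe | FileCheck --check-prefix=GOTPLT %s
  // GOTPLT: Name: .got.plt
  // GOTPLT: Address: 0x202000

  // CHECK: Disassembly of section .plt: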

The same issue is present in other tests.

> +template <class ELFT>
> +void Retpoline<ELFT>::writeGotPlt(uint8_t *Buf, const Symbol &S) const {
> +  write32le(Buf, S.getPltVA() + 21);

Add a comment explaining where the 21 comes from.
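
If I am reading the entry layout in writePlt below correctly, 21 is just
the size of the mov/call/pause/jmp/mov/ret sequence (7 + 5 + 2 + 2 + 4 +
1 bytes), i.e. something like:

  // Point the initial .got.plt entry at the lazy-binding code in the PLT
  // entry, which starts after the 21 bytes of code quoted in writePlt.
  write32le(Buf, S.getPltVA() + 21);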

> +template <class ELFT> void Retpoline<ELFT>::writePltHeader(uint8_t *Buf) const {
> +  const uint8_t Insn[] = {
> +      0xff, 0x35, 0, 0, 0, 0,       //   pushq GOTPLT+8(%rip)
> +      0x4c, 0x8b, 0x1d, 0, 0, 0, 0, //   mov GOTPLT+16(%rip), %r11
> +      0xe8, 0x0e, 0x00, 0x00, 0x00, //   callq next
> +      0xf3, 0x90,                   // loop: pause
> +      0xeb, 0xfc,                   //   jmp loop
> +      0x0f, 0x1f, 0x44, 0x00, 0x00, //   nop
> +      0x0f, 0x1f, 0x44, 0x00, 0x00, //   nop; .align 16

The nops are never executed, right? Could we use traps?
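
E.g. replace the ten nop bytes with int3 (0xcc), which would fault
instead of sliding if anything ever did land there:

  0xcc, 0xcc, 0xcc, 0xcc, 0xcc, //   int3 padding; never executed
  0xcc, 0xcc, 0xcc, 0xcc, 0xcc, //   int3 padding; .align 16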

> +template <class ELFT>
> +void Retpoline<ELFT>::writePlt(uint8_t *Buf, uint64_t GotPltEntryAddr,
> +                               uint64_t PltEntryAddr, int32_t Index,
> +                               unsigned RelOff) const {
> +  const uint8_t Insn[] = {
> +      0x4c, 0x8b, 0x1d, 0, 0, 0, 0, //   mov foo@GOTPLT(%rip), %r11
> +      0xe8, 0x04, 0x00, 0x00, 0x00, //   callq next
> +      0xf3, 0x90,                   // loop: pause
> +      0xeb, 0xfc,                   //   jmp loop; .align 16
> +      0x4c, 0x89, 0x1c, 0x24,       // next: mov %r11, (%rsp)
> +      0xc3,                         //   ret

This code sequence (mov + ret) already exists in the header. Can't we
reuse it from there and reduce the size a bit?
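
That is, the entry's call has to stay (it pushes the pause loop as the
speculation trap), but it could target the header's mov/ret pair instead,
something like (call displacement illustrative):

  0x4c, 0x8b, 0x1d, 0, 0, 0, 0, //   mov foo@GOTPLT(%rip), %r11
  0xe8, 0, 0, 0, 0,             //   callq <mov %r11,(%rsp); ret in header>
  0xf3, 0x90,                   // loop: pause
  0xeb, 0xfc,                   //   jmp loop; .align 16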

> +void RetpolinePic::writeGotPlt(uint8_t *Buf, const Symbol &S) const {
> +  write32le(Buf, S.getPltVA() + 16);

Add a comment explaining the 16 here as well.

Cheers,
Rafael

