[PATCH] Add Forward-Edge Control-Flow Integrity support

Tue Jul 29 10:28:40 PDT 2014

================
Comment at: include/llvm/Target/TargetInstrInfo.h:342
@@ +341,3 @@
+  /// either the instruction returned by getUnconditionalBranch or the
+  /// instruction returned by getTrap. This only makes sense because
+  /// getUnconditionalBranch returns a single, specific instruction. This
----------------
Tom Roeder wrote:
> JF Bastien wrote:
> > The maximum of uncond branch and trap?
> Right.
Could you clarify the comment for this?

================
Comment at: lib/Target/ARM/ARMBaseInstrInfo.cpp:4364
@@ -4363,2 +4363,3 @@
 
+// This must be kept in sync with getJumpInstrTableEntryBound.
 void ARMBaseInstrInfo::getUnconditionalBranch(
----------------
Tom Roeder wrote:
> JF Bastien wrote:
> > This comment (and the others below and in other files) isn't clear: what has to be kept in sync exactly?
> See my comment after getJumpInstrTableEntryBound: the bound in that function must be a bound on the possible instruction lengths returned by these two functions.
What I meant is that the comment on its own doesn't give the reader enough information to understand it. It makes sense in the context of this patch, but once it's checked in I'd have a WTF moment reading just this one comment. You should probably refer the reader to the base class's getJumpInstrTableEntryBound.

================
Comment at: lib/Target/ARM/ARMBaseInstrInfo.cpp:4391
@@ +4390,3 @@
+  // At worst, this will be a branch to a 32-bit value. So, that's one byte for
+  // the branch instruction, and 4 bytes for the value. The trap instruction
+  // fits in 4 bytes, so this suffices as a bound.
----------------
Tom Roeder wrote:
> JF Bastien wrote:
> > Thumb instructions are 2 or 4 bytes, and ARM instructions are always 4. The largest immediate on a direct branch has 26 bits, so that's insufficient for a lot of cases, you'll therefore need a PC-relative load followed by an indirect branch, and a location to store the 32-bit constant pool entry for the address. Assuming you do:
> >   ldr rX, [PC, +#8]
> >   bx rX
> >   0xdeadbeef ; The address
> > Then you need 12 bytes, which I think is the worst case. Note that the constant pool entry could be elsewhere, preferably *not* in executable memory, but that would require another register and more infrastructure.
> > 
> > Note that you may want to use blx instead, if you intend on balancing the CPU's call/return stack: blx would be for calls, see the following page for returns
> >   http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0438i/BABGEAEF.html
> I don't understand, but that's probably because I don't know ARM well. The goal here is to do a direct, unconditional branch using the code that's already checked in in the function getUnconditionalBranch. Are you saying that there are circumstances in which that code doesn't work?
> 
> If not, then are you saying that there are circumstances in which that code will generate this 12 byte sequence? The bound code needs to provide an upper bound for the sequences produced by getUnconditionalBranch and getTrap.
> 
> With respect to b vs blx, the goal is to perform an unconditional branch without touching any stacks at all; does b fail this condition? I'm trying to do the equivalent of the X86 jmp.
ARM has a fixed-width instruction encoding, and immediates are either encoded in the instruction itself or they have to be loaded if they overflow the storage available inside a fixed-width instruction. ARM doesn't have variable-bit-width instructions with very big immediates like x86 does.

26 bits is the biggest PC-relative immediate you can encode in a direct branch in ARM mode (it's smaller for Thumb and Thumb2), so any branch whose target falls outside that range must be an indirect branch (or you can sprinkle the code with trampolines, ew).
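
To make the range limit concrete, here is a rough sketch of the check for ARM mode only. The helper names are made up for illustration and are not LLVM APIs. It assumes the usual ARM-mode rules: the PC reads as the branch address plus 8, and the offset is a signed, word-aligned 26-bit value.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only; not part of LLVM.

// Offset an ARM-mode B would have to encode. In ARM mode the PC reads as
// the address of the current instruction plus 8.
constexpr int64_t armBranchOffset(uint32_t BranchAddr, uint32_t TargetAddr) {
  return int64_t(TargetAddr) - (int64_t(BranchAddr) + 8);
}

// True if the target is reachable with a single direct B: the offset must
// be word-aligned and fit in a signed 26-bit range.
constexpr bool fitsInArmDirectBranch(uint32_t BranchAddr, uint32_t TargetAddr) {
  return armBranchOffset(BranchAddr, TargetAddr) % 4 == 0 &&
         armBranchOffset(BranchAddr, TargetAddr) >= -(int64_t(1) << 25) &&
         armBranchOffset(BranchAddr, TargetAddr) <= (int64_t(1) << 25) - 4;
}
```

Anything for which this returns false needs the load-plus-indirect-branch sequence (or a trampoline).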

So yes, for ARM the upper bound would be 12 bytes if the sequence I suggest is used. For Thumb1 and Thumb2 it would be 8 (LDR and BX can fit in 2 bytes each, and the immediate takes 4). For Thumb2, though, that depends on the register allocator picking one of the first 8 registers (out of 16) for the LDR instruction, and I'm not sure LLVM guarantees this.
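
As a sketch of that arithmetic (the names below are hypothetical and this is not what the patch's getJumpInstrTableEntryBound does today), the bound for the LDR + BX + literal sequence could be computed like this. The 4-byte figure for a Thumb2 LDR that uses a high register is my assumption about the wide encoding, so double-check it against the ARM ARM.

```cpp
// Hypothetical sketch; names are illustrative and not from the patch.
enum class InstrMode { ARM, Thumb };

// Size of the PC-relative LDR. Assumption: the 2-byte Thumb encoding is
// only available when the destination is one of the low registers r0-r7.
constexpr unsigned ldrBytes(InstrMode Mode, bool LdrUsesLowReg) {
  return (Mode == InstrMode::Thumb && LdrUsesLowReg) ? 2 : 4;
}

// Size of the indirect branch (BX rX).
constexpr unsigned bxBytes(InstrMode Mode) {
  return Mode == InstrMode::Thumb ? 2 : 4;
}

// The 32-bit address stored inline after the branch.
constexpr unsigned LiteralBytes = 4;

// Upper bound on one jump-table entry using the LDR + BX + literal sequence.
constexpr unsigned entryBound(InstrMode Mode, bool LdrUsesLowReg) {
  return ldrBytes(Mode, LdrUsesLowReg) + bxBytes(Mode) + LiteralBytes;
}
```

With these assumptions, ARM mode gives 12 and Thumb with a low register gives 8, matching the numbers above. The high-register Thumb2 case is the one where 8 stops being a safe bound.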


On ARM branches:
  * B is a direct branch, with a limited PC-relative immediate.
  * BL and BLX are the same direct branches, but they also set the link register (aka LR aka r14), meaning that the branch is a call: LR is set to the address to return to, which is the one right after the BL[X] instruction. The "X" means that the instruction set should be eXchanged (if it's currently ARM then go to Thumb, if it's Thumb then go to ARM). That's generally not something you'd want to do!
  * BX is an indirect branch, and so is BLX (same name as the above BLX, but with a register operand).
All of these can be conditionalized, either straight in the instruction in ARM mode or with an IT block in Thumb mode. They can also be unconditional.

I recommend looking at the "ARM Architecture Reference Manual ARMv7-A and ARMv7-R edition". It's available from ARM's site behind a registration (or you can get it from work at arm-eng). See section A8.8.18 and later.


The point I'm trying to make about the call/return stack is unrelated to the stack where values are spilled: it's a non-architectural stack of previous calls that the CPU keeps internally. The CPU uses it to predict return locations, so that it can speculatively start executing instructions before it knows where an indirect branch will actually go. If you imbalance that stack then you'll shed performance, sometimes 5%-15%, so keeping it balanced matters!
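
If it helps, here is a toy model of that predictor. It is purely illustrative: the class and method names are invented, and real hardware is more complicated. Calls push the expected return address, returns pop and compare, and a plain jump leaves the stack alone.

```cpp
#include <cstdint>
#include <vector>

// Toy model of a CPU's internal return-address predictor stack.
// Hypothetical and heavily simplified; for illustration only.
class ReturnStackModel {
  std::vector<uint32_t> Stack;
  unsigned Mispredicts = 0;

public:
  // A linking branch (BL/BLX): the predictor records the return address.
  void call(uint32_t ReturnAddr) { Stack.push_back(ReturnAddr); }

  // A non-linking branch (B/BX used as a jump): the predictor sees no call.
  void jump() {}

  // A return: the predictor pops its guess and compares it to reality.
  void ret(uint32_t ActualReturnAddr) {
    if (Stack.empty() || Stack.back() != ActualReturnAddr)
      ++Mispredicts;
    if (!Stack.empty())
      Stack.pop_back();
  }

  unsigned mispredicts() const { return Mispredicts; }
};
```

In this model, two nested call() / ret() pairs produce no mispredictions. If the inner call is replaced by jump() while the code still returns to two different addresses, both returns are mispredicted. That is the kind of imbalance I'm worried about.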


I'm not sure if that helps?

http://reviews.llvm.org/D4167