<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
On 2/7/2017 20:02, Kostya Serebryany wrote:<br>
<blockquote
cite="mid:CAN=P9pjH4A0GdAT_7pd8YjUV+9T+XZMyEbxjQkQ3GW1qebwAFw@mail.gmail.com"
type="cite">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">On Tue, Feb 7, 2017 at 4:05 PM,
LeMay, Michael via llvm-dev
<span dir="ltr">&lt;<a moz-do-not-send="true"
href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>&gt;</span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex"><br>
</blockquote>
</div>
</div>
</div>
</blockquote>
...<br>
<blockquote
cite="mid:CAN=P9pjH4A0GdAT_7pd8YjUV+9T+XZMyEbxjQkQ3GW1qebwAFw@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex"> <br>
</blockquote>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
The runtime library [1] simply initializes one bounds
register, BND0, to have an upper bound that is set below
all safe stacks and above all ordinary data.
</blockquote>
<div><br>
</div>
<div>So you enforce that safe stacks and other data are not
intermixed, as you explain below. </div>
<div>What are the downsides? Performance? Compatibility? <br>
</div>
</div>
</div>
</div>
</blockquote>
<br>
I think the main downside is that only a limited number of threads
can be created before their safe stacks would protrude below the
bound. Extending the proposed runtime library to deallocate safe
stacks when they are no longer needed may help with this. The safe
stacks also cannot grow, since they are allocated contiguously at
high addresses.<br>
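<br>
As a rough illustration of that limit, here is a small Python sketch.
The addresses, the stack size, and the function names are all made up
for the example; none of them are taken from the runtime library:<br>
<br>
```python
# Hypothetical layout values, chosen only for illustration.
REGION_TOP = 0x7FFFFFFFF000    # assumed end (exclusive) of the safe stack region
BND0_UPPER = 0x7FFFF0000000    # assumed upper bound held in BND0
STACK_SIZE = 8 * 1024 * 1024   # assumed fixed size of each safe stack (8 MiB)


def max_safe_stacks(region_top, bound, stack_size):
    """Count the fixed-size stacks that fit strictly above the bound.

    Stacks are carved out contiguously, moving downward from region_top,
    and every byte of every stack has to stay above the bound.
    """
    usable_bytes = max(0, region_top - (bound + 1))
    return usable_bytes // stack_size


def lowest_stack_address(region_top, stack_size, index):
    """Lowest address of the index-th stack (index 0 is the topmost one)."""
    return region_top - (index + 1) * stack_size


limit = max_safe_stacks(REGION_TOP, BND0_UPPER, STACK_SIZE)
print("safe stacks that fit above the bound:", limit)
```
<br>
With these assumed numbers, the next stack after the limit would start
at or below the bound, so stores to it would no longer be protected.<br>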
<br>
<blockquote
cite="mid:CAN=P9pjH4A0GdAT_7pd8YjUV+9T+XZMyEbxjQkQ3GW1qebwAFw@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
A pre-isel patch instruments stores that are not
authorized to access the safe stack by preceding each such
instruction with a BNDCU instruction.
</blockquote>
<div><br>
</div>
<div>My understanding is that BNDCU is the cheapest possible
instruction, just like XOR or ADD, </div>
<div>so the overhead should be relatively small. </div>
<div>Still my guesstimate would be >= 5% since stores are
very numerous. </div>
<div>And such overhead will be on top of whatever overhead
SafeStack has. </div>
<div>Do you have any measurements to share? <br>
</div>
</div>
</div>
</div>
</blockquote>
<br>
I'm working on getting approval to release some benchmark results.<br>
<br>
<blockquote
cite="mid:CAN=P9pjH4A0GdAT_7pd8YjUV+9T+XZMyEbxjQkQ3GW1qebwAFw@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
That checks whether the following store accesses memory
that is entirely below the upper bound in BND0 [2]. Loads
are not instrumented, since the purpose of the checks is
only to help prevent corruption of the safe stacks.
Authorized safe stack accesses are not instrumented, since
the SafeStack pass is responsible for verifying that such
accesses do not corrupt the safe stack. The default
handler is used when a bound check fails, which results in
the program being terminated on the systems where I have
performed tests.<br>
<br>
To reduce the performance and size overhead from
instrumenting the code, both the pre-isel patch and a
pre-emit patch elide various checks [2, 3]. The pre-isel
patch uses techniques derived from the BoundsChecking pass
to statically verify that some stores are safe so that the
checks for those stores can be elided. The pre-emit patch
compares the bound checks in each basic block and combines
those that are redundant. The contents of BND0 are
static, so a successful check of a higher address implies
that any check of a lower address will also succeed.
Thus, if a check of a higher address precedes a check of a
lower address in a basic block, the latter check can be
erased. On the other hand, if a check of a lower address
precedes a check of a higher address in a basic block,
then the latter check can still be erased, but it is also
necessary to use the higher address in the remaining
check. However, my pass is only able to statically
compare certain addresses, which limits the checks that
can be combined. For example, if two addresses use the
same base and index registers and scale along with a
simple displacement, then my pass may be able to compare
them. However, if either the base or the index register
is redefined by an instruction between the two checks,
then my pass is currently unable to compare the two
addresses. </blockquote>
<div><br>
</div>
<div>The usual question in such situation: how do we verify
that the optimizations are not too optimistic? </div>
<div>If we remove a check that is not in fact redundant, we
will never know, until clever folks use it for an exploit
(and maybe not even then). <br>
</div>
</div>
</div>
</div>
</blockquote>
<br>
The pre-emit pass is able to verify that some checks are redundant
by inspecting the operands used to specify an address. For example,
consider the following test for the pre-emit pass:<br>
<br>
0: %rax = MOVSX64rr32 killed %edi<br>
1: INLINEASM $"bndcu $0, %bnd0", 8, 196654, _, 8, %rax, @x + 4,
_<br>
; CHECK: INLINEASM $"bndcu $0, %bnd0", 8, 196654, _, 8, %rax, @x
+ 8, _<br>
2: MOV32mi _, 8, %rax, @x, _, 0<br>
3: INLINEASM $"bndcu $0, %bnd0", 8, 196654, _, 8, %rax, @x + 8,
_<br>
; CHECK-NOT: INLINEASM $"bndcu $0, %bnd0", 8, 196654, _, 8,
%rax, @x + 8, _<br>
4: MOV32mi _, 8, killed %rax, @x + 4, _, 0<br>
<br>
The pass verifies that the only difference between the memory
operands in instructions 1 and 3 is the offset from the global
variable, so the two checks can be combined. The pass also
tracks register definitions, so it would know not to combine the
checks in this example if there had been an instruction that
redefined %rax between instructions 1 and 3.<br>
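<br>
To make the combining rule concrete, here is a simplified Python model
of it. This is only a sketch and not the code in the patch: the Check,
Def and Other tuples and the combine_checks function are hypothetical
names, and the model only compares operands that share the same base,
index, scale and symbol:<br>
<br>
```python
from collections import namedtuple

# Hypothetical, simplified stand-ins for machine instructions.
Check = namedtuple("Check", "base index scale symbol disp")  # a BNDCU check
Def = namedtuple("Def", "reg")                               # redefines a register
Other = namedtuple("Other", "text")                          # anything else


def combine_checks(block):
    """Merge comparable bound checks within one basic block.

    The first check for a given (base, index, scale, symbol) is kept and
    raised to the highest displacement seen; later comparable checks are
    erased. A redefinition of the base or index register makes earlier
    checks that use it incomparable with later ones.
    """
    out = []
    leaders = {}  # (base, index, scale, symbol) -> position of kept check in out
    for inst in block:
        if isinstance(inst, Check):
            key = (inst.base, inst.index, inst.scale, inst.symbol)
            if key in leaders:
                pos = leaders[key]
                kept = out[pos]
                out[pos] = kept._replace(disp=max(kept.disp, inst.disp))
                continue  # the later check is redundant, so drop it
            leaders[key] = len(out)
        elif isinstance(inst, Def):
            leaders = {k: p for k, p in leaders.items()
                       if inst.reg not in (k[0], k[1])}
        out.append(inst)
    return out


# Model of the test above: no base register, index %rax, scale 8, symbol @x.
block = [
    Def("rax"),
    Check(None, "rax", 8, "x", 4),
    Other("store to @x"),
    Check(None, "rax", 8, "x", 8),
    Other("store to @x + 4"),
]
combined = combine_checks(block)
# The remaining check now uses displacement 8 and the second check is gone.
```
<br>
In this model, placing a Def("rax") between the two checks leaves both
of them in place, which mirrors the register tracking described above.<br>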
<br>
On the other hand, some of the optimizations described in the next
couple of paragraphs may be too optimistic, so I especially welcome
feedback on them:<br>
<br>
...<br>
<br>
<blockquote
cite="mid:CAN=P9pjH4A0GdAT_7pd8YjUV+9T+XZMyEbxjQkQ3GW1qebwAFw@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
The pre-emit pass also erases checks for addresses that do
not specify a base or index register as well as those that
specify a RIP-relative offset with no index register. I
think that the source code would need to be quite
malformed to corrupt safe stacks using such address types.<br>
</blockquote>
</div>
</div>
</div>
</blockquote>
...<br>
<blockquote
cite="mid:CAN=P9pjH4A0GdAT_7pd8YjUV+9T+XZMyEbxjQkQ3GW1qebwAFw@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
The pre-emit pass also erases bound checks for accesses
relative to a non-default segment, such as thread-local
accesses relative to FS. Linear addresses for
thread-local accesses are computed with a non-zero segment
base address, so it would be necessary to check
thread-local effective addresses against a bounds register
with an upper bound that is adjusted down to account for
that rather than the bounds register that is used for
checking other accesses. However, negative offsets are
sometimes used for thread-local accesses, which are
treated as very large unsigned effective addresses.
Checking them would require them to first be added to the
base of the thread-local storage segment.<br>
</blockquote>
</div>
</div>
</div>
</blockquote>
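<br>
The problem with negative thread-local offsets can be seen with a
little 64-bit arithmetic. The sketch below is only an illustration: the
bound and the FS base are invented values, and passes_bndcu is a
hypothetical helper that models the unsigned upper-bound comparison
rather than the instruction itself:<br>
<br>
```python
MASK64 = 2**64 - 1

# Hypothetical values, chosen only for illustration.
BND0_UPPER = 0x7FFFF0000000   # assumed upper bound held in BND0
FS_BASE = 0x7F0000001000      # assumed base of the thread-local segment


def effective_address(offset):
    """Offset as the check would see it: an unsigned 64-bit value."""
    return offset & MASK64


def linear_address(segment_base, offset):
    """Address actually accessed: segment base plus offset, wrapped to 64 bits."""
    return (segment_base + offset) & MASK64


def passes_bndcu(address, upper):
    """Model of the upper-bound check: succeed unless address is above upper."""
    return upper >= (address & MASK64)


offset = -8  # a negative thread-local offset relative to FS
print(hex(effective_address(offset)))
print(passes_bndcu(effective_address(offset), BND0_UPPER))
print(passes_bndcu(linear_address(FS_BASE, offset), BND0_UPPER))
```
<br>
Checked as an effective address, the offset looks like an address near
the top of the address space and fails the check, even though the
linear address it resolves to is below the assumed bound.<br>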
...<br>
<br>
Thanks,<br>
Michael<br>
<br>
</body>
</html>