[cfe-dev] [llvm-dev] the as-if rule / perf vs. security

Sanjay Patel via cfe-dev cfe-dev at lists.llvm.org
Wed Mar 16 11:28:38 PDT 2016


We are careful not to try this optimization where it would extend the range
of loaded memory; this is purely for what I call a "load doughnut". :)
Reading past either specified edge would be very bad because it could cause
a memory fault / exception where there was none in the original program.
That's definitely not legal.
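
As a rough illustration of that distinction, here is a sketch in plain C. It
is not the LLVM implementation: the helper names (is_load_doughnut,
load_scalar, load_widened, check_same_result) are made up for this example,
and the widened form only mimics the movups + andps sequence. It assumes the
whole 4-int array is readable, which is exactly what loading both endpoints
guarantees at the machine level.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NUM_LANES 4

/* Hypothetical helper (not LLVM code): a full-width load reads no byte
 * outside the originally accessed range only if the first and last lanes
 * are both loaded by the original program. */
static bool is_load_doughnut(const bool lane_loaded[NUM_LANES]) {
    return lane_loaded[0] && lane_loaded[NUM_LANES - 1];
}

/* Original form: three scalar loads, lane 2 set to zero. */
static void load_scalar(const int *x, int out[NUM_LANES]) {
    out[0] = x[0];
    out[1] = x[1];
    out[2] = 0;
    out[3] = x[3];
}

/* Widened form: read all four lanes, then mask lane 2 to zero.
 * The extra value x[2] is read but never reaches the result. */
static void load_widened(const int *x, int out[NUM_LANES]) {
    static const unsigned keep[NUM_LANES] = { ~0u, ~0u, 0u, ~0u };
    unsigned lanes[NUM_LANES];
    memcpy(lanes, x, sizeof lanes);          /* the single wide load */
    for (size_t i = 0; i < NUM_LANES; ++i)
        lanes[i] &= keep[i];                 /* the mask step */
    memcpy(out, lanes, sizeof lanes);
}

/* Sanity check: both forms agree on a fully readable 4-int array. */
static void check_same_result(const int *x) {
    int a[NUM_LANES], b[NUM_LANES];
    load_scalar(x, a);
    load_widened(x, b);
    assert(memcmp(a, b, sizeof a) == 0);
}
```

With lanes {0, 1, 3} loaded, is_load_doughnut returns true and the wide load
touches only bytes between the two requested endpoints. With lanes {1, 2, 3}
loaded it returns false, because reading lane 0 would extend the range.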

On Wed, Mar 16, 2016 at 12:20 PM, Craig, Ben <ben.craig at codeaurora.org>
wrote:

> I'm having a hard time finding any problems here, at least as long as the
> value is in the middle.  I wouldn't expect the contents of x[2] to affect
> the timing or power usage of anything.  I guess there would be a minor
> "bad" side effect in that a memory read watchpoint would trigger with the
> 128 bit load that wouldn't be there with the 32-bit loads.  I think it is
> semantically very similar to this situation as well...
>
> v4i32 first_call(int *x) { //use all of the array
>    int f0 = x[0];
>    int f1 = x[1];
>    int f2 = x[2];
>    int f3 = x[3];
>    return (v4i32) { f0, f1, f2, f3 };
> }
> v4i32 second_call(int *x) { //use some of the array
>    int s0 = x[0];
>    int s1 = x[1];
>    int s2 = 0;
>    int s3 = x[3];
>    return (v4i32) { s0, s1, s2, s3 };
> }
> first_call(x);
> second_call(x);
>
> The implementation isn't going to zero out the stack in between those
> calls, so for a short period of time, the memory location of s2 will
> contain x[2].
>
> I'm less sure if the gaps are on the edges.  I'm worried that you might
> end up crossing some important address boundary if you look at something
> earlier or later than what the user requested.
>
>
> On 3/16/2016 11:38 AM, Sanjay Patel wrote:
>
> Hi Ben -
>
> Thanks for your response. For the sake of argument, let's narrow the scope
> of the problem to eliminate some of the variables you have rightfully
> cited.
>
> Let's assume we're not dealing with volatiles, atomics, or FP operands.
> We'll even guarantee that the extra loaded value is never used. This is, in
> fact, the scenario that http://reviews.llvm.org/rL263446 is concerned with.
>
> Related C example:
>
> typedef int v4i32 __attribute__((__vector_size__(16)));
>
> // Load some almost-consecutive ints as a vector.
> v4i32 foo(int *x) {
>    int x0 = x[0];
>    int x1 = x[1];
> // int x2 = x[2];   // U can't touch this?
>    int x3 = x[3];
>    return (v4i32) { x0, x1, 0, x3 };
> }
>
> For x86, we notice that we have nearly a v4i32 vector's worth of loads, so
> we just turn that into a vector load and mask out the element that's
> getting set to zero:
>     movups    (%rdi), %xmm0          ; load 128 bits instead of three 32-bit elements
>     andps     LCPI0_0(%rip), %xmm0   ; put zero bits into the 3rd element of the vector
>
> Should that optimization be disabled by a hypothetical -fextra-secure flag?
>
>
>
> On Wed, Mar 16, 2016 at 7:59 AM, Craig, Ben <ben.craig at codeaurora.org>
> wrote:
>
>> Regarding accessing extra data, there are at least some limits as to what
>> can be accessed.  You can't generate extra loads or stores to volatiles.
>> You can't generate extra stores to atomics, even if the extra stores appear
>> to be the same value as the old value.
>>
>> As for determining where the perf vs. security line should be drawn, I
>> would argue that most compilers have gone too far on the perf side while
>> optimizing undefined behavior.  Dead store elimination leaving passwords in
>> memory, integer overflow checks getting optimized out, and NULL checks
>> optimized away.  Linus Torvalds was complaining about those just recently
>> on this list, and while I don't share his tone, I agree with him regarding
>> the harm these optimizations can cause.
>>
>> If I'm understanding correctly, for your specific cases, you are
>> wondering if it is fine to load and operate on a floating point value that
>> the user did not specifically request you to operate on.  This could cause
>> (at least) two different problems.  First, it could cause a floating point
>> exception.  I think the danger of the floating point exception should rule
>> out loading values the user didn't request.  Second, loading values the
>> user didn't specify could enable a timing attack.  The timing attack is
>> scary, but I don't think it is something we can really fix in the general
>> case.  As long as individual assembly instructions have
>> impractical-to-predict execution times, we will be at the mercy of the
>> current hardware state.  There are timing attacks that can determine TLS
>> keys in a different VM instance based off of how quickly loads in the
>> current process execute.  If our worst timing attack problems are floating
>> point denormalization issues, then I think we are in a pretty good state.
>>
>>
>> On 3/15/2016 10:46 AM, Sanjay Patel via llvm-dev wrote:
>>
>> [cc'ing cfe-dev because this may require some interpretation of language
>> law]
>>
>> My understanding is that the compiler has the freedom to access extra
>> data in C/C++ (not sure about other languages); AFAIK, the LLVM LangRef is
>> silent about this. In C/C++, this is based on the "as-if rule":
>> http://en.cppreference.com/w/cpp/language/as_if
>>
>> So the question is: where should the optimizer draw the line with respect
>> to perf vs. security if it involves operating on unknown data? Are there
>> guidelines that we can use to decide this?
>>
>> The masked load transform referenced below is not unique in accessing /
>> operating on unknown data. In addition to the related scalar loads ->
>> vector load transform that I've mentioned earlier in this thread, see for
>> example:
>> https://llvm.org/bugs/show_bug.cgi?id=20358
>> (and the security paper and patch review linked there)
>>
>>
>> On Mon, Mar 14, 2016 at 10:26 PM, Shahid, Asghar-ahmad <
>> Asghar-ahmad.Shahid at amd.com> wrote:
>>
>>> Hi Sanjay,
>>>
>>>
>>>
>>> >The real question I have is whether it is legal to read the extra
>>> >memory, regardless of whether this is a masked load or something else.
>>>
>>> No, it is not legal AFAIK, because doing so exposes memory contents that
>>> the programmer did not intend to expose. This may be open to
>>> exploitation.
>>>
>>>
>>>
>>> Regards,
>>>
>>> Shahid
>>>
>>>
>>>
>>>
>>>
>>> *From:* llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org]
>>> *On Behalf Of *Sanjay Patel via llvm-dev
>>> *Sent:* Monday, March 14, 2016 10:37 PM
>>> *To:* Nema, Ashutosh
>>> *Cc:* llvm-dev
>>> *Subject:* Re: [llvm-dev] masked-load endpoints optimization
>>>
>>>
>>>
>>> I checked in a patch to do this transform for x86-only for now:
>>> http://reviews.llvm.org/D18094 / http://reviews.llvm.org/rL263446
>>>
>>>
>>>
>>> On Fri, Mar 11, 2016 at 9:57 AM, Sanjay Patel <spatel at rotateright.com>
>>> wrote:
>>>
>>> Thanks, Ashutosh.
>>>
>>> Yes, either TTI or TLI could be used to limit the transform if we do it
>>> in CGP rather than the DAG.
>>>
>>> The real question I have is whether it is legal to read the extra
>>> memory, regardless of whether this is a masked load or something else.
>>>
>>> Note that the x86 backend already does this, so either my proposal is ok
>>> for x86, or we're already doing an illegal optimization:
>>>
>>>
>>> define <4 x i32> @load_bonus_bytes(i32* %addr1, <4 x i32> %v) {
>>>   %ld1 = load i32, i32* %addr1
>>>   %addr2 = getelementptr i32, i32* %addr1, i64 3
>>>   %ld2 = load i32, i32* %addr2
>>>   %vec1 = insertelement <4 x i32> undef, i32 %ld1, i32 0
>>>   %vec2 = insertelement <4 x i32> %vec1, i32 %ld2, i32 3
>>>   ret <4 x i32> %vec2
>>> }
>>>
>>> $ ./llc -o - loadcombine.ll
>>> ...
>>>     movups    (%rdi), %xmm0
>>>     retq
>>>
>>>
>>>
>>>
>>> On Thu, Mar 10, 2016 at 10:22 PM, Nema, Ashutosh
>>> <Ashutosh.Nema at amd.com> wrote:
>>>
>>> This looks interesting. The main motivation appears to be replacing a
>>> masked vector load with a general vector load followed by a select.
>>>
>>> Masked vector loads are, in general, observed to be expensive in
>>> comparison with a regular vector load.
>>>
>>> But if the first and last elements of a masked vector load are
>>> guaranteed to be accessed, then it can be transformed into a vector load.
>>>
>>> In opt this can be driven by TTI, where the benefit of this
>>> transformation should be checked.
>>>
>>>
>>>
>>> Regards,
>>>
>>> Ashutosh
>>>
>>>
>>>
>>> *From:* llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org]
>>> *On Behalf Of *Sanjay Patel via llvm-dev
>>> *Sent:* Friday, March 11, 2016 3:37 AM
>>> *To:* llvm-dev
>>> *Subject:* [llvm-dev] masked-load endpoints optimization
>>>
>>>
>>>
>>> If we're loading the first and last elements of a vector using a masked
>>> load [1], can we replace the masked load with a full vector load?
>>>
>>> "The result of this operation is equivalent to a regular vector load
>>> instruction followed by a ‘select’ between the loaded and the passthru
>>> values, predicated on the same mask. However, using this intrinsic prevents
>>> exceptions on memory access to masked-off lanes."
>>>
>>> I think the fact that we're loading the endpoints of the vector
>>> guarantees that a full vector load can't have any different
>>> faulting/exception behavior on x86 and most (?) other targets. We would,
>>> however, be reading memory that the program has not explicitly requested.
>>>
>>> IR example:
>>>
>>> define <4 x i32> @maskedload_endpoints(<4 x i32>* %addr, <4 x i32> %v) {
>>>   ; load the first and last elements pointed to by %addr and shuffle
>>>   ; those into %v
>>>   %res = call <4 x i32> @llvm.masked.load.v4i32(<4 x i32>* %addr, i32 4,
>>>       <4 x i1> <i1 1, i1 0, i1 0, i1 1>, <4 x i32> %v)
>>>   ret <4 x i32> %res
>>> }
>>>
>>> would become something like:
>>>
>>>
>>> define <4 x i32> @maskedload_endpoints(<4 x i32>* %addr, <4 x i32> %v) {
>>>   %vecload = load <4 x i32>, <4 x i32>* %addr, align 4
>>>   %sel = select <4 x i1> <i1 1, i1 0, i1 0, i1 1>, <4 x i32> %vecload,
>>>       <4 x i32> %v
>>>   ret <4 x i32> %sel
>>> }
>>>
>>> If this isn't valid as an IR optimization, would it be acceptable as a
>>> DAG combine with target hook to opt in?
>>>
>>>
>>> [1] http://llvm.org/docs/LangRef.html#llvm-masked-load-intrinsics
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>> --
>> Employee of Qualcomm Innovation Center, Inc.
>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
>>
>>
>
>
>