[llvm-dev] [cfe-dev] the as-if rule / perf vs. security

Thu Mar 17 07:00:03 PDT 2016

On Wed, Mar 16, 2016 at 2:28 PM, Sanjay Patel via cfe-dev <
cfe-dev at lists.llvm.org> wrote:

> We are careful not to try this optimization where it would extend the
> range of loaded memory; this is purely for what I call a "load doughnut". :)
> Reading past either specified edge would be very bad because it could
> cause a memory fault / exception where there was none in the original
> program. That's definitely not legal.
>
Perhaps this PR would interest you:
https://llvm.org/bugs/show_bug.cgi?id=25474.
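
For reference, the "load doughnut" condition quoted above can be written down
as a small predicate. This is only a sketch under stated assumptions:
is_load_doughnut and lane_used are hypothetical names for illustration, not
anything in LLVM, and the real transform checks more than this (alignment,
volatility, profitability).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not LLVM code: lane_used[i] is true when the
   original program loads element i of an n-element group. */
static bool is_load_doughnut(const bool *lane_used, size_t n)
{
    assert(lane_used != NULL && n > 0);
    /* Both edges must already be read; only interior lanes may be holes,
       so a single wide load never extends the range of loaded memory. */
    return lane_used[0] && lane_used[n - 1];
}
```

With lanes {0, 1, 3} used the predicate holds; with lanes {0, 1, 2} used it
does not, because a 4-lane load would read past the last requested element.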

>
>
> On Wed, Mar 16, 2016 at 12:20 PM, Craig, Ben <ben.craig at codeaurora.org>
> wrote:
>
>> I'm having a hard time finding any problems here, at least as long as the
>> value is in the middle.  I wouldn't expect the contents of x[2] to affect
>> the timing or power usage of anything.  I guess there would be a minor
>> "bad" side effect in that a memory read watchpoint would trigger with the
>> 128 bit load that wouldn't be there with the 32-bit loads.  I think it is
>> semantically very similar to this situation as well...
>>
>> v4i32 first_call(int *x) { //use all of the array
>>    int f0 = x[0];
>>    int f1 = x[1];
>>    int f2 = x[2];
>>    int f3 = x[3];
>>    return (v4i32) { f0, f1, f2, f3 };
>> }
>> v4i32 second_call(int *x) { //use some of the array
>>    int s0 = x[0];
>>    int s1 = x[1];
>>    int s2 = 0;
>>    int s3 = x[3];
>>    return (v4i32) { s0, s1, s2, s3 };
>> }
>> first_call(x);
>> second_call(x);
>>
>> The implementation isn't going to zero out the stack in between those
>> calls, so for a short period of time, the memory location of s2 will
>> contain x[2].
>>
>> I'm less sure if the gaps are on the edges.  I'm worried that you might
>> end up crossing some important address boundary if you look at something
>> earlier or later than what the user requested.
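
One way to make the "address boundary" worry concrete is to ask whether a
widened load touches a page that the original loads did not. The sketch below
assumes 4096-byte pages, and the names ASSUMED_PAGE_BYTES and
widening_touches_new_page are made up for illustration. It models only page
boundaries, not other boundaries a target may care about.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { ASSUMED_PAGE_BYTES = 4096 };  /* assumption: typical x86 page size */

static uintptr_t page_of(uintptr_t addr)
{
    return addr / ASSUMED_PAGE_BYTES;
}

/* [lo, hi] is the inclusive byte range the original loads span;
   [wide_lo, wide_hi] is the inclusive byte range of the widened load.
   Returns true when widening reaches a page outside the original span. */
static bool widening_touches_new_page(uintptr_t lo, uintptr_t hi,
                                      uintptr_t wide_lo, uintptr_t wide_hi)
{
    assert(lo <= hi && wide_lo <= lo && hi <= wide_hi);
    return page_of(wide_lo) < page_of(lo) || page_of(wide_hi) > page_of(hi);
}
```

When the gap is in the middle, the widened range equals the original span and
the function returns false. When the gap is on an edge, the widened range can
spill onto a neighbouring page that may not be mapped.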
>>
>>
>> On 3/16/2016 11:38 AM, Sanjay Patel wrote:
>>
>> Hi Ben -
>>
>> Thanks for your response. For the sake of argument, let's narrow the
>> scope of the problem to eliminate some of the variables you have rightfully
>> cited.
>>
>> Let's assume we're not dealing with volatiles, atomics, or FP operands.
>> We'll even guarantee that the extra loaded value is never used. This is, in
>> fact, the scenario that http://reviews.llvm.org/rL263446 is concerned with.
>>
>> Related C example:
>>
>> typedef int v4i32 __attribute__((__vector_size__(16)));
>>
>> // Load some almost-consecutive ints as a vector.
>> v4i32 foo(int *x) {
>>    int x0 = x[0];
>>    int x1 = x[1];
>> // int x2 = x[2];   // U can't touch this?
>>    int x3 = x[3];
>>    return (v4i32) { x0, x1, 0, x3 };
>> }
>>
>> For x86, we notice that we have nearly a v4i32 vector's worth of loads,
>> so we just turn that into a vector load and mask out the element that's
>> getting set to zero:
>>     movups    (%rdi), %xmm0            ; load 128 bits instead of three 32-bit elements
>>     andps    LCPI0_0(%rip), %xmm0     ; put zero bits into the 3rd element of the vector
>>
>> Should that optimization be disabled by a hypothetical -fextra-secure
>> flag?
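
The equivalence being relied on here can be modelled in portable C. This is a
sketch only: vec4, foo_scalar, and foo_wide are hypothetical stand-ins for the
v4i32 code and the movups/andps sequence above, not compiler output.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { int32_t lane[4]; } vec4;  /* stand-in for v4i32 */

/* What the source asks for: three scalar loads, lane 2 set to zero. */
static vec4 foo_scalar(const int32_t *x)
{
    assert(x != NULL);
    vec4 r = { { x[0], x[1], 0, x[3] } };
    return r;
}

/* Model of the optimized code: one 16-byte load, then AND with a mask. */
static vec4 foo_wide(const int32_t *x)
{
    static const uint32_t mask[4] = {
        0xFFFFFFFFu, 0xFFFFFFFFu, 0u, 0xFFFFFFFFu
    };
    uint32_t bits[4];
    vec4 r;

    assert(x != NULL);
    memcpy(bits, x, sizeof bits);          /* reads x[2] as well */
    for (int i = 0; i < 4; i++)
        bits[i] &= mask[i];                /* x[2] is discarded here */
    memcpy(r.lane, bits, sizeof bits);
    return r;
}
```

Both functions return the same value for any readable x[0..3]. The only
difference is that foo_wide briefly holds x[2] in a local, which is the
behaviour the question about a hypothetical -fextra-secure flag is asking
about.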
>>
>>
>>
>> On Wed, Mar 16, 2016 at 7:59 AM, Craig, Ben <ben.craig at codeaurora.org>
>> wrote:
>>
>>> Regarding accessing extra data, there are at least some limits as to
>>> what can be accessed.  You can't generate extra loads or stores to
>>> volatiles.  You can't generate extra stores to atomics, even if the extra
>>> stores appear to be the same value as the old value.
>>>
>>> As for determining where the perf vs. security line should be drawn, I
>>> would argue that most compilers have gone too far on the perf side while
>>> optimizing undefined behavior: dead store elimination leaving passwords in
>>> memory, integer overflow checks getting optimized out, and NULL checks
>>> optimized away.  Linus Torvalds was complaining about those just recently
>>> on this list, and while I don't share his tone, I agree with him regarding
>>> the harm these optimizations can cause.
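
As an illustration of the dead-store case: a plain memset of a buffer that is
never read again may be removed under the as-if rule. A common workaround,
sketched below, is to write through a volatile-qualified pointer. The name
wipe_secret is hypothetical, and whether a given compiler honours this idiom is
a matter of implementation practice rather than a hard language guarantee.
Where available, platform facilities such as memset_s (C11 Annex K, optional)
or explicit_bzero are the more usual choice.

```c
#include <assert.h>
#include <stddef.h>

/* Zero len bytes at buf using volatile stores, so that the compiler is
   discouraged from treating the writes as dead stores. */
static void wipe_secret(void *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    volatile unsigned char *p = buf;
    while (len--)
        *p++ = 0;
}
```

A caller would use wipe_secret(password, sizeof password) immediately before
the buffer goes out of scope.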
>>>
>>> If I'm understanding correctly, for your specific cases, you are
>>> wondering if it is fine to load and operate on a floating point value that
>>> the user did not specifically request you to operate on.  This could cause
>>> (at least) two different problems.  First, it could cause a floating point
>>> exception.  I think the danger of the floating point exception should rule
>>> out loading values the user didn't request.  Second, loading values the
>>> user didn't specify could enable a timing attack.  The timing attack is
>>> scary, but I don't think it is something we can really fix in the general
>>> case.  As long as individual assembly instructions have
>>> impractical-to-predict execution times, we will be at the mercy of the
>>> current hardware state.  There are timing attacks that can determine TLS
>>> keys in a different VM instance based off of how quickly loads in the
>>> current process execute.  If our worst timing attack problems are floating
>>> point denormalization issues, then I think we are in a pretty good state.
>>>
>>>
>>> On 3/15/2016 10:46 AM, Sanjay Patel via llvm-dev wrote:
>>>
>>> [cc'ing cfe-dev because this may require some interpretation of language
>>> law]
>>>
>>> My understanding is that the compiler has the freedom to access extra
>>> data in C/C++ (not sure about other languages); AFAIK, the LLVM LangRef is
>>> silent about this. In C/C++, this is based on the "as-if rule":
>>> http://en.cppreference.com/w/cpp/language/as_if
>>>
>>> So the question is: where should the optimizer draw the line with
>>> respect to perf vs. security if it involves operating on unknown data? Are
>>> there guidelines that we can use to decide this?
>>>
>>> The masked load transform referenced below is not unique in accessing /
>>> operating on unknown data. In addition to the related scalar loads ->
>>> vector load transform that I've mentioned earlier in this thread, see for
>>> example:
>>> https://llvm.org/bugs/show_bug.cgi?id=20358
>>> (and the security paper and patch review linked there)
>>>
>>>
>>> On Mon, Mar 14, 2016 at 10:26 PM, Shahid, Asghar-ahmad <
>>> Asghar-ahmad.Shahid at amd.com> wrote:
>>>
>>>> Hi Sanjay,
>>>>
>>>>
>>>>
>>>> > The real question I have is whether it is legal to read the extra
>>>> > memory, regardless of whether this is a masked load or something else.
>>>>
>>>> No, it is not legal AFAIK, because by doing that we are exposing the
>>>> contents of memory that the programmer did not intend to expose. This
>>>> may be open to exploitation.
>>>>
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Shahid
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> *From:* llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org]
>>>> *On Behalf Of *Sanjay Patel via llvm-dev
>>>> *Sent:* Monday, March 14, 2016 10:37 PM
>>>> *To:* Nema, Ashutosh
>>>> *Cc:* llvm-dev
>>>> *Subject:* Re: [llvm-dev] masked-load endpoints optimization
>>>>
>>>>
>>>>
>>>> I checked in a patch to do this transform for x86-only for now:
>>>> http://reviews.llvm.org/D18094 / http://reviews.llvm.org/rL263446
>>>>
>>>>
>>>>
>>>> On Fri, Mar 11, 2016 at 9:57 AM, Sanjay Patel <spatel at rotateright.com>
>>>> wrote:
>>>>
>>>> Thanks, Ashutosh.
>>>>
>>>> Yes, either TTI or TLI could be used to limit the transform if we do it
>>>> in CGP rather than the DAG.
>>>>
>>>> The real question I have is whether it is legal to read the extra
>>>> memory, regardless of whether this is a masked load or something else.
>>>>
>>>> Note that the x86 backend already does this, so either my proposal is
>>>> ok for x86, or we're already doing an illegal optimization:
>>>>
>>>>
>>>> define <4 x i32> @load_bonus_bytes(i32* %addr1, <4 x i32> %v) {
>>>>   %ld1 = load i32, i32* %addr1
>>>>   %addr2 = getelementptr i32, i32* %addr1, i64 3
>>>>   %ld2 = load i32, i32* %addr2
>>>>   %vec1 = insertelement <4 x i32> undef, i32 %ld1, i32 0
>>>>   %vec2 = insertelement <4 x i32> %vec1, i32 %ld2, i32 3
>>>>   ret <4 x i32> %vec2
>>>> }
>>>>
>>>> $ ./llc -o - loadcombine.ll
>>>> ...
>>>>     movups    (%rdi), %xmm0
>>>>     retq
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Mar 10, 2016 at 10:22 PM, Nema, Ashutosh <Ashutosh.Nema at amd.com>
>>>> wrote:
>>>>
>>>> This looks interesting. The main motivation appears to be replacing a
>>>> masked vector load with a general vector load followed by a select.
>>>>
>>>> Masked vector loads are, in general, observed to be expensive compared
>>>> with a plain vector load.
>>>>
>>>> But if the first and last elements of a masked vector load are guaranteed
>>>> to be accessed, then it can be transformed into a vector load.
>>>>
>>>> In opt this can be driven by TTI, where the benefit of this
>>>> transformation should be checked.
>>>>
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Ashutosh
>>>>
>>>>
>>>>
>>>> *From:* llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org]
>>>> *On Behalf Of *Sanjay Patel via llvm-dev
>>>> *Sent:* Friday, March 11, 2016 3:37 AM
>>>> *To:* llvm-dev
>>>> *Subject:* [llvm-dev] masked-load endpoints optimization
>>>>
>>>>
>>>>
>>>> If we're loading the first and last elements of a vector using a masked
>>>> load [1], can we replace the masked load with a full vector load?
>>>>
>>>> "The result of this operation is equivalent to a regular vector load
>>>> instruction followed by a ‘select’ between the loaded and the passthru
>>>> values, predicated on the same mask. However, using this intrinsic prevents
>>>> exceptions on memory access to masked-off lanes."
>>>>
>>>> I think the fact that we're loading the endpoints of the vector
>>>> guarantees that a full vector load can't have any different
>>>> faulting/exception behavior on x86 and most (?) other targets. We would,
>>>> however, be reading memory that the program has not explicitly requested.
>>>>
>>>> IR example:
>>>>
>>>> define <4 x i32> @maskedload_endpoints(<4 x i32>* %addr, <4 x i32> %v) {
>>>>   ; load the first and last elements pointed to by %addr and shuffle those into %v
>>>>   %res = call <4 x i32> @llvm.masked.load.v4i32(<4 x i32>* %addr, i32 4, <4 x i1> <i1 1, i1 0, i1 0, i1 1>, <4 x i32> %v)
>>>>   ret <4 x i32> %res
>>>> }
>>>>
>>>> would become something like:
>>>>
>>>>
>>>> define <4 x i32> @maskedload_endpoints(<4 x i32>* %addr, <4 x i32> %v) {
>>>>   %vecload = load <4 x i32>, <4 x i32>* %addr, align 4
>>>>   %sel = select <4 x i1> <i1 1, i1 0, i1 0, i1 1>, <4 x i32> %vecload, <4 x i32> %v
>>>>   ret <4 x i32> %sel
>>>> }
>>>>
>>>> If this isn't valid as an IR optimization, would it be acceptable as a
>>>> DAG combine with target hook to opt in?
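
The two IR forms above can be modelled lane by lane in C to see where they
agree and where they differ. This is a rough sketch, not the intrinsic's
implementation: masked_load, load_then_select, and LANES are illustrative
names, and the model ignores alignment and faulting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { LANES = 4 };

/* Model of llvm.masked.load: memory is read only for enabled lanes. */
static void masked_load(int32_t out[LANES], const int32_t *addr,
                        const bool mask[LANES], const int32_t passthru[LANES])
{
    assert(addr != NULL);
    for (int i = 0; i < LANES; i++)
        out[i] = mask[i] ? addr[i] : passthru[i];
}

/* Model of the proposed replacement: full vector load, then select. */
static void load_then_select(int32_t out[LANES], const int32_t *addr,
                             const bool mask[LANES],
                             const int32_t passthru[LANES])
{
    int32_t full[LANES];

    assert(addr != NULL);
    memcpy(full, addr, sizeof full);  /* reads masked-off lanes too */
    for (int i = 0; i < LANES; i++)
        out[i] = mask[i] ? full[i] : passthru[i];
}
```

The results are identical whenever all four lanes are readable. With the mask
<1, 0, 0, 1> the first and last lanes are read in both versions, so the extra
reads fall strictly between addresses the program already touches.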
>>>>
>>>>
>>>> [1] http://llvm.org/docs/LangRef.html#llvm-masked-load-intrinsics
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>>
>>> --
>>> Employee of Qualcomm Innovation Center, Inc.
>>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
>>>
>>>
>>
>>
>>
>
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
>