[cfe-dev] ubsan - active member check for unions

Tue Dec 23 06:51:51 PST 2014

On Fri, Dec 19, 2014 at 2:26 AM, Richard Smith <richard at metafoo.co.uk> wrote:
> On Mon, Dec 15, 2014 at 11:24 AM, Ismail Pazarbasi
> <ismail.pazarbasi at gmail.com> wrote:
>>
>> Hi,
>>
>> I decided to implement a sanitizer that tracks the active member, and
>> reports reads from inactive members.
>>
>> e.g.:
>>   struct S {
>>     union {
>>       long l;
>>       char s[42];
>>       double d;
>>     };
>>   };
>>   void f() {
>>     S s;
>>     s.d = 42.0;
>>     if (s.l > 100) // fire here
>>       ;
>>   }
>>
>> My current implementation is naïve (also possibly wrong), and doesn't
>> care about "special guarantee" that allows inspecting common initial
>> sequence. Briefly;
>> 1. in EmitBinaryOperatorLValue, if we are storing value of a union
>> member, we call a ubsan runtime function (not a handler) that maps
>> pointer value (of the union) to a tuple/struct (of TypeDescr, active
>> field index, expression location that has changed the active member,
>> and array of field names)
>> 2. When a union member is being loaded, a call is emitted to a ubsan
>> runtime function that both checks for, and reports access to inactive
>> members
>
>
> I do not believe this is correct, and I do not believe it can be made to be
> correct. Here's why:
>
> * In C, there is no restriction on accessing an inactive union member. The
> only relevant rule is the effective type rule, in C11 6.5/6, that says that
> you can't store to memory with one type and then load from that same memory
> with an incompatible type. But beware! C is deeply confused, and C11
> footnote 95 says that reading a union member with the wrong effective type
> will produce "the appropriate part of the object representation of the value
> [...] reinterpreted as an object representation in the new type", and then
> refers the reader to another part of the standard which says no such thing.
>
> * In C++, the only standard way to change the active member of a union
> object is by applying placement new to the member (and even that doesn't
> work in some cases). No-one actually writes code that way, so every
> implementation in practice allows the active union member to change in
> various undocumented and non-standard ways.
>
> * It is an explicit goal of UBSan to allow some TUs to be built with it
> enabled and some to be built with it disabled. This check does not and
> cannot support that, because the active union member / effective type can
> change in uninstrumented code. So this does not belong in UBSan.
The primary reason I thought ubsan would be the right place is because
type mismatch check is implemented in ubsan. I also thought about
asking whether ubsan is the right place for this check. According to
my interpretation of the discussions in -core reflector, C DR236, and
in UB list, reasons of undefined-ness of this case are:
a) lifetime of inactive members seems like to end when a member
becomes active (C++-only)
b) multiple subobjects with different type point to the same storage
(C, and C++)

For reference (C++ N3797):
9.5/p1: In a union, at most one of the non-static data members can be
active at any time, that is, the value of at most one of the
non-static data members can be stored in a union at any time. ... The
size of a union is sufficient to contain the largest of its non-static
data members. Each non-static data member is allocated as if it were
the sole member of a struct.

Per your message 25690 to -core reflector, given a union:
  U { int i; float f; } u;
both members' lifetimes will begin when storage for `u` is allocated,
the union has no active member.

 u.i = 0;
means lifetime of `u.f` will end immediately, per 3.8/p1 [...]the
storage which the object occupies is reused[...] So:
 u.f = 0;
is undefined behavior. But:
 new (&u.f) float(0);
is well-defined.

>
> However...
>
> Implementing a sanitizer for C's effective type rule seems like a good idea.
> Such a sanitizer would also be correct in C++ (whose rules are strictly less
> permissive), but not complete. The best way to approach this would be to
> provide shadow memory indicating the effective type of real memory. In order
> to keep things simple, you'd probably want to ignore aggregate types and
> simply track the primitive effective type of memory.
>
> Up to compatibility, there are not many such types (bool, char, short, int,
> long, long long, float, double, long double, and some language extensions),
> so you can probably get away with a 4 bit representation of the type.
> Further, if a byte has effective type of (signed/unsigned/plain) char, the
> adjacent byte must have that type too (due to alignment restrictions), so
> you can use the same type representation for each pair of bytes. There are
> likely to be enough spare values that you can separately encode [4 char], [2
> char][short], [short][2 char], and [short][short], which would let you use a
> single macrotype for each 4 byte chunk of memory. That gives only a 12.5%
> memory overhead, which seems very respectable.
Thanks for the idea!

Which sanitizer do you think fits best for this check? :)

>
>> For above example, sanitizer says:
>>   union-track-active-t1.cpp:11:9: runtime error: active field of union
>> 'S::(anonymous union at union-track-active-t1.cpp:2:3)' is 'd'; access
>> to field 'l' is undefined
>>   union-track-active-t1.cpp:10:7: note: active index set to 'd' here
>>
>> Runtime functions I've added:
>>   struct StringVec {
>>     unsigned NumElts;
>>     const char *Elts[];
>>   };
>>   struct UnionStaticData {
>>     TypeDescriptor *Type;
>>     StringVec* FieldNames;
>>   };
>>
>> extern "C" SANITIZER_INTERFACE_ATTRIBUTE
>> void __ubsan_union_set_active_field_index(uptr Addr, const
>> UnionStaticData *Data,
>>                                             const SourceLocation
>> *ActivationLoc,
>>                                             unsigned ActiveFieldIndex);
>>
>> extern "C" SANITIZER_INTERFACE_ATTRIBUTE
>> void __ubsan_union_check_access(uptr Addr, const SourceLocation *Loc,
>>                                   unsigned FieldIndex);
>>
>> For above code, IR looks like:
>>   // s.d = 42.0;
>>   store double 4.200000e+01, double* %d, align 8
>>   // injected by sanitizer
>>   // __ubsan_union_set_active_field_index(/*union addr*/&s,
>>   // /*type descr=*/'S::anonymous union',
>>   // /*field names*/{"l", "s", "d"},
>>   // /*expr loc=*/"file.cpp:10:7",
>>   // /*field index*/2);
>>   %1 = bitcast %union.anon* %0 to i8*
>>   call void @__ubsan_union_set_active_field_index(i8* %1, i8* bitcast
>> ({ { i16, i16, [56 x i8] }*, { i32, [3 x i8*] }* }* @6 to i8*), i8*
>> bitcast ({ [26 x i8]*, i32, i32 }* @5 to i8*), i32 2)
>>
>>   // if (s.l > 100L) ;
>>   %2 = getelementptr inbounds %struct.S* %s, i32 0, i32 0
>>   %l = bitcast %union.anon* %2 to i64*
>>   %3 = bitcast i64* %l to i8*
>>   call void @__ubsan_union_check_access(i8* %3, i8* bitcast ({ [26 x
>> i8]*, i32, i32 }* @7 to i8*), i32 0)
>>   %4 = load i64* %l, align 8
>>   %cmp = icmp sgt i64 %4, 100
>>
>> I have a few questions regarding the overall design:
>> 1. Do you think this is a useful check?
>> 2. Where can I store type and field info about the union; some form of
>> a shadow memory or a simple array/map?
>> 3. Should sanitizer abort immediately or continue upon detection?
>> 4. When/how can I remove entries from ubsan shadow memory when union's
>> lifetime ends; perhaps in a module pass or at the end of each
>> function?
>>
>> It's far from submission quality at the moment, that's why I am
>> holding back the patch. Before proceeding further, I'd like to get
>> some feedback as to whether this is a valuable feature, and whether I
>> am on the right track.
>>
>> Ismail
>>
>> _______________________________________________
>> cfe-dev mailing list
>> cfe-dev at cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev