[LLVMdev] Union Type

Reid Spencer reid at x10sys.com
Fri Dec 19 04:25:01 PST 2003


As a side effect of bug 178 (Stacker not handling 64-bit pointers on
Solaris), I got to thinking about a union type for LLVM. Is there any
good reason that LLVM shouldn't support unions? A union is essentially a
structure that has its members all at the same address rather than at
sequential addresses. I know there are various issues with unions
(alignment, etc.) but wouldn't it make sense to provide a union type
that deals with all those issues in a platform-independent way?

The reason this comes up is because the idiom of saving space by using a
memory location for storing different types of data is quite frequent. 
For example, suppose you want to store both an "int" and a "char*" in a
single slot in an array. Each slot can have only one or the other type
of value at any given time.  There are three ways to do this: structure,
casting, or union:

1: %foo = type { int, char* };
2: %foo = type { int };
3: %foo = union { int, char* };

Number 3 doesn't exist in LLVM and is what I'm proposing. In the first
case, we get a memory object that holds both an int and a char* at
non-overlapping sequential addresses. This wastes space since the int
and the char* will never be in use at the same time. In the second case
we have just an int that could be cast to a char*, but that might give
undefined results if the size of a char* is larger than the size of an
int.

The third option, union, is the compromise. It says, "make the memory
object as large as the largest element but have all elements start at
(or near) the same memory address". The "or near" part is necessary
because alignment rules might cause one of the members to start at a
non-zero offset from the start of the memory object.
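
For comparison, here is a rough C illustration of options 1 and 3. It is
only a sketch: the names slot_struct, slot_union and the helper functions
are invented for this example and are not part of LLVM or Stacker. Note
that C places every union member at offset 0 and gives the union as a
whole the strictest alignment of its members, so C does not need the "or
near" allowance.

```c
#include <stddef.h>

/* Option 1 from the text: both members, at sequential addresses. */
struct slot_struct { int i; char *p; };

/* Option 3 from the text: both members share one storage location. */
union slot_union { int i; char *p; };

/* Size of the larger of the two member types. */
size_t largest_member_size(void) {
    return sizeof(int) > sizeof(char *) ? sizeof(int) : sizeof(char *);
}

/* 1 if the union is at least as large as its largest member and
   smaller than the struct that holds both members side by side. */
int union_saves_space(void) {
    return sizeof(union slot_union) >= largest_member_size()
        && sizeof(union slot_union) < sizeof(struct slot_struct);
}

/* 1 if both union members start at the beginning of the object. */
int members_share_address(void) {
    return offsetof(union slot_union, i) == 0
        && offsetof(union slot_union, p) == 0;
}
```

On common 32-bit and 64-bit targets both helpers return 1.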

In my particular case in Stacker, I have tried to do something like:

%foo = global [10 x int];

void %func() {
    %int_ptr = getelementptr [10 x int]* %foo, long 0, long 0;
    %int_val = load int* %int_ptr;
    %char_ptr = cast int %int_val to char*;
    %oops = load char* %char_ptr;
    ret void;
}

The above will probably work on a 32-bit platform where pointers are the
same size as int. However, on a 64-bit platform, a pointer is the same
size as a long and the value retrieved from the array would actually
span two entries in the array, one of which could have been corrupted by
a previous write of an integer into the array.  Yes, I know, I should
have chosen the pointer type as the basis for the array .. but then, it
wouldn't work reliably on an 8086 with segmented memory model :)
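
The size mismatch can be checked mechanically. The following C sketch
(the function names are hypothetical, chosen for this example) reports
whether a char* fits in an int on the target, and how many int-sized
array entries a stored pointer would cover.

```c
#include <stddef.h>

/* 1 if a char* fits in an int on this platform, which is the
   assumption the array-of-int approach silently relies on. */
int pointer_fits_in_int(void) {
    return sizeof(char *) <= sizeof(int);
}

/* Number of int-sized array entries a single stored pointer covers,
   rounded up. A result above 1 means a pointer overlaps its neighbour. */
size_t int_slots_per_pointer(void) {
    return (sizeof(char *) + sizeof(int) - 1) / sizeof(int);
}
```

Where pointers are 64 bits wide and int is 32 bits, the second function
returns 2, which is the "span two entries" case described above.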

While various tests for word sizes and alignment rules could be used,
this problem is _gracefully_ handled by unions.  To rewrite the example
above we would use something like:

%a_union = union { int, char* };

%foo = global [ 10 x %a_union ];

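void %func() {
    %int_ptr = getelementptr [10 x %a_union]* %foo,
        long 0, long 0, ubyte 0;
    %int_val = load int* %int_ptr;
    ; member 1 is itself a char*, so its address has type char**
    %char_ptr_addr = getelementptr [10 x %a_union]* %foo,
        long 0, long 0, ubyte 1;
    %char_ptr = load char** %char_ptr_addr;
    %good = load char* %char_ptr;
    ret void;
}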

This effectively does the same thing as the first but the union takes
care of the word length and alignment issues for us.
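
The same array-of-slots idea, written with a C union, might look like
the sketch below. The names a_union, foo and the accessor functions are
assumptions made for the example; this is not how Stacker is actually
written.

```c
/* One slot holds either an int or a char*, never both at once. */
union a_union { int i; char *p; };

/* Ten slots, each big enough for the larger member. */
static union a_union foo[10];

/* Store and load the int member of a slot. */
void store_int(int idx, int v) { foo[idx].i = v; }
int load_int(int idx) { return foo[idx].i; }

/* Store a pointer in a slot, and read the first char it points to. */
void store_str(int idx, char *s) { foo[idx].p = s; }
char load_first_char(int idx) { return *foo[idx].p; }
```

Because each slot is sized for the pointer, writing a pointer into one
slot cannot spill into the next one, whatever the pointer width is.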

If anyone thinks that unions are a bad idea, I challenge you to create a
computer that doesn't support an OR operation. For data structures,
unions fill the same role: structures are AND, unions are OR. Unions
only get dicey when they are incorrectly disambiguated .. but that's a
source language compiler writer's problem.
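
One common way a front end can handle the disambiguation is to keep a
tag next to the union. The C sketch below shows that idea only; the
names (slot_kind, tagged_slot, make_int, get_int, and so on) are made up
for illustration and are not a proposed LLVM feature.

```c
#include <stddef.h>

/* Records which union member is currently live. */
enum slot_kind { SLOT_INT, SLOT_STR };

struct tagged_slot {
    enum slot_kind kind;
    union { int i; char *p; } u;
};

/* Build a slot holding an int. */
struct tagged_slot make_int(int v) {
    struct tagged_slot s;
    s.kind = SLOT_INT;
    s.u.i = v;
    return s;
}

/* Build a slot holding a char*. */
struct tagged_slot make_str(char *p) {
    struct tagged_slot s;
    s.kind = SLOT_STR;
    s.u.p = p;
    return s;
}

/* Returns 1 and writes *out only if the slot holds an int. */
int get_int(const struct tagged_slot *s, int *out) {
    if (s->kind != SLOT_INT) return 0;
    *out = s->u.i;
    return 1;
}

/* Returns 1 and writes *out only if the slot holds a char*. */
int get_str(const struct tagged_slot *s, char **out) {
    if (s->kind != SLOT_STR) return 0;
    *out = s->u.p;
    return 1;
}
```

The tag costs extra space per slot, so a front end that already knows
the live member from context can drop it and use the bare union.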

What think ye?

Reid.
