[LLVMdev] Suggestion: Support union types in IR

Thu Jan 1 06:25:43 PST 2009

On Thursday 01 January 2009 00:57:50 Talin wrote:
> Jon Harrop wrote:
> > I think very few people are still using C-style unions so there is
> > probably little desire to put them into LLVM itself. Class hierarchies
> > and algebraic datatypes are much more common and use a different
> > representation (that LLVM can already handle with bitcasts).
> > Specifically, as a pointer to one of several struct types that can vary
> > in size.
> >
> > I am considering a design along the following lines. All algebraic
> > datatypes are stored as a pointer to an int tag:
> >
> >   i32 *
>
> Exactly. I'm not especially interested in C-style unions, I'm interested
> in discriminated unions. But the actual discriminator field is easily
> represented in LLVM IR already, so there's no need to extend the IR to
> support them. That's why I am only asking for C-style union support -
> it's a lower-level primitive that you can use to build discriminated
> unions on top of.

I don't think you would want to build discriminated unions on top of C-style 
unions though.

> The problem with varying-size structs is that it would be nice to be
> able to pass them as function parameters and function return values
> using LLVM's support of structs as first-class data types.

As I understand it, LLVM does not provide first-class structs and attempting 
to return arbitrary structs from functions will randomly break. I believe you 
need to implement that yourself in whatever way your target language 
requires.

> That requires that the code that handles argument marshalling know the size
> of the largest possible type that can be passed.

Pass a pointer and that size will always be 1 word. You can still stack 
allocate the space if you must and LLVM's optimization passes will do their 
best to completely remove that overhead when possible.

> > The tag conveys the true type of the value and, indeed, it may even be a
> > pointer to comprehensive run-time type information. So you load the tag
> > and examine it in order to determine what type to cast to. Then you
> > bitcast to another type if you need to load any arguments (that appear
> > after the tag). For example, an int option type might cast as follows:
> >
> >   None   (tag=0) -> {i32} *
> >   Some n (tag=1) -> {i32, i32} *
>
> Since the language I am working on is statically typed, I was going to
> select a type discrimination strategy based on the actual types
> involved. For example, in the case of a union of reference types,
> there's no need for a discriminator field,

Only if they are all different reference types. Otherwise you might have the 
situation:

  type t = Negate of t | Factorial of t

Where both constructors carry an argument of the same type (t) and you do 
still need a tag to distinguish between the cases (Negate and Factorial) in 
the union.

> since reference types already 
> carry around their own type information. On the other hand, if you have
> a union of two enum types of limited range, the high bits of the enum
> would be available. In cases such as int or float where all the bits
> were used, then a separate discriminator field would be needed.

There is certainly some scope here and I have been thinking about similar 
things. I would like both run-time type information and efficient pattern 
matching over algebraic datatypes.

The run-time type information can be a field from every reference type that 
points to its run-time type. When pattern matching, you usually want to 
bisect the number of cases covered in the union to perform a balanced search 
and achieve the best performance. That usually assumes the presence of 
consecutive integers as tags so the use of effectively-random pointers would 
throw that off a bit, especially if the run-time type information is 
subjected to a moving garbage collector. Alternatively, I could put a 
redundant tag in each value on the heap...

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e