[LLVMdev] Adding support to LLVM for data & code layout (needed by GHC)

Tue Jun 8 03:42:41 PDT 2010

Hi All,

The GHC developers would like to add support to llvm so that the user
can define the order in which code and data are laid out in the
assembly code that llvm produces. The reason we would like this
feature is explained in the blog post on GHC's use of llvm here:
http://blog.llvm.org/2010/05/glasgow-haskell-compiler-and-llvm.html,
specifically under the title, 'Problems with backend'.

Basically we want to be able to produce code using llvm that looks like this:

.text
    .align 4,0x90
    .long  _ZCMain_main_srt-(_ZCMain_main_info)+0
    .long  0
    .long  196630
.globl _ZCMain_main_info
_ZCMain_main_info:
.Lcg6:
    leal -12(%ebp),%eax
    cmpl 84(%ebx),%eax
    [...]

So in the above code we can access both the code for the function
'_ZCMain_main_info' and the metadata for it needed by the runtime with
just the one label, since the metadata sits immediately before that
label. At the moment llvm just outputs all global variables at the
end.
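
To make the "one label" idea concrete, here is a minimal C++ sketch of
reading a table that sits directly in front of an entry address. It is
only a model: a byte array stands in for the emitted code, and the
names InfoTable, Block, make_block, entry_label and table_before are
all made up for illustration, not anything from GHC or llvm.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical three-word table, mirroring the three .long entries above.
struct InfoTable {
    std::int32_t srt_offset;  // first word (in the listing, SRT address minus entry address)
    std::int32_t word1;       // second word
    std::int32_t word2;       // third word
};

// Stand-in for emitted machine code: table bytes, then dummy "code" bytes.
constexpr std::size_t kCodeBytes = 4;
using Block = std::array<unsigned char, sizeof(InfoTable) + kCodeBytes>;

// Lay the table out directly in front of the (dummy) code.
Block make_block(const InfoTable &table) {
    Block block{};
    std::memcpy(block.data(), &table, sizeof table);
    return block;
}

// The single "label": the address where the code starts.
const unsigned char *entry_label(const Block &block) {
    return block.data() + sizeof(InfoTable);
}

// Recover the table from the entry address alone by stepping backwards.
InfoTable table_before(const unsigned char *entry) {
    assert(entry != nullptr);
    InfoTable table;
    std::memcpy(&table, entry - sizeof(InfoTable), sizeof table);
    return table;
}
```

Calling table_before(entry_label(block)) gives back the three words
that were placed in front of the code, which is the access pattern the
layout above is meant to allow.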

It seems to me that there are three slightly different ways to support
this in llvm:

1) Have llvm preserve the order of data and code from the input file
when they are in the same section

2) Use a new special '@llvm.foo' variable that takes a list of
functions and globals. The order in which they appear in the array is
the order in which they should be output, as one contiguous block.

3) Have llvm be specifically aware of the desire to associate some
global variable with a function. So a function definition could
take a global variable as an attribute. llvm would then output the
function and variable together like in the code above.

I was thinking that the first option is the easiest, both for llvm and
its users. My simple idea was to just somehow store the order in which
functions and globals are read in by AsmParser or created through the
API. You could use a list to do this, or just give each
global/function a number representing its order that could be sorted
on. When it comes time for AsmPrinter to write out the module, it does
so in that order. Any new functions or globals created by
optimisations are simply added to the end of the sort order. This
would produce the above code but also with a label for the data, like
this:

.text
    .align 4,0x90
_ZCMain_main_info_table:
    .long  _ZCMain_main_srt-(_ZCMain_main_info)+0
    .long  0
    .long  196630
.globl _ZCMain_main_info
_ZCMain_main_info:
.Lcg6:
    leal -12(%ebp),%eax
    cmpl 84(%ebx),%eax
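
The numbering scheme for the first option can be sketched roughly as
follows. This is a toy model in C++, not llvm's actual Module or
AsmPrinter code; Symbol, OrderedModule and their member functions are
hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a function or global variable in a module.
struct Symbol {
    std::string name;
    std::size_t order;  // position in which the symbol was read or created
};

// Hypothetical module that numbers symbols as they are added.
class OrderedModule {
public:
    // Used both for symbols read from the input file and for ones that
    // optimisations create later; each simply takes the next number.
    void add(std::string name) {
        assert(!name.empty());
        symbols_.push_back(Symbol{std::move(name), next_order_++});
    }

    // Models a pass shuffling the internal list; numbers stay attached.
    void reverse_internal_list() {
        std::reverse(symbols_.begin(), symbols_.end());
    }

    // What the printer would do: emit names sorted by order number.
    std::vector<std::string> emission_order() const {
        std::vector<Symbol> sorted = symbols_;
        std::sort(sorted.begin(), sorted.end(),
                  [](const Symbol &a, const Symbol &b) { return a.order < b.order; });
        std::vector<std::string> names;
        for (const Symbol &s : sorted) names.push_back(s.name);
        return names;
    }

private:
    std::vector<Symbol> symbols_;
    std::size_t next_order_ = 0;
};
```

Adding '_ZCMain_main_info_table' and then '_ZCMain_main_info' keeps the
table in front of the function on output, however the internal list is
rearranged in between, and anything added afterwards lands at the end.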

The problem could be optimisations though. This is an area I'm not
very knowledgeable in, so please point out any issues. The main
problems I can think of are inlining and dead code removal.

Inlining I believe should be OK as long as it doesn't remove the
original function, since we need that label present to access the
data before it. I wouldn't expect it to be removed though, since the
function will be accessed both as a tail call to it and using pointer
arithmetic with subtraction to get the data before it. The pointer
arithmetic would stop llvm removing it.

The other issue is dead code removal removing the data
('_ZCMain_main_info_table'), since there are no references to it.
That can be easily fixed using @llvm.used.

The biggest problem with this approach is that it limits the
optimisations llvm can do on this code. If the third approach was
taken, for example, llvm could optimise more aggressively and
specifically for the situation. I'm also not exactly sure how link
time optimisation would figure into this at the moment, so perhaps
that's a big issue.
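
To illustrate the dead code removal concern, here is a toy
reachability model in C++. It is not llvm's real pass; References and
live_symbols are hypothetical names, and the extra root in the second
call only imitates the effect of listing the table in @llvm.used.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical module: each symbol maps to the symbols it references.
using References = std::map<std::string, std::vector<std::string>>;

// Keep everything reachable from the roots; anything else counts as dead.
std::set<std::string> live_symbols(const References &refs,
                                   const std::vector<std::string> &roots) {
    std::set<std::string> live;
    std::vector<std::string> worklist(roots.begin(), roots.end());
    while (!worklist.empty()) {
        std::string name = worklist.back();
        worklist.pop_back();
        if (!live.insert(name).second) continue;  // already visited
        auto it = refs.find(name);
        if (it == refs.end()) continue;           // symbol references nothing
        for (const std::string &target : it->second) worklist.push_back(target);
    }
    // Sanity check: every root is live by construction.
    for (const std::string &root : roots) assert(live.count(root) == 1);
    return live;
}
```

With only '_ZCMain_main_info' as a root, the table is unreachable and
would be dropped, because the table refers to the function and not the
other way round. Adding '_ZCMain_main_info_table' as a second root
keeps the table and whatever it references.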

So thoughts, criticisms, alternative suggestions please.

Cheers,
David