[llvm-dev] [frontend-dev][beginner] Allocation of structures

Sean Silva via llvm-dev llvm-dev at lists.llvm.org
Thu Jun 1 17:12:11 PDT 2017


Sorry for this incredibly long response. It's just that probably about 90%
of all beginner questions seem to be related to unfamiliarity with the same
set of basic concepts (which, to be honest, are very poorly known outside
the compiler/toolchain community, so that unfamiliarity is totally
expected). By writing a more elaborate answer, I'm hoping to produce
something that helps many other people understand these concepts better.

Broadly speaking, it may be useful to follow this approach when facing a
question while learning about LLVM:

1. If you are starting from C++ code and trying to understand how it is
lowered, run `clang++ -emit-llvm -O0` and create a C file that produces
essentially the same LLVM IR when passed to `clang -emit-llvm -O0`. This
should always be possible, except perhaps for things related to exception
handling and some things you probably don't care about, like how to put a
function in a comdat or give it a certain linkage. If you can't figure out
how to do that, the answer is probably:
1.a. in the C++ ABI https://itanium-cxx-abi.github.io/cxx-abi/abi.html
(home page: https://itanium-cxx-abi.github.io/cxx-abi/) or
1.b. it is a general C++ language question which you can probably find
information about in http://en.cppreference.com/w/ (or the language
standard if you want to look at it).

Those are fairly dense reference materials, but you should at least browse
around them a bit (and do web searches, especially for general C++ language
questions) to refine your question.

2. If your question wasn't related to something C++ specific, then you now
have a C file and want to know something about how it is lowered to LLVM
IR. To make sure you have a basic understanding of LLVM IR concepts and the
relationship between C and LLVM IR, I would recommend watching this video:
https://www.youtube.com/watch?v=haQ2cijhvhE


3. It's likely that you started with a single question and have now broken
it down into multiple more specific questions that are much more likely to
be answered if you ask them on llvm-dev (or cfe-dev if the question is very
narrowly C++-specific, though for general C++ language questions
StackOverflow might be better).


Another useful exercise is to skim these two videos, which talk about
C++ coroutines (and their implementation in LLVM), a feature that
cuts across the C++, C, and LLVM IR levels (and touches a little on the
machine level in the second video):
https://www.youtube.com/watch?v=8C8NnE1Dg4A
https://www.youtube.com/watch?v=Ztr8QvMhqmQ
You don't need to understand what the videos are talking about in detail;
the point is to see how Gor talks about the same thing to two very
different audiences. The first video is addressed to a CppCon audience,
mostly C++ language users (and people familiar with the C-like aspects of
C++ programming) who are not necessarily familiar with compilers. The
second video is addressed to an audience of LLVM developers, so it freely
mentions LLVM-specific details and uses compiler/toolchain terminology and
concepts. Comparing the two should give some insight into which things are
easier to think about at different levels of analysis, which things work
similarly and can be explained at multiple levels of analysis, and which
things only matter at a particular level of analysis. For example, the
high-level description of the coroutine splitting process can be given in
pretty much the same way at the LLVM IR level or at the C level (with some
intrinsics). You'll also notice that even when talking to an audience of
LLVM developers, many things in the presentation are still described at
the C level.
Coroutines are also interesting in that part of the language-level semantic
lowering (the coroutine splitting) happens inside the optimizer at LLVM IR
level instead of in the frontend for various reasons. Having an
optimization pass that is *semantically required* is unusual, but otherwise
this is conceptually similar to the "alloca trick" that Clang uses for
local variables, which I talk about below (SROA / mem2reg are not *semantically
required for correctness*, but they are needed for any reasonable
performance, so the situation is still pretty analogous).



On Jun 1, 2017 6:47 AM, "Dimitri Racordon via llvm-dev" <
llvm-dev at lists.llvm.org> wrote:

Thanks for this comprehensive answer!
I’m slowly but surely coming to understand all of this better.

> That may sound scary,


Yes it does! That said, I’m not sure I need to care that much about the C++
ABI, as I don’t plan on linking my IR with C++ directly. It is very
interesting to see how it works though, as I’ll probably need to solve the
same kinds of problems for my own frontend language.

The result of an alloca is a pointer into the stack frame of the current
function, so returning it doesn't make sense (like returning the address of
a local variable in C). Again, this problem isn't really related to LLVM
per se and the easiest way to think about it is in terms of how to lower to
C.
Decisions about object destruction are up to the frontend. At the C or LLVM
level they just turn into function calls or whatever you use to implement
the semantics that the frontend requires. In other words, if you want an
object to be destroyed, it's up to your frontend to emit code to perform
the destruction wherever the destruction is expected to occur.


I understand my second question was probably very unclear. When I said
“destroyed”, I wasn’t talking about calls to C++ (or any other frontend
language) destructors, but more about what is happening to the memory that
was stack allocated. Let’s consider the following IR:

define i32 @f() {
entry:
  %0 = alloca i32
  store i32 123, i32* %0
  %1 = load i32, i32* %0
  ret i32 %1
}

define i32 @main() {
entry:
  %0 = alloca i32
  %1 = call i32 @f()
  store i32 %1, i32* %0
  %2 = load i32, i32* %0
  ret i32 %2
}

As I understand it, f will allocate a new 32-bit slot in its stack frame
and store the value 123 in it. But what I’m not sure about is what happens
to that alloca when f returns.


The alloca is part of `f`'s stack frame, so when `f` returns it is gone.
That memory can never be accessed again (obviously at the machine level it
is still there and will probably be overwritten by a future function call,
but as far as LLVM IR semantics (or C semantics) are concerned, it cannot
be accessed again). To phrase this in terms of C, you are essentially
asking what happens to the memory pointed to by `p` in the function `bar`
(defined further below) after it returns (and I think you already know the
answer at the C level; the answer is the same at the LLVM IR level).


> If the alloca is in the frame of f, then the value stored at that location
> should be deallocated (which is what I meant by "destroyed") when f
> returns, and so I don’t understand what makes the `store i32 %1, i32* %0`
> successfully store 123 in the alloca of the main function.


The store inside `f` does not affect the alloca inside the main function.
The main function takes the value returned by the call and stores the value
itself. The return value of `f` is an i32 value that you can think of as
held in a register at the hardware level and returned by value. In the
callee (`@f`) the value 123 happens to have been stored to and loaded back
from an alloca, but that doesn't matter after it returns (and indeed, the
alloca's memory is deallocated when `@f` returns). Also, you'll notice that
the optimizer will happily reduce `@f` to just `ret i32 123` or something
like that and delete the load, store, and alloca.
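
Concretely, this is roughly what the `mem2reg` pass alone leaves of `@f`
(a sketch; the exact output may be spelled slightly differently, and
`@main`'s alloca is promoted away in the same fashion):

define i32 @f() {
entry:
  ; The store and load through the alloca have been folded away.
  ret i32 123
}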

I'm guessing that part of this confusion is that you were seeing Clang
produce a ton of alloca's. You may wonder *why* clang generates all these
alloca's (and you may have noticed the optimizer will just fold them away).
This is one place where the semantics of C and LLVM IR differ
substantially, so thinking in terms of C won't lead to the right conclusion.

The fundamental abstraction of data in C is essentially that every variable
is assumed to be an object in memory (more precisely, this is what the term
"lvalue" in C refers to; if you want to get super technical about this and
related concepts in C/C++ a reference is
http://en.cppreference.com/w/cpp/language/value_category). For example, you
can take the address of any variable.

For example, these two functions are identical:

int foo() {
  int x = 123;
  return x;
}

int bar() {
  int x = 123;
  int *p = &x;
  return *p;
}

Also, in C/C++, variables are mutable. For example, this function is also
identical to the previous functions:

int baz() {
  int x = 123;
  x = 42;
  x = 123;
  return x;
}
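
To connect this to LLVM IR: at -O0, Clang expresses that mutability with
repeated stores to one alloca. Here is a hand-written sketch in the spirit
of Clang's output (not its exact spelling):

define i32 @baz() {
entry:
  %x = alloca i32
  store i32 123, i32* %x   ; int x = 123;
  store i32 42, i32* %x    ; x = 42;  (mutation is just another store)
  store i32 123, i32* %x   ; x = 123;
  %0 = load i32, i32* %x   ; return x;
  ret i32 %0
}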

However, in LLVM IR, when you see something like `%2 = add i32 %1, %0` or
`%2 = load i32, i32* %0`, the thing designated by `%2` is *very* different.
You cannot take the address of `%2` and you cannot mutate the thing
designated by `%2`. You can think of `%2` as a name for a node in a data
flow graph (in fact, `%2` is represented in the LLVM libraries as an
llvm::Value; %2 is a *name* for an llvm::Value). There's more about this
perhaps subtle point from 10:40 to 11:40 in Chandler's video
https://www.youtube.com/watch?v=haQ2cijhvhE, though you will probably find
the first 20 minutes or so very relevant to this conversation (and I would
recommend watching the whole video).

(Sometimes it is also useful to think of LLVM's values as "SSA registers",
i.e. registers that can only be assigned in one place. Don't think of them
as general mutable storage locations, though.)

So Clang has this problem: how do I accurately translate the semantics of
code written in C/C++ (which semantically involve operations on objects in
memory) to LLVM IR, which revolves around nodes in a data flow graph? Those
SSA values have no notion of semantically being in a "storage location";
they are just names for llvm::Value's that the LLVM libraries operate on,
which are (roughly) nodes in a data flow / control flow graph.

The answer is that Clang models C/C++ semantics of local variables by
explicitly using alloca to create temporary in-memory storage locations.
Then the correspondence between the generated LLVM IR and the C/C++
semantics is easy to get right. This requires inserting a lot of loads from
and stores to allocas just to access the values (e.g. two loads, an add,
and a store to do `c = a + b`), but it is worth it because it
simplifies Clang's lowering from C/C++ to LLVM IR.
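
For example, here is a hand-written sketch, in the spirit of Clang's -O0
output, of the loads and stores involved in `c = a + b` (all names are
illustrative):

define void @sum() {
entry:
  %a = alloca i32
  %b = alloca i32
  %c = alloca i32
  store i32 1, i32* %a
  store i32 2, i32* %b
  %0 = load i32, i32* %a   ; read a
  %1 = load i32, i32* %b   ; read b
  %2 = add i32 %0, %1      ; a + b
  store i32 %2, i32* %c    ; c = a + b
  ret void
}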

You may be thinking: in `foo` above, at the C/C++ level it is easy to see
that `x` could be kept in a register instead of a memory location on the
stack, and it is only assigned once, so it could be directly lowered into
an LLVM IR value instead of being kept in memory in an alloca. It turns out
that doing this analysis in Clang is actually very difficult, at least in
any reasonable generality. Fortunately, there are standard compiler
algorithms (generally known as "SSA construction") that can "easily" and
quickly (e.g. in linear time) do the analysis and transformation needed to
optimally (or close to optimally) get rid of all the allocas. The discovery
of an efficient algorithm to do this was actually a revolution in compilers
precisely because it allows an SSA representation like LLVM IR to be
efficiently constructed from input languages with mutable storage locations
(especially in the presence of control flow, which requires deciding where
to insert a special type of instruction called a "phi node" to model when a
local variable is mutated along multiple control flow paths).
`mem2reg` (and its more sophisticated cousin, `sroa`) are the LLVM passes
that do this.
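
Here is a minimal hand-written example of what mem2reg does in the presence
of control flow (illustrative, not the pass's exact output):

; Before promotion: the C local variable `x` lives in an alloca and
; is mutated along two control flow paths.
define i32 @choose(i1 %cond) {
entry:
  %x = alloca i32
  br i1 %cond, label %then, label %else

then:
  store i32 1, i32* %x
  br label %end

else:
  store i32 2, i32* %x
  br label %end

end:
  %v = load i32, i32* %x
  ret i32 %v
}

; After mem2reg (renamed here so both versions fit in one module):
; the alloca is gone and a phi node merges the value flowing in
; from each predecessor block.
define i32 @choose.promoted(i1 %cond) {
entry:
  br i1 %cond, label %then, label %else

then:
  br label %end

else:
  br label %end

end:
  %v = phi i32 [ 1, %then ], [ 2, %else ]
  ret i32 %v
}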


If I had to make a guess, I would say the value pointed to by %0 in f is
copied to the alloca pointed to by %0 in main (which is what I meant by
“copy from the runtime of the callee to that of the caller”).


You are correct in the sense that @f loads the value (`%1 = load i32, i32*
%0`) and returns it (`ret i32 %1`); control then returns to main, `%1 =
call i32 @f()` finishes executing, and @main stores the value %1 into %0
(`store i32 %1, i32* %0`).
There is no magic here connected to the alloca's; you're explicitly seeing
every operation done on those memory locations (which are distinct memory
locations). The only thing that could perhaps be considered "magic" are the
details of how the LLVM backends generate machine code so that the value
passed to the `ret` becomes available to the caller. But the semantics at
the LLVM IR level are straightforward: the value of the call instruction
(%1 in main) is the same as the value passed to the `ret` in the callee (%1
in f).


> With primitive types such as i32, this probably doesn't matter, since they
> would fit in a CPU register, but what if %0 was an alloca for a very large
> struct? I guess that’s what RVO is all about optimising, but I’d like to
> confirm that assumption.


Are you asking about ABI details or LLVM IR semantics?

"RVO" doesn't enter the picture here because we are at LLVM IR level. There
is no RVO at the LLVM IR level because RVO refers to a concept that exists
only at the C++ language level (in the abstract; it's not part of the ABI).
In fact, if you look on the wikipedia page for RVO, you'll see that RVO can
change a well-defined program's console output (i.e. the number of times it
prints `A copy was made.`). RVO is a specific case where the C++ language
standard (the semantics of the language in the abstract; this has nothing
to do with the ABI) permits the compiler to change the output of a
well-defined program, and the set of situations where this transformation
is allowed is defined in terms of the abstract language semantics, which
are only available to the C++ frontend (which has complete language-level
semantic information). This is a case where thinking about LLVM IR in terms
of C code is accurate: if you are lowering C++ to C, then as part of that
conversion RVO is performed, and there is no concept of RVO at the C level.

Semantically at the LLVM IR level a struct return type is returned by value
(i.e. copied) exactly like an i32. `%1` in `@main` represents a value with
the same bit pattern and size as `%1` in `@f`; this is true regardless of
the size of `%1` in `@f`. The alloca inside main is irrelevant to this. The
alloca only matters because the value of `%1` inside `@main` is stored into
the memory it points to by an explicit store instruction, but that is just
a store like any other store (it sets the bits in memory at the address
pointed at by %0 to the bits of the value of %1; the memory address happens
to be an alloca that is allocated in the current stack frame).
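
To make that concrete, here is a hand-written sketch of a function with an
aggregate return type (any struct type works the same way at the IR level):

define { i32, i64 } @makePair() {
entry:
  ; The whole aggregate is the return value, exactly like an i32.
  ret { i32, i64 } { i32 1, i64 2 }
}

define i64 @usePair() {
entry:
  %pair = call { i32, i64 } @makePair()
  ; Pull a field out of the returned value; no memory is involved
  ; at the IR level.
  %second = extractvalue { i32, i64 } %pair, 1
  ret i64 %second
}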


If you are wondering about the low-level details of how `%1` in `@f` ends
up in `%1` in `@main` when the call happens at runtime, then that's a bit
more complicated and depends on the processor-specific ABI. At the
processor ABI level, a struct return type of an LLVM IR function is
returned by value in registers (just like in C). Like I mentioned, there's
a list of designated return registers (on x86-64 there are two such
registers), with specially allocated stack memory to act as extra
"registers" beyond the two designated hardware registers (if necessary).
This specially allocated stack memory is not explicitly visible at the LLVM
IR level and comes into existence in the LLVM backend as LLVM IR is
translated into a form closer to machine level.

The exact ABI details of a function call at the LLVM IR level go down quite
a rabbit hole that you probably don't care about. E.g. there are the
`sret`, `byval`, and `inalloca` parameter attributes
(http://llvm.org/docs/LangRef.html#parameter-attributes) and calling
conventions (http://llvm.org/docs/LangRef.html#calling-conventions), which
control very low-level details of the mapping to the processor-specific ABI.
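
To give a taste of that rabbit hole, here is a hand-written sketch of
`sret`: the caller allocates memory for the result and the callee fills it
in instead of returning a value. The type and function names are made up,
and this uses the typed-pointer syntax of this era (current releases spell
the attribute `ptr sret(%struct.Big)`):

%struct.Big = type { [32 x i32] }

define void @makeBig(%struct.Big* sret %out) {
entry:
  ; Fill in the caller-provided memory instead of returning a value.
  %first = getelementptr inbounds %struct.Big, %struct.Big* %out, i32 0, i32 0, i32 0
  store i32 123, i32* %first
  ret void
}

define void @useBig() {
entry:
  %tmp = alloca %struct.Big
  call void @makeBig(%struct.Big* sret %tmp)
  ret void
}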



Another related thing (specifically related to C++):

The fact that the processor-specific registers (or the special stack memory
representing extra "registers") are not explicitly visible at the LLVM IR
level is not an issue when lowering C, where all structs are POD (Plain Old
Data). That is why I said that you can think about this particular issue
entirely in terms of lowering C++ to C. In C++, a struct can define
constructors/destructors that allow it to maintain complex internal
invariants. In particular, it can maintain internal invariants that depend
on its being in memory, so the C++ language ABI cannot say that such
structs are passed "by value" in registers. For example,

struct Foo {
  Foo(const Foo &foo) {
    this->x = foo.x;
    this->p = &this->x;
  }
  int x;
  int *p;
};

In a simple function like the following, the struct Foo is returned "by
value", but if you think about it in terms of lowering C++ to C (or LLVM
IR), it actually involves a call to the copy constructor:

Foo getFoo(Foo &f) {
  return f;
}

(look at the generated LLVM IR: https://godbolt.org/g/VbwwJr)

This struct `Foo` cannot be returned by value "in registers" because it
maintains an internal invariant that depends on being in memory (`p` points
at `x`). Given that language-level semantic detail, what would `p` be set
to if the struct were returned in registers? (And it does actually fit in
registers on x86-64.)
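
For reference, the IR behind that godbolt link looks roughly like this
(hand-simplified: the outer function's mangled name and some ABI attributes
are trimmed):

%struct.Foo = type { i32, i32* }

; Foo::Foo(const Foo &) under Itanium name mangling.
declare void @_ZN3FooC1ERKS_(%struct.Foo*, %struct.Foo*)

; The "returned" Foo is constructed directly into caller-allocated
; memory passed through the hidden sret pointer.
define void @getFoo(%struct.Foo* noalias sret %agg.result, %struct.Foo* %f) {
entry:
  call void @_ZN3FooC1ERKS_(%struct.Foo* %agg.result, %struct.Foo* %f)
  ret void
}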

Because this is a language ABI requirement, all compilers must implement it
identically in order to produce machine code that can be linked together.
Therefore, it cannot depend on "how smart" each compiler's optimizer is and
is defined at the C++ language level. For example, you don't want
compilers disagreeing about whether getFoo will return in registers or in
memory depending on whether they optimize this branch (which eliminates
Foo's dependence on its own address, in theory allowing it to be returned
in registers):

struct Foo {
  Foo(const Foo &foo) {
    this->x = foo.x;
    if ((foo.x ^ foo.x) != 0) // Always false.
      this->p = &this->x;
    else
      this->p = nullptr;
  }
  int x;
  int *p;
};

Thus, the rules frequently require things to be passed via explicit pointer
out-parameters even if they in theory could be passed in registers. For
example, the requirement preventing Foo from being passed in registers in
https://itanium-cxx-abi.github.io/cxx-abi/abi.html#return-value applies to
structs that are "non-trivial for the purposes of calls" which basically
means "did you manually write a copy/move constructor or destructor for
this class".

As an interesting factoid, this prevents std::unique_ptr from being passed
in registers, even though it is just a single pointer and doesn't depend on
its own address: its user-written destructor and move constructor make it
"non-trivial for the purposes of calls". So converting an old-style:

Qux *createQux();

to a modern-style:

std::unique_ptr<Qux> createQux();

is actually a slight (probably negligible) pessimization, because the ABI
requirement prevents the latter from being returned in registers. The
optimizer is actually smart enough to pass the std::unique_ptr in
registers, but the fact that it must "agree with other compilers" means its
hands are tied and it has to follow the ABI. However, if you compile with
link-time optimization, the compiler can verify that it is seeing every
possible use of createQux and change them all at once to use registers
instead. If createQux is `static`, the compiler can likewise verify that it
is seeing every possible caller and perform this optimization. And if
createQux is inlined, the cost obviously goes away.
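
A hand-written sketch of how the two declarations differ at the IR level
(all names here are illustrative, not Clang's actual mangled output):

%struct.Qux = type opaque
%unique_ptr = type { %struct.Qux* }

; Qux *createQux(): the pointer comes back as an ordinary return
; value, i.e. in a register at the processor ABI level.
declare %struct.Qux* @createQux_raw()

; std::unique_ptr<Qux> createQux(): being "non-trivial for the
; purposes of calls" forces the return through a hidden pointer to
; caller-allocated memory, even though the payload is one pointer.
declare void @createQux_unique(%unique_ptr* sret)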

The ABI actually prevents the optimizer from passing the std::unique_ptr in
registers, simply because all compilers (or different versions of the same
compiler, or assembly language programmers) must agree on whether it is
passed in registers or not for their code to be compatible. So when people
say that std::unique_ptr has "zero overhead" you can now tell them that
they are technically wrong (they are right in practice though, because
except for ABI limitations compilers optimize std::unique_ptr extremely
well).

As you can see, this is one case where the lowest-level hardware realities
(like whether things are passed in registers) require consideration at the
language semantic level. This is one reason ABIs are such a deep rabbit
hole with so many details: the lowest-level details can have high-level
effects. ABIs also aim to be high-performance, but they sometimes must
sacrifice that for compatibility or to avoid being infeasibly complex.

-- Sean Silva


Once again, I apologise for my poor knowledge of these kinds of low-level
semantics.

Best,

Dimitri Racordon
CUI, Université de Genève
7, route de Drize, CH-1227 Carouge - Switzerland
Phone: +41 22 379 01 24

