[llvm-dev] RFC: Adding argument allocas

Reid Kleckner via llvm-dev llvm-dev at lists.llvm.org
Tue Dec 27 16:15:02 PST 2016


To recap, it seems that two people so far oppose this proposal, with
tentative support from a number of others. It's not clear to me whether I
should go ahead with this. I'll try to bug some more people to get more
input here.

In the meantime, I think I'll implement the simple pattern matching, since
that can be done incrementally. It can be simplified afterwards if the IR
change goes in, or we can decide to extend it to run at -O0 or to handle
more complex cases.
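
As a rough illustration of what "simple pattern matching" could mean here, the sketch below scans a toy entry block for allocas whose first memory access is a store of a function argument. Everything in it (`Inst`, `Function`, `ArgSlot`, `findArgumentSlots`) is a hypothetical stand-in invented for this example and is not LLVM's API; a real implementation would also have to handle alignment, types, and aliasing.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-ins for IR constructs. These are NOT LLVM classes.
enum class Op { Alloca, Store, Load, Other };

struct Inst {
    Op op;
    std::string result;   // name defined by this instruction (allocas, loads)
    std::string value;    // value operand (stores only)
    std::string pointer;  // address operand (loads and stores)
};

struct Function {
    std::vector<std::string> args;  // argument names, e.g. "%x"
    std::vector<Inst> entry;        // instructions of the entry block only
};

struct ArgSlot {
    std::string alloca;    // the alloca that receives the argument
    std::string argument;  // the argument stored into it
};

static bool isArgument(const Function &f, const std::string &name) {
    for (const auto &a : f.args)
        if (a == name) return true;
    return false;
}

// Report every entry-block alloca whose first memory access is a store of
// a function argument. Anything unknown makes the scan give up.
std::vector<ArgSlot> findArgumentSlots(const Function &f) {
    std::vector<ArgSlot> slots;
    for (std::size_t i = 0; i < f.entry.size(); ++i) {
        if (f.entry[i].op != Op::Alloca) continue;
        const std::string &slot = f.entry[i].result;
        for (std::size_t j = i + 1; j < f.entry.size(); ++j) {
            const Inst &next = f.entry[j];
            // Unknown side effects, or the slot's address escaping: give up.
            if (next.op == Op::Other || next.value == slot) break;
            // Other allocas and accesses to other addresses are skipped.
            if (next.pointer != slot) continue;
            // The first access to the slot decides the outcome.
            if (next.op == Op::Store && isArgument(f, next.value))
                slots.push_back({slot, next.value});
            break;
        }
    }
    return slots;
}
```

A match tells a later lowering step that the slot and the argument could share storage when the ABI already provides memory for that argument. A load that precedes the store, or any unknown instruction, blocks the match.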

On Thu, Dec 8, 2016 at 5:05 PM, Reid Kleckner <rnk at google.com> wrote:

> Clang is currently missing some low-hanging performance and code size
> opportunities when receiving function parameters. For most scalar
> parameters, Clang typically emits an alloca and stores the LLVM SSA
> argument value into the alloca to initialize it. With optimizations, this
> initialization is often removed, but it stays behind in debug builds and
> when the user takes the address of a parameter
> (https://llvm.org/bugs/show_bug.cgi?id=26328). In many cases, the memory
> allocation and store duplicate work already done by the caller.
>
> Case 1: The parameter is already being passed in memory. In this case, we
> waste memory and do an extra load and store to copy it into our own alloca.
> This is very common on 32-bit x86.
>
> Case 2: On Win64, the caller pre-allocates shadow stack slots for all
> register parameters. This allows the callee to “home” register parameters
> to ease debugging and the implementation of variadic functions. In this
> case, the store is not dead, but we fail to use the pre-allocated memory.
>
> Case 3: The parameter is a register parameter in a normal Sys-V ABI. In
> this case, nothing is wasted. Both the memory and store are needed.
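
For a concrete picture of the source pattern behind these three cases, here is a minimal sketch of a function that takes the address of a scalar parameter. The names `bump` and `f` are made up for illustration. In an unoptimized build, `x` needs a memory home, so the alloca and its initializing store survive. Which of the three cases applies then depends on the target ABI rather than on the source.

```cpp
#include <cassert>

// Hypothetical helper that needs the address of its operand.
void bump(int *p) { *p += 1; }

// Taking &x forces x to live in memory. Whether that memory can be the
// caller-provided slot (cases 1 and 2) or must be a fresh stack object
// (case 3) is decided by the calling convention, not by this code.
int f(int x) {
    bump(&x);
    return x;
}
```

On 32-bit x86 with a stack-based convention, `x` already sits in the caller's outgoing argument area, so a private copy duplicates it. On Win64 it arrives in a register but has a shadow slot, and on x86-64 System V it arrives in a register with no pre-allocated memory.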
>
> I think we can improve our code for cases 1 and 2 by making it possible to
> create allocas from parameters, and we can lower this down to a regular
> stack object with a store in case 3. The syntax for this would look like:
>
>   define void @f(i32 %x) {
>     %px = alloca i32, argument i32 %x, align 4
>     ...
>   }
>
> The semantics are the same as the following:
>
>   define void @f(i32 %x) {
>     %px = alloca i32, align 4
>     store i32 %x, i32* %px, align 4
>     ...
>   }
>
> It is invalid to make one of these argument allocas dynamic in any way,
> whether by using inalloca, by using a dynamic element count, or by placing
> it outside the entry block.
>
> If the semantics are the same, that raises the question: why don’t we pattern
> match the alloca and store to elide dead stores and reuse existing argument
> stack slots? My main reason for preferring a new way to do this is that it
> gives us simpler and smaller code in debug builds, where we presumably
> don’t want to do this kind of pattern recognition. My experience with our
> -O0 codegen is that we do a lot of useless copies to initialize parameters,
> and this leads to larger output size, which slows things down. Having a
> more easily recognizable idiom for getting the storage behind parameters if
> it exists feels like a win.
>
> Changing the semantics of alloca affects a lot of things, but I think it
> makes sense to extend alloca here rather than add a new intrinsic or
> instruction that creates local variable stack memory, which passes would
> then have to reason about. Here’s a list of things we’d need to change, off
> the top of my head:
> 1. Inliner: When hoisting static allocas to the entry block, the inliner
> will need to strip the argument operand off of allocas and insert the
> equivalent store at the inlined call site.
> 2. Mem2reg: Loading from an argument alloca needs to produce the argument
> value instead of undef.
> 3. GVN: We need to apply the same logic to GVN and similar store-to-load
> forwarding transforms.
> 4. Instcombine: This transform has some simple store forwarding logic that
> would need to be updated.
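
The pass changes listed above share one rule: an argument alloca starts out holding the argument rather than undef. The sketch below models that rule and the inliner's rewrite on a toy representation. `Alloca`, `Store`, `forwardLoad`, and `stripArgument` are hypothetical names for this example only, not LLVM APIs. It assumes C++17 for `std::optional`, and it ignores the remapping of callee arguments to call-site values that a real inliner performs.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Toy model of a stack slot. This is NOT LLVM's AllocaInst.
struct Alloca {
    std::string name;
    std::optional<std::string> argument;  // set only for an argument alloca
};

struct Store {
    std::string value;
    std::string pointer;
};

// Forwarding rule (Mem2reg, GVN, Instcombine): given the values stored to
// the slot before a load, in program order, return what the load observes.
std::string forwardLoad(const Alloca &slot,
                        const std::vector<std::string> &priorStores) {
    if (!priorStores.empty()) return priorStores.back();  // latest store wins
    if (slot.argument) return *slot.argument;  // initial contents: the argument
    return "undef";                            // plain alloca, never written
}

// Inliner rule: drop the argument operand and hand back the equivalent
// explicit store, so the slot becomes an ordinary alloca.
std::optional<Store> stripArgument(Alloca &slot) {
    if (!slot.argument) return std::nullopt;
    Store s{*slot.argument, slot.name};
    slot.argument.reset();
    return s;
}
```

After `stripArgument`, feeding the returned store's value into `forwardLoad` gives the same answer as before the rewrite, which is the equivalence the proposed semantics rely on.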
>
> I’m sure there are more, but the changes seem worth it to get at these
> low-hanging opportunities.
>
> One other questionable side benefit of doing this is that it would make it
> possible to implement va_start by taking the address of a parameter and
> doing pointer arithmetic. While that code is fairly invalid, it’s exactly
> what the MSVC STL headers do for 32-bit x86. If we make this work in Clang,
> we can remove our stdarg.h and vadefs.h wrapper headers. Users often pass
> flags that cause clang to skip these wrapper headers, and then they file
> bugs complaining that va lists don't work.
>
> At the end of the day, this feels like a straightforward engineering
> improvement to LLVM, and it feels worth doing to me. Does anyone feel
> otherwise or have any suggestions?
>