[LLVMdev] Excessive register spilling in large automatically generated functions, such as is found in FFTW

Fri Jul 6 05:25:58 PDT 2012

On Fri, Jul 6, 2012 at 6:39 PM, Jakob Stoklund Olesen <stoklund at 2pi.dk> wrote:
>
> On Jul 5, 2012, at 9:06 PM, Anthony Blake <amb33 at cs.waikato.ac.nz> wrote:
>
>> I've noticed that LLVM tends to generate suboptimal code and spill an
>> excessive number of registers in large functions, such as those
>> that are automatically generated by FFTW.
>
> One problem might be that we're forcing the 16 stores to the out array to happen in source order, which constrains the schedule. The stores are clearly non-aliasing.
>
>> LLVM generates good code for a function that computes an 8-point
>> complex FFT, but from 16-point upwards, icc or gcc generates much
>> better code. Here is an example of a sequence of instructions from a
>> 32-point FFT, compiled with clang/LLVM 3.1 for x86_64 with SSE:
>>
>>        [...]
>>       movaps  32(%rdi), %xmm3
>>       movaps  48(%rdi), %xmm2
>>       movaps  %xmm3, %xmm1     ### <-- xmm3 mov'ed into xmm1
>>       movaps  %xmm3, %xmm4     ### <-- xmm3 mov'ed into xmm4
>>       addps   %xmm0, %xmm1
>>       movaps  %xmm1, -16(%rbp)        ## 16-byte Spill
>>       movaps  144(%rdi), %xmm3   ### <-- new data mov'ed into xmm3
>>        [...]
>>
>> xmm3 is loaded, duplicated into two registers, and then discarded
>> when other data is loaded into it. Can anyone shed some light on why
>> this might be happening?
>
> I'm not actually seeing this behavior on trunk.
>

I've just tried trunk. Although the behavior shown above isn't
immediately apparent in its output, trunk generates more instructions
and spills more registers than 3.1 does.
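
To make the point about the non-aliasing stores concrete, here is a
minimal sketch of the code shape involved. It is not FFTW's generated
code: the function name fft4 and the interleaved (re, im) float layout
are assumptions made for illustration. All loads happen first, the
arithmetic is straight-line, and the stores to out come last in source
order; the restrict qualifiers state that in and out do not overlap.
Whether that alone changes the schedule or the number of spills is
something I have not verified.

```c
/* Hypothetical 4-point complex DFT on interleaved (re, im) floats.
 * 'restrict' promises the compiler that in and out never overlap. */
void fft4(const float *restrict in, float *restrict out)
{
    /* Load every input up front, as the generated codelets do. */
    const float x0r = in[0], x0i = in[1];
    const float x1r = in[2], x1i = in[3];
    const float x2r = in[4], x2i = in[5];
    const float x3r = in[6], x3i = in[7];

    /* First butterfly stage: sums and differences of paired inputs. */
    const float t0r = x0r + x2r, t0i = x0i + x2i;
    const float t1r = x0r - x2r, t1i = x0i - x2i;
    const float t2r = x1r + x3r, t2i = x1i + x3i;
    const float t3r = x1r - x3r, t3i = x1i - x3i;

    /* Second stage. Multiplying (a, b) by -i gives (b, -a).
     * The stores below are written in source order. */
    out[0] = t0r + t2r;  out[1] = t0i + t2i;
    out[2] = t1r + t3i;  out[3] = t1i - t3r;
    out[4] = t0r - t2r;  out[5] = t0i - t2i;
    out[6] = t1r - t3i;  out[7] = t1i + t3r;
}
```

Compiling a file like this with clang -O3 -S and looking for the
"Spill" comments in the assembly is one way to compare compiler
versions; a function this small is unlikely to spill at all, so the
16- and 32-point cases are the ones to look at.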

amb