[llvm-dev] [RFC] Spill2Reg: Selectively replace spills to stack with spills to vector registers

Madhur Amilkanthwar via llvm-dev llvm-dev at lists.llvm.org
Thu Jan 27 10:27:09 PST 2022


+1. Liked the overall approach and thought process. BTW, in case you didn't
notice, the AMDGPU backend already has a similar pass in place.

I have some suggestions/thoughts.

> Please note that the profitability heuristic is implemented as a
target-dependent component, so other targets can implement their own
specialized heuristics.

Was there any thought about finer control? E.g., controlling spilling
decisions for a certain region of the code, like a loop. Attaching metadata
to a loop could help control the decisions. It is quite possible that
spilling to vector registers is profitable in one loop but not in another.
Designing robust heuristics is a hard problem.



On Thu, Jan 27, 2022, 11:41 PM Vasileios Porpodas via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> Thanks for your comments Sjoerd,
>
> So far I have only tried a small synthetic test on AArch64, like the one
> with the back-to-back spills and reloads, and it seems to improve
> performance there too. I definitely need to do some more extensive
> performance evaluation, including on AArch64.
> I have seen a few SPEC benchmarks improve, not just x264, but this depends
> on the compiler options used and the exact compiler commit checked out.
> But yes, the performance improvements are usually around 0.5%.
>
> Vasileios
>
> On Thu, Jan 27, 2022 at 7:59 AM Sjoerd Meijer <Sjoerd.Meijer at arm.com>
> wrote:
>
>> Same here, enjoyed reading this. Just a drive-by comment about this:
>>
>> > Spill2Reg also works on real-life applications: In SPEC CPU 2017
>> 525.x264 Spill2Reg improves performance by about 0.3%.
>>
>> Our codegen for SPEC is relatively inefficient and there is a lot to gain
>> here (compared to GCC), especially for x264 where we are more than 20%
>> behind. That's for AArch64 by the way, but since we are mostly fixing
>> target-independent things, I guess that's also the case for X86. While
>> this 0.3% is nice (every little helps), I think things would look quite
>> different once the inefficiencies are addressed (we are working on some of
>> them), and Spill2Reg, which is fixing up things now, may not be able to do
>> so when that is the case. Long story short: we only have a 10% uplift for
>> one case, so perhaps we need some more performance numbers? Ideally for
>> AArch64 too? 🙂
>>
>> But thanks again for the write up and the work, really nice.
>>
>> Cheers.
>> ------------------------------
>> *From:* llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of
>> Vasileios Porpodas via llvm-dev <llvm-dev at lists.llvm.org>
>> *Sent:* 27 January 2022 07:34
>> *To:* Philip Reames <listmail at philipreames.com>
>> *Cc:* llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org>
>> *Subject:* Re: [llvm-dev] [RFC] Spill2Reg: Selectively replace spills to
>> stack with spills to vector registers
>>
>> Thanks Philip for the feedback, I am glad you enjoyed reading it.
>>
>> Vasileios
>>
>> On Wed, Jan 26, 2022 at 8:29 PM Philip Reames <listmail at philipreames.com>
>> wrote:
>>
>> This is a wonderful writeup of an interesting design problem, and the
>> chosen solution.  I learned several things from reading it.  Thank you for
>> taking the time to write this up.
>>
>> My takeaway after reading was that you've clearly thought through all the
>> issues in some depth.  I might not lean the same direction on each choice,
>> but you've more than convinced me that you've thought about them and made a
>> reasoned choice.  Solid +1 on the design here.
>>
>> Philip
>> On 1/26/22 6:57 PM, Vasileios Porpodas via llvm-dev wrote:
>>
>> Hi everyone,
>>
>> This is an RFC for a new machine IR pass called Spill2Reg, which
>> selectively replaces spills to the stack with spills to vector registers.
>> The chain of patches can be found here: https://reviews.llvm.org/D118298
>> (it includes all patches from D118298 to D118305).
>>
>> ### Overview
>> The register allocator emits spill/reload instructions to temporarily
>> save register values to memory. These are typically stores to the stack
>> (aka spills) and loads from the stack (aka reloads or fills). These
>> instructions hurt performance not only because they increase the number of
>> instructions executed, but also because they add pressure to the memory
>> resources, which may already be heavily used by the actual memory
>> instructions of the workload.
>>
>> Spill2Reg aims at reducing this overhead by selectively replacing
>> spills/reloads to/from the stack with spills/reloads to/from registers,
>> only when it is profitable. The target registers that will hold the
>> spilled value must be of a different class than those that caused the
>> spill; on architectures like x86 or ARM we can use vector registers to
>> save values from general purpose registers. Spill2Reg can be profitable
>> in several cases: (i) on targets where spills to the stack are always
>> slower than spills to registers, (ii) in pathological cases with lots of
>> back-to-back spill/reload instructions, (iii) in memory-intensive
>> workloads. It is worth pointing out that Spill2Reg can be profitable even
>> on targets where spills to registers and spills to the stack have a
>> similar latency. This is because replacing some of the stack instructions
>> with register instructions can help remove stalls caused by bottlenecks
>> in the memory resources.
>>
>> Early evaluation on a Skylake (Server) x86_64 system shows that Spill2Reg
>> can improve the performance of both synthetic tests and real-life
>> workloads.
>>
>>
>> Why Spill to Registers?
>> =======================
>> There are a couple of reasons why it makes sense to spill to registers
>> instead of memory. To summarize: (i) there is usually a lot of free
>> register space even when spilling and (ii) spilling to vector registers can
>> remove back-end stalls. The following sections discuss these points in more
>> detail.
>>
>>
>> 1. Free register space even when spilling
>> -----------------------------------------
>> Consider the following code:
>> ```
>> int D0, D1, D2, ..., D18;
>> void foo() {
>>    int t0 = D0;
>>    int t1 = D1;
>>    ...
>>    int t18 = D18;
>>    // Some code
>>    ... = t0;
>>    ... = t1;
>>    ...
>>    ... = t18;
>> ```
>>
>> Variables t0 to t18 are all live across the middle point (marked with `//
>> Some code`). When compiled for x86_64, this code will assign t0 to t14 to
>> registers, but will spill t15 to t18.
>> Here is what the assembly looks like:
>> ```
>>         movl    D0(%rip), %eax
>>         movl    %eax, -8(%rsp)                  # 4-byte Spill
>>         movl    D1(%rip), %ecx
>>         movl    D2(%rip), %edx
>>         movl    D3(%rip), %esi
>>         movl    D4(%rip), %edi
>>         movl    D5(%rip), %r8d
>>         movl    D6(%rip), %r9d
>>         movl    D7(%rip), %r10d
>>         movl    D8(%rip), %r11d
>>         movl    D9(%rip), %ebx
>>         movl    D10(%rip), %ebp
>>         movl    D11(%rip), %r14d
>>         movl    D12(%rip), %r15d
>>         movl    D13(%rip), %r12d
>>         movl    D14(%rip), %r13d
>>         movl    D15(%rip), %eax
>>         movl    %eax, -4(%rsp)                  # 4-byte Spill
>>         movl    D16(%rip), %eax
>>         movl    %eax, -12(%rsp)                 # 4-byte Spill
>>         movl    D17(%rip), %eax
>>         movl    %eax, -16(%rsp)                 # 4-byte Spill
>>         movl    D18(%rip), %eax
>>         movl    %eax, -20(%rsp)                 # 4-byte Spill
>>         # ...  Some code ...
>>         movl    -8(%rsp), %eax                  # 4-byte Reload
>>         movl    %eax, U0(%rip)
>>         movl    %ecx, U1(%rip)
>>         movl    %edx, U2(%rip)
>>         movl    %esi, U3(%rip)
>>         movl    %edi, U4(%rip)
>>         movl    %r8d, U5(%rip)
>>         movl    %r9d, U6(%rip)
>>         movl    %r10d, U7(%rip)
>>         movl    %r11d, U8(%rip)
>>         movl    %ebx, U9(%rip)
>>         movl    %ebp, U10(%rip)
>>         movl    %r14d, U11(%rip)
>>         movl    %r15d, U12(%rip)
>>         movl    %r12d, U13(%rip)
>>         movl    %r13d, U14(%rip)
>>         movl    -4(%rsp), %eax                  # 4-byte Reload
>>         movl    %eax, U15(%rip)
>>         movl    -12(%rsp), %eax                 # 4-byte Reload
>>         movl    %eax, U16(%rip)
>>         movl    -16(%rsp), %eax                 # 4-byte Reload
>>         movl    %eax, U17(%rip)
>>         movl    -20(%rsp), %eax                 # 4-byte Reload
>>         movl    %eax, U18(%rip)
>> ```
>>
>> Meanwhile there is a lot of free space in the vector register file, yet
>> we are spilling to memory.
>>
>>
>> 2. Memory and register spills use different ports in x86
>> --------------------------------------------------------
>> According to [1], spills/reloads to/from the stack use ports 2, 3, 4, 7
>> on a Skylake Server x86 micro-architecture, while spills/reloads to/from
>> the vector register file use ports 0, 1, 5. This means that stack-based
>> and vector-register-based spills/reloads use different back-end resources,
>> so replacing one with the other can shift the overhead from one type of
>> resource to another.
>>
>> This is particularly important for Spill2Reg, because it shows that it
>> can improve performance even if spills-to-stack and spills-to-registers
>> have a similar latency. If the CPU is stalling due to over-subscribed
>> memory resources, Spill2Reg can replace some of the stack spills with
>> spills-to-registers, which can help remove some of the stalls.
>>
>>
>> I think this looks familiar
>> ===========================
>> If spilling to vector registers looks familiar, it is probably because
>> GCC has included support for spilling to registers for quite some time. In
>> x86 this was done with `vmovd` instructions spilling to `xmm` registers.
>>
>> However, to the best of my knowledge, spilling to vector registers in x86
>> has been disabled [2] due to stability [3,4,5], correctness [6], and
>> performance [7] issues.
>> The performance issues highlighted in [7] seem to be related to:
>> (i) double spilling: the vector register used for spilling itself gets
>> spilled to memory, and
>> (ii) folded reloads: if a reload can be folded into an instruction, then
>> spilling to a vector register results in an additional instruction: the one
>> extracting the value from the vector register and inserting it into the
>> general purpose register.
>>
>> I believe that the proposed Spill2Reg design can address all these
>> issues. These points are discussed in the following sections.
>>
>>
>> Spill2Reg in x86
>> ================
>> There are several interesting points about spilling to vector registers
>> in x86:
>>
>> - Writing/reading to/from vector registers can be done with a single
>> assembly instruction, so spilling to a vector register does not introduce
>> additional instructions.
>> - The vector register space is quite large, at 2KB (32 ZMM registers x 64
>> bytes) on Skylake Server, so chances are that there will be a lot of free
>> vector register space in general purpose workloads.
>> - If needed, we could insert values into any vector lane using
>> `(V)PINSR{B,W,D,Q}` and `(V)PEXTR{B,W,D,Q}` (supported since SSE4.1), or
>> into lane 0 using `MOVD` (supported since MMX).
>> - We can choose between two types of instructions for moving data to
>> vector registers: `MOVD/Q` and `PINSRD/Q`:
>>   - `PINSRD/Q` and `PEXTRD/Q` allow us to insert/extract a value to/from
>> any lane of the vector register. They are implemented in SSE4.1+.
>>   - `MOVD/Q` moves a 32/64-bit value to the first lane of a vector
>> register. It is implemented in the MMX+ instruction sets, so it is widely
>> supported. GCC's implementation uses `MOVD`.
>>   - According to Agner Fog's latency/throughput data for Skylake [9],
>> `MOVD` has a lower latency than `PINSRD/Q`, the same latency as `MOV`
>> to/from memory, but lower throughput:
>>
>> ```
>>                   uops   uops      uops
>>                   fused  unfused   each   latency  throughput
>>                   domain domain    port
>> Spill-to-reg
>> ------------
>> MOVD mm/x r32/64    1     1        p5       2       1
>> MOVD r32/64 mm/x    1     1        p0       2       1
>>
>> PINSRD/Q x,r,i      2     2        2p5      3       2
>> PEXTRB/W/D/Q r,x,i  2     2       p0 p5     3       1
>>
>> Spill-to-stack
>> --------------
>> MOV  m,r            1     2       p237 p4   2       1
>> MOV r32/64, m       1     1        p23      2       0.5
>> ------------------------------------------------------------
>> mm     : 64-bit mmx register
>> x      : 128-bit xmm register
>> r      : general purpose register
>> r32/64 : 32/64-bit general purpose register
>> m      : memory operand
>> i      : immediate operand
>>
>> Source: Agner Fog's Instruction tables:
>> https://www.agner.org/optimize/instruction_tables.pdf
>> ```
>>
>>
>> Spill2Reg on ARM
>> ================
>> According to ARM's software optimization guide [8], section 4.3 (page 57),
>> spilling to vector registers is encouraged because these instructions have
>> a lower latency than spills to memory.
>> Also, given the RISC nature of ARM instructions, there are no folded
>> memory operands to worry about when applying Spill2Reg.
>>
>>
>> Caveats
>> =======
>> - Instruction throughput/latency: Vector insert/extract instructions may
>> have lower throughput or higher latency than memory loads/stores on
>> certain targets. If that is the case, they should be used only when memory
>> resources are over-subscribed.
>> - Double spilling: We must take extra care not to spill the vector
>> registers that hold spilled values, because this would only add overhead.
>> This was also observed in GCC's implementation [7]. It is a concern when
>> Spill2Reg is implemented as part of the register allocator's spiller.
>> - Additional instructions: Stack-based spills/reloads can potentially be
>> folded. This is quite common in x86. For example, given a reload `MOV
>> Reg1, [Spill]` and an instruction using `Reg1`, like `ADD Reg2, Reg1`, the
>> reload can be folded into `ADD Reg2, [Spill]`, saving one instruction.
>> Spill2Reg can potentially block this folding, adding an instruction that
>> reloads the data from the vector register before the `ADD`. This is a
>> concern when Spill2Reg is implemented as part of the register allocator's
>> spiller, because folding of spills/reloads may take place later on, after
>> all spills/reloads have been emitted.
>> - Frequency throttling: Some targets lower their turbo frequency when
>> running vector instructions [10]. Throttling is highest for AVX-512 and
>> decreases as the vector width decreases; there seems to be no frequency
>> drop for 128-bit instructions.
>> - Profitability models: There may be different trade-offs for different
>> targets or different generations of the same target. Each target should
>> implement its own profitability model.
>>
>>
>> Proposed Design
>> ===============
>> Spill2Reg can be implemented either as part of the spiller, within the
>> register allocator, or as a standalone pass after register allocation. Each
>> design point has its own benefits and shortcomings. The following sections
>> list some of the most important points for and against each design. The
>> discussion provides some insights into why we believe Spill2Reg is best
>> implemented as a separate pass after the register allocator and after
>> physical registers are introduced.
>>
>> In the spiller versus as a standalone pass
>> ------------------------------------------
>>
>> ### In the spiller
>> - Pros:
>>   - The spiller is a natural location for the Spill2Reg code.
>>   - All analysis data needed are already computed and readily available.
>>   - The actual logic is quite simple.
>>
>> - Cons:
>>   - The register allocator expects that a spill/reload is a store/load,
>> so supporting Spill2Reg requires some refactoring.
>>   - Spill/Reload removal optimizations within the register allocator need
>> to be updated to handle spills to registers.
>>   - The register allocation pass is already quite complex, and adding
>> Spill2Reg can increase complexity even further.
>>   - Folded spills/reloads (see GCC issue [7]): We need to be able to skip
>> Spill2Reg if spills/reloads to the stack can be folded. Folding happens
>> after the spill code is emitted, so implementing this is not trivial.
>>   - Double-spills must be avoided (see GCC issue [7]): The vector pseudo
>> register that holds the spilled value needs to be colored by the register
>> allocator itself, and we need to guarantee that it won't get spilled.
>>
>> ### As a standalone pass
>> - Pros:
>>   - Small pass, easy to design and maintain.
>>   - Easier pass testing.
>>   - Straightforward tracking of performance improvements/regressions,
>> without having to worry whether changes in Spill2Reg affect the decisions
>> of the register allocator.
>>
>> - Cons:
>>   - May replicate analysis that is already available within the register
>> allocator.
>>   - Yet one more compiler pass.
>>
>>
>> Given the points listed above, and the performance/stability bugs
>> reported in GCC [2,3,4,5,6,7], I believe that it makes sense to implement
>> Spill2Reg as a standalone pass. This results in a simpler design, with
>> simpler testing and easier performance tracking to avoid performance
>> regressions.
>>
>>
>> As a standalone pass: Before or after VirtRegRewriter
>> -----------------------------------------------------
>> Another design decision is whether we place the pass before or after the
>> `Virtual Register Rewriter` pass.
>>
>> ### Before VirtRegRewriter
>> - Pros:
>>   - We are still working with pseudo-registers.
>>   - Straightforward checking of pseudo-register interference using
>> LiveIntervals and LiveRegMatrix.
>>
>> - Cons:
>>   - The main issue is with testing. Testing Spill2Reg would require
>> running the register allocator pass first, and relying on it to generate
>> spill/reload code, which would make tests very verbose and tricky to write.
>>   - The pass would need to maintain the state of several data structures,
>> like LiveIntervals and LiveRegMatrix.
>>
>> ### After VirtRegRewriter
>> - Pros:
>>   - The main point is ease of testing: We can test the pass in isolation,
>> using small tests containing hand-written spill/reload code. No need to run
>> other passes before it.
>>   - Fewer data structures to maintain.
>>
>> - Cons:
>>   - Spill2Reg needs its own live register tracking implementation, since
>> it can no longer rely on LiveIntervals and LiveRegMatrix for finding free
>> physical registers (see the sketch below).
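>>
>> To illustrate the kind of live register tracking this would need, here is
>> a minimal sketch built on LLVM's LivePhysRegs utility. The helper name and
>> the conservative scan to the end of the block are assumptions made for the
>> example, not the implementation in the patches:
>>
>> ```
>> #include "llvm/CodeGen/LivePhysRegs.h"
>> #include "llvm/CodeGen/MachineBasicBlock.h"
>> #include "llvm/CodeGen/MachineRegisterInfo.h"
>> #include "llvm/CodeGen/TargetRegisterInfo.h"
>> using namespace llvm;
>>
>> // Conservatively returns true if PhysReg is free from just after SpillIt
>> // to the end of its block. A real implementation would only need to scan
>> // up to the last reload, but this keeps the sketch short.
>> static bool isVectorRegFreeAfter(MCPhysReg PhysReg, MachineBasicBlock &MBB,
>>                                  MachineBasicBlock::iterator SpillIt,
>>                                  const TargetRegisterInfo &TRI,
>>                                  const MachineRegisterInfo &MRI) {
>>   LivePhysRegs LiveRegs(TRI);
>>   LiveRegs.addLiveOuts(MBB); // Start from the block's live-out set.
>>   for (auto I = MBB.end(); I != SpillIt;) {
>>     --I;
>>     // Reject PhysReg if it (or an aliasing register) is live here, or if
>>     // this instruction clobbers it.
>>     if (!LiveRegs.available(MRI, PhysReg) ||
>>         I->modifiesRegister(PhysReg, &TRI))
>>       return false;
>>     LiveRegs.stepBackward(*I); // Move the liveness point above *I.
>>   }
>>   return LiveRegs.available(MRI, PhysReg);
>> }
>> ```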
>>
>>
>> Given that the pass design is quite similar in both cases, and that
>> testing is significantly nicer in one of them, the preferred option is
>> after VirtRegRewriter.
>>
>>
>> Target independent component
>> ----------------------------
>> The Spill2Reg pass works as follows:
>> 1. It collects all candidate spills/reloads. This filters out folded
>> spills/reloads and data types that are not supported by the target
>> (target-dependent legality check).
>> 2. It then iterates through the collected candidates and checks whether it
>> is profitable to spill to a vector register (target-dependent cost model).
>> 3. If profitable, it generates the new spills/reloads to/from the vector
>> register file and removes the original instructions (target-dependent
>> spill/reload instruction generation).
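>>
>> To make this structure concrete, here is a rough, simplified sketch of
>> what such a pass could look like. The class name and the details are made
>> up for illustration; the actual patches in D118298 may differ:
>>
>> ```
>> #include "llvm/ADT/SmallVector.h"
>> #include "llvm/CodeGen/MachineFunctionPass.h"
>> #include "llvm/CodeGen/TargetInstrInfo.h"
>> #include "llvm/CodeGen/TargetSubtargetInfo.h"
>> using namespace llvm;
>>
>> namespace {
>> class Spill2RegSketch : public MachineFunctionPass {
>> public:
>>   static char ID;
>>   Spill2RegSketch() : MachineFunctionPass(ID) {}
>>
>>   bool runOnMachineFunction(MachineFunction &MF) override {
>>     const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
>>     // 1. Collect candidates: plain (non-folded) stores to / loads from a
>>     //    stack slot. Folded memory accesses are not matched by these
>>     //    helpers, so they are filtered out here.
>>     SmallVector<MachineInstr *, 16> Candidates;
>>     for (MachineBasicBlock &MBB : MF)
>>       for (MachineInstr &MI : MBB) {
>>         int FI;
>>         if (TII->isStoreToStackSlot(MI, FI) ||
>>             TII->isLoadFromStackSlot(MI, FI))
>>           Candidates.push_back(&MI);
>>       }
>>     // 2. Query the target-specific cost model for each group of
>>     //    spills/reloads, and
>>     // 3. emit the vector-register spills/reloads and erase the originals.
>>     //    (Both steps are omitted in this sketch.)
>>     return false;
>>   }
>> };
>> char Spill2RegSketch::ID = 0;
>> } // namespace
>> ```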
>>
>> Target dependent components
>> ---------------------------
>> This includes:
>> - Legality checks that test whether the opcodes and types of the
>> spills/reloads can be handled by the target.
>> - The profitability heuristic, which checks whether applying Spill2Reg to
>> a specific set of spills/reloads will lead to better-performing code on
>> the target.
>> - The generation of the spill/reload instructions to/from a vector
>> register.
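>>
>> As an illustration only, the target-dependent side could be grouped behind
>> a small hook interface like the one sketched below. The interface and
>> method names are invented for the example and are not taken from the
>> patches:
>>
>> ```
>> #include "llvm/ADT/ArrayRef.h"
>> #include "llvm/CodeGen/MachineInstr.h"
>> #include "llvm/MC/MCRegister.h"
>> using namespace llvm;
>>
>> // Hypothetical per-target hooks for Spill2Reg.
>> class Spill2RegTargetHooks {
>> public:
>>   virtual ~Spill2RegTargetHooks() = default;
>>   // Legality: can this spill/reload (opcode, register class, type size)
>>   // be redirected to a vector register on this target?
>>   virtual bool isLegal(const MachineInstr &SpillOrReload) const = 0;
>>   // Profitability: given all spills/reloads of one stack slot, is
>>   // replacing them expected to be faster (e.g. the memory-pressure
>>   // heuristic described below for x86)?
>>   virtual bool
>>   isProfitable(ArrayRef<MachineInstr *> SpillsAndReloads) const = 0;
>>   // Code generation: emit the replacement instruction (e.g. MOVD or
>>   // PINSRD on x86) in place of the stack access.
>>   virtual void replaceWithVectorReg(MachineInstr &SpillOrReload,
>>                                     MCRegister VecReg) const = 0;
>> };
>> ```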
>>
>>
>> Profitability Heuristic in x86
>> ------------------------------
>> Given a candidate set of spills/reloads, we need to decide whether
>> applying Spill2Reg is profitable. We are currently implementing an
>> all-or-nothing approach for the whole set: we will either replace all
>> spills/reloads or none.
>>
>> According to [9], spills to vector registers have the same latency as
>> spills to memory, but lower throughput on Skylake. So blindly replacing
>> all spills to memory with spills to registers can degrade performance.
>> Instead, a better strategy is to spill to registers only when the memory
>> units are over-subscribed, in an attempt to reduce any potential back-end
>> stalls.
>>
>> Since our goal is to avoid back-end stalls caused by bottlenecks on
>> memory resources, we need some way to estimate when these stalls could
>> happen. Ideally we would query a pipeline model (like the one used in
>> instruction schedulers) to determine whether spills-to-stack can cause
>> pipeline stalls. For now the implementation is based on a simple
>> instruction count: if the number of memory instructions in the proximity
>> of the spill/reload is above a threshold, then apply Spill2Reg.
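>>
>> A minimal sketch of such an instruction-count heuristic is shown below.
>> The window size and threshold are placeholder values for illustration, not
>> the ones used in the patches:
>>
>> ```
>> #include "llvm/CodeGen/MachineBasicBlock.h"
>> #include "llvm/CodeGen/MachineInstr.h"
>> using namespace llvm;
>>
>> // Count memory instructions in a small window following the spill/reload
>> // and apply Spill2Reg only if the memory units look over-subscribed.
>> static bool isSpill2RegProfitable(const MachineInstr &SpillOrReload,
>>                                   unsigned Window = 10,
>>                                   unsigned Threshold = 4) {
>>   const MachineBasicBlock &MBB = *SpillOrReload.getParent();
>>   MachineBasicBlock::const_iterator It(&SpillOrReload);
>>   MachineBasicBlock::const_iterator End = MBB.end();
>>   unsigned NumMemInstrs = 0;
>>   for (unsigned Visited = 0; It != End && Visited < Window;
>>        ++It, ++Visited)
>>     if (It->mayLoad() || It->mayStore())
>>       ++NumMemInstrs;
>>   return NumMemInstrs >= Threshold;
>> }
>> ```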
>>
>> Please note that the profitability heuristic is implemented as a
>> target-dependent component, so other targets can implement their own
>> specialized heuristics.
>>
>>
>> Performance Results
>> ===================
>> The performance evaluation was done using
>> `ba51d26ec4519f5b31de3acf643264504ea7bc7c` as a base commit on a Skylake
>> Xeon Gold 6154. The code was compiled with `-O3 -march=native`.
>> Applying Spill2Reg to the code of the motivating example shown in a
>> previous section replaces some of the spills/reloads with `movd`
>> instructions, leading to about 10% better performance.
>> Spill2Reg also works on real-life applications: in SPEC CPU 2017
>> 525.x264, Spill2Reg improves performance by about 0.3%.
>>
>>
>>
>> References
>> ----------
>> [1] WikiChip Skylake microarchitecture:
>> https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)
>>
>> [2] GCC spill to register:
>> https://github.com/gcc-mirror/gcc/blob/5a431b60d1f221992e5e9f7a5c032df3b5fa35a5/gcc/config/i386/i386.c#L22874
>>
>> [3] GCC stability bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70902
>>
>> [4] GCC stability bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71596
>>
>> [5] GCC stability bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71555
>>
>> [6] GCC correctness bug:
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71657
>>
>> [7] GCC performance bug:
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71453
>>
>> [8] ARM Software Optimization Guide:
>> https://developer.arm.com/documentation/swog309707/a
>>
>> [9] Agner Fog's Instruction tables:
>> https://www.agner.org/optimize/instruction_tables.pdf
>>
>> [10] Daniel Lemire's AVX throttling blog:
>> https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/
>>
>>
>> Thanks,
>> Vasileios
>>