[llvm-dev] [RFC] Spill2Reg: Selectively replace spills to stack with spills to vector registers
Vasileios Porpodas via llvm-dev
llvm-dev at lists.llvm.org
Wed Jan 26 23:34:56 PST 2022
Thanks Philip for the feedback, I am glad you enjoyed reading it.
Vasileios
On Wed, Jan 26, 2022 at 8:29 PM Philip Reames <listmail at philipreames.com>
wrote:
> This is a wonderful writeup of an interesting design problem, and the
> chosen solution. I learned several things from reading it. Thank you
> taking the time to write this up.
>
> My takeaway after reading was that you've clearly thought through all the
> issues in some depth. I might not lean the same direction on each choice,
> but you've more than convinced me that you've thought about them and made a
> reasoned choice. Solid +1 on the design here.
>
> Philip
> On 1/26/22 6:57 PM, Vasileios Porpodas via llvm-dev wrote:
>
> Hi everyone,
>
> This is an RFC for a new machine IR pass called Spill2Reg, which
> selectively replaces spills to the stack with spills to vector registers.
> The chain of patches can be found here: https://reviews.llvm.org/D118298
> (it includes all patches from D118298 to D118305).
>
> ### Overview
> The register allocator emits spill/reload instructions to temporarily save
> register values to memory. These are typically stores to the stack (aka
> spills) and loads from the stack (aka reloads or fills). These instructions
> hurt performance not only because they increase the number of instructions
> executed, but also because they add pressure to the memory resources, which
> may already be heavily used by the actual memory instructions of the
> workload.
>
> Spill2Reg aims at reducing this overhead by selectively replacing
> spills/reloads to/from the stack with spills/reloads to/from registers only
> when it is profitable. The target registers that will hold the spilled
> value must be of a different class than those that caused the spill, and in
> architectures like x86 or ARM we can use vector registers to save values
> from general purpose registers. Spill2reg can be profitable in a couple of
> cases: (i) on targets where spills to stack are always slower than spills
> to registers, (ii) in pathological cases with lots of spill/reload
> instructions back-to-back, (iii) in memory intensive workloads. It is worth
> pointing out that Spill2Reg can be profitable even on targets where spills
> to registers and spills to stack have a similar latency. This is because
> replacing some of the stack instructions with register instructions can
> help remove some stalls caused by bottle-necks in the memory resources.
>
> Early evaluation on a Skylake(Server) x86_64 system shows that Spill2Reg
> can improve performance of both synthetic tests and of real-life workloads.
>
>
> Why Spill to Registers?
> =======================
> There are a couple of reasons why it makes sense to spill to registers
> instead of memory. To summarize: (i) there is usually a lot of free
> register space even when spilling and (ii) spilling to vector registers can
> remove back-end stalls. The following sections discuss these points in more
> detail.
>
>
> 1. Free register space even when spilling
> -----------------------------------------
> Consider the following code:
> ```
> int D0, D1, D2, ..., D18;
> foo() {
> int t0 = D0
> int t1 = D1
> ...
> int t18 = D18
> // Some code
> ... = t0
> ... = t1
> ...
> ... = t18
> }
> ```
>
> Variables t0 to t18 are all live across the middle point (marked with `//
> Some code`). When compiled for x86_64, this code will assign t0 to t14 to
> registers, but will spill t15 to t18.
> Here is what the assembly looks like:
> ```
> movl D0(%rip), %eax
> movl %eax, -8(%rsp) # 4-byte Spill
> movl D1(%rip), %ecx
> movl D2(%rip), %edx
> movl D3(%rip), %esi
> movl D4(%rip), %edi
> movl D5(%rip), %r8d
> movl D6(%rip), %r9d
> movl D7(%rip), %r10d
> movl D8(%rip), %r11d
> movl D9(%rip), %ebx
> movl D10(%rip), %ebp
> movl D11(%rip), %r14d
> movl D12(%rip), %r15d
> movl D13(%rip), %r12d
> movl D14(%rip), %r13d
> movl D15(%rip), %eax
> movl %eax, -4(%rsp) # 4-byte Spill
> movl D16(%rip), %eax
> movl %eax, -12(%rsp) # 4-byte Spill
> movl D17(%rip), %eax
> movl %eax, -16(%rsp) # 4-byte Spill
> movl D18(%rip), %eax
> movl %eax, -20(%rsp) # 4-byte Spill
> # ... Some code ...
> movl -8(%rsp), %eax # 4-byte Reload
> movl %eax, U0(%rip)
> movl %ecx, U1(%rip)
> movl %edx, U2(%rip)
> movl %esi, U3(%rip)
> movl %edi, U4(%rip)
> movl %r8d, U5(%rip)
> movl %r9d, U6(%rip)
> movl %r10d, U7(%rip)
> movl %r11d, U8(%rip)
> movl %ebx, U9(%rip)
> movl %ebp, U10(%rip)
> movl %r14d, U11(%rip)
> movl %r15d, U12(%rip)
> movl %r12d, U13(%rip)
> movl %r13d, U14(%rip)
> movl -4(%rsp), %eax # 4-byte Reload
> movl %eax, U15(%rip)
> movl -12(%rsp), %eax # 4-byte Reload
> movl %eax, U16(%rip)
> movl -16(%rsp), %eax # 4-byte Reload
> movl %eax, U17(%rip)
> movl -20(%rsp), %eax # 4-byte Reload
> movl %eax, U18(%rip)
> ```
>
> Meanwhile there is a lot of free space in the vector register file, yet we
> are spilling to memory.
>
>
> 2. Memory and register spills use different ports in x86
> --------------------------------------------------------
> According to [1] spill/reloads to/from the stack use ports 2, 3, 4, 7 on a
> Skylake Server x86 micro-architecture, while spills/reloads to/from the
> vector register file use Ports 0, 1, 5. This means that the stack-based and
> vector-register-based spills/reloads use different back-end resources, and
> by replacing one with the other can shift the overhead from some type of
> resources to another.
>
> This is particularly important for Spill2Reg, because it shows that it can
> improve performance even if spills-to-stack and spills-to-registers have a
> similar latency. If the CPU is stalling due to over-subscribed memory
> resources, Spill2Reg can replace some of them with spills-to-registers,
> which can help remove some of the stalls.
>
>
> I think this looks familiar
> ===========================
> If spilling to vector registers looks familiar, it is probably because GCC
> has included support for spilling to registers for quite some time. In x86
> this was done with `vmovd` instructions spilling to `xmm` registers.
>
> However, to the best of my knowledge, spilling to vector registers in x86
> has been disabled [2] due to stability [3,4,5], correctness [6], and
> performance [7] issues.
> The performance issues highlighted in [7] seem to be related to:
> (i) double spilling: the vector register used for spilling get spilled to
> memory, and
> (ii) folded reloads: if a reload can be folded into an instruction, then
> spilling to a vector register results in an additional instruction: the one
> extracting the value from the vector register and inserting it into the
> general purpose register.
>
> I believe that the proposed Spill2Reg design can address all these issues.
> These points are discussed in the following sections.
>
>
> Spill2Reg in x86
> ================
> There are several interesting points about spilling to vector registers in
> x86:
>
> - Writing/Reading to vector registers can be done with a single assembly
> instruction. So spilling to vector register does not introduce more
> instructions.
> - The vector register space is quite large at 2KB (32 ZMM registers X 64
> bytes) in Skylake server, so the chances are that there will be a lot of
> free vector register space in general purpose workloads.
> - If needed we could insert values to any vector lane using
> `(V)PINSR{B,W,D,Q}` and `(V)PEXTR{B,W,D,Q}` (supported since SSE4.1), or to
> lane 0 using `MOVD` (supported since MMX).
> - We can choose between two types of instructions for moving data to
> vector registers: `MOVD/Q` and `PINSRD/Q`:
> - `PINSRD/Q` `PEXTRD/Q` allow us to insert/extract a value to/from any
> lane of the vector register. It is implemented in SSE4_1+.
> - `MOVD/Q` moves a 32/64bit value to the first lane of a vector
> register. It is implemented in MMX+ instruction sets, so it is widely
> supported. GCC’s implementation uses `MOVD`.
> - According to Agner Fog's latency/throughput data for Skylake [9],
> `MOVD` has a lower latency than `PINSRD/Q`, same latency than `MOV` to/from
> memory, but lower throughput:
>
> ```
> uops uops uops
> fused unfused each latency throughput
> domain domain port
> Spill-to-reg
> ------------
> MOVD mm/x r32/64 1 1 p5 2 1
> MOVD r32/64 mm/x 1 1 p0 2 1
>
> PINSRD/Q x,r,i 2 2 2p5 3 2
> PEXTRB/W/D/Q r,x,i 2 2 p0 p5 3 1
>
> Spill-to-stack
> --------------
> MOV m,r 1 2 p237 p4 2 1
> MOV r32/64, m 1 1 p23 2 0.5
> ------------------------------------------------------------
> mm : 64-bit mmx register
> x : 128-bit xmm register
> m : memory operand
> m32 : 32-bit memory operand
>
> Source: Agner Fog's Instruction tables:
> https://www.agner.org/optimize/instruction_tables.pdf
> ```
>
>
> Spill2Reg on ARM
> ----------------
> According to ARM's software optimization guide [8] section 4.3 (page 57),
> it is encouraged to spill to vector registers because these instructions
> have a lower latency than spills to memory.
> Also, given the RISC nature of ARM instructions, there are no folded
> memory operands that we have to worry about when applying Spill2Reg.
>
>
> Caveats
> =======
> - Instruction throughput/latency: Vector insert/extract instructions may
> have lower throughput or higher latency than memory load/stores on certain
> targets. If this is the case they should be used only when memory resources
> are over subscribed.
> - Double spilling: We must take extra care not to spill the vector
> registers that hold the spilled value, because this will only add overhead.
> This was also observed in GCC's implementation [7]. This is a concern when
> Spill2Reg is implemented as part of the register allocator's spiller.
> - Additional instructions: Stack based spills/reloads can potentially be
> folded. This is quite common in x86. For example, given a reload `MOV Reg1,
> [Spill]` and an instruction using `Reg1`, like `ADD Reg2, Reg1`, this can
> be folded into `Add Reg2, [Spill]`, saving one instruction. Spill2Reg can
> potentially block this folding, which will add an additional instruction
> that reloads the data from the vector before the `ADD`. This is a concern
> when Spill2Reg is implemented as part of the register allocator's spiller,
> because folding of spills/reloads may be taking place later on, after all
> spills/reloads have been emitted.
> - Frequency throttling: Some targets will lower their turbo frequency when
> running vector instructions [10]. Throttling is the highest for AVX-512 but
> decreases as the vector width decreases, and it seems that there is no
> frequency drop for 128-bit instructions.
> - Profitability models: There may be different trade-offs for different
> targets or different generations of the same target. Each target should
> implement its own profitability model.
>
>
> Proposed Design
> ===============
> Spill2Reg can be implemented either as part of the spiller, within the
> register allocator, or as a standalone pass after register allocation. Each
> design point has its own benefits and shortcomings. The following sections
> list some of the most important points for and against each design. The
> discussion provides some insights into why we believe Spill2Reg is best
> implemented as a separate pass after the register allocator and after
> physical registers are introduced.
>
> In the spiller versus as a standalone pass
> ------------------------------------------
>
> ### In the spiller
> - Pros:
> - The spiller is a natural location for the Spill2Reg code.
> - All analysis data needed are already computed and readily available.
> - The actual logic is quite simple.
>
> - Cons:
> - The register allocator expects that a Spill/Reload is a store/load. So
> supporting Spill2Reg requires some refactoring.
> - Spill/Reload removal optimizations within the register allocator need
> to be updated to handle spills to registers.
> - The register allocation pass is already quite complex, and adding
> Spill2Reg can increase complexity even further.
> - Folded spills/reloads (see GCC issue [7]): We need to be able to skip
> Spill2Reg if spills/reloads to the stack can be folded. Folding happens
> after the spill code is emitted, so implementing this is not trivial.
> - Double-spills must be avoided (see GCC issue [7]): The vector pseudo
> register that holds the spilled value needs to be colored by the register
> allocator itself, and we need to guarantee that it won't get spilled.
>
> ### As a standalone pass
> - Pros:
> - Small pass, easy to design and maintain.
> - Easier pass testing.
> - Straight-forward tracking of performance improvements/regressions,
> without having to worry if changes in Spill2Reg may affect the decisions of
> the register allocator.
>
> - Cons:
> - May replicate analysis that is already available within the register
> allocator.
> - Yet one more compiler pass.
>
>
> Given the points listed above, and the performance/stability bugs reported
> in GCC [2,3,4,5,6,7], I believe that it makes sense to implement Spill2Reg
> as a standalone pass. This results in a simpler design, with simpler
> testing and easier performance tracking to avoid performance regressions.
>
>
> As a standalone-pass: Before or after VirtRegRewriter
> -----------------------------------------------------
> Another design decision is whether we place the pass before or after the
> `Virtual Register Rewriter` pass.
>
> ### Before VirtRegRewriter
> - Pros:
> - We are still working with pseudo-registers.
> - Straight forward checking of pseudo register interference using
> LiveIntervals and LiveRegMatrix.
>
> - Cons:
> - The main issue is with testing. Testing Spill2Reg would require
> running the register allocator pass first, and relying on it to generate
> spill/reload code, which would make tests very verbose and tricky to write.
> - The pass would need to maintain the state of several data-structures,
> like LiveIntervals and LiveRegMatrix.
>
> ### After VirtRegRewriter
> - Pros:
> - The main point is ease of testing: We can test the pass in isolation,
> using small tests containing hand-written spill/reload code. No need to run
> other passes before it.
> - Fewer data structures to maintain.
>
> - Cons:
> - Spill2Reg needs its own live register tracking implementation since it
> can no longer rely on LiveIntervals and LiveRegMatrix for finding free
> physical registers.
>
>
> Given that the pass design is quite similar in both cases, and that
> testing is significantly nicer in one of them, the preferred option is
> after VirtRegRewriter.
>
>
> Target independent component
> ----------------------------
> The Spill2Reg pass works as follows:
> 1. It collects all candidate spill/reloads. This filters out folded
> spills/reloads and unsupported data-types by the target (target dependent
> legality check).
> 2. Then it iterates through the collected candidates and checks if it is
> profitable to spill to the vector register (target dependent cost model).
> 3. If profitable, it generates the new spills/reloads to/from the vector
> register file and it removes the original instructions (target dependent
> spill/reload instruction generation).
>
> Target dependent components
> ---------------------------
> This includes:
> - Legality checks that test whether the opcodes and types of
> spills/reloads can be handled by the target.
> - The profitability heuristic, which checks whether we applying Spill2Reg
> for a specific set of spills/reloads will lead to better performing code on
> the target.
> - The generation of spill/reload instructions to/from a vector register.
>
>
> Profitability Heuristic in x86
> ------------------------------
> Given a candidate set of spills/reloads, we need to decide whether
> applying Spill2Reg is more profitable. We are currently implementing an all
> or nothing approach for the whole set: we will either replace all
> spills/reloads or none.
>
> According to [9] spills to vector registers have the same latency as
> spills to memory, but have lower throughput on Skylake. So replacing all
> spills to memory with spills to the stack can degrade performance. Instead,
> a better strategy is to spill to registers only when the memory units are
> over-subscribed, in an attempt to reduce any potential back-end stalls.
>
> Since our goal is to avoid back-end stalls caused by bottle-necks on
> memory resources, we need some way to measure when these stalls could
> happen. Ideally we would query a pipeline model (like the one used in
> instruction schedulers) to determine if spills-to-stack can cause pipeline
> stalls. For now the implementation is based on a simple instruction count:
> if the count of memory instructions in the proximity of the spill/reload is
> above a threshold, then apply Spill2Reg.
>
> Please note that the profitability heuristic is implemented as a
> target-dependent components, so other targets can implement their own
> specialized heuristics.
>
>
> Performance Results
> ===================
> The performance evaluation was done using
> `ba51d26ec4519f5b31de3acf643264504ea7bc7c` as a base commit on a Skylake
> Xeon Gold 6154. The code was compiled with `-O3 -march=native`.
> Applying Spill2Reg to the code of the motivating example shown in a
> previous section, replaces some of the spills/reloads with `movd`
> instructions, leading to about 10% better performance.
> Spill2Reg also works on real-life applications: In SPEC CPU 2017 525.x264
> Spill2Reg improves performance by about 0.3%.
>
>
>
> References
> ----------
> [1] WikiChip Skylake microarchitecture:
> https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)
>
> [2] GCC spill to register:
> https://github.com/gcc-mirror/gcc/blob/5a431b60d1f221992e5e9f7a5c032df3b5fa35a5/gcc/config/i386/i386.c#L22874
>
> [3] GCC stability bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70902
>
> [4] GCC stability bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71596
>
> [5] GCC stability bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71555
>
> [6] GCC correctness bug:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71657
>
> [7] GCC performance bug:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71453
>
> [8] ARM Software Optimization Guide:
> https://developer.arm.com/documentation/swog309707/a
>
> [9] Agner Fog's Instruction tables:
> https://www.agner.org/optimize/instruction_tables.pdf
>
> [10] Daniel Lemire's AVX throttling blog:
> https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/
>
>
> Thanks,
> Vasileios
>
> _______________________________________________
> LLVM Developers mailing listllvm-dev at lists.llvm.orghttps://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20220126/0eb0abed/attachment-0001.html>
More information about the llvm-dev
mailing list