feedback on late machine combiner pass [review]
Gerolf Hoflehner
ghoflehner at apple.com
Wed Jul 2 22:34:27 PDT 2014
Hi,
I implemented a pass for a late machine instruction combiner that may replace an instruction sequence with combined instruction(s) when it is beneficial to do so. It provides the infrastructure to evaluate instruction combining patterns like mul+add->madd based on machine trace information. Currently the DAG combiner greedily generates combined instructions, which usually is a win for code size, but unfortunately can cause performance losses. To remedy this, the new pass changes the logic from always generating combined instruction(s) to doing so only when it is beneficial.
The design choice was driven by the desire to make it simple to a) add new patterns and b) add support for machine combining in a target. Consequently the combiner pass comes in 3 patches: First, the target-independent driver that walks all instructions of a basic block, asks the target for possible combiner patterns, evaluates each pattern by having the target generate the instruction sequence the pattern represents, and finally replaces the old code when the new sequence is more efficient. The pattern and the new code sequence are opaque to the driver. Second, the target-dependent code, which currently supports only AArch64: for a given instruction it records the possible combiner patterns and on demand generates the instruction sequence each of them represents. Third, optional dumps of the critical path length for tuning support.
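To make the driver/target split concrete, the target-facing interface could look roughly like the sketch below. The hook and type names are illustrative stand-ins, not the exact names used in the patches; the point is only that the driver treats patterns and replacement sequences as opaque data, and that the target materializes a candidate sequence without inserting it into the block.

```cpp
// Illustrative sketch only -- names and signatures are hypothetical, not the
// exact interface in the patches.  The driver sees patterns and replacement
// sequences as opaque; only the target knows what they mean.
#include <vector>

class MachineInstr;                 // LLVM machine instruction (left opaque here)
using CombinerPattern = unsigned;   // opaque pattern id, interpreted only by the target

struct TargetCombinerHooks {
  // Record every combiner pattern rooted at Root (e.g. a mul feeding an add)
  // and return true if any were found.
  virtual bool getCombinerPatterns(MachineInstr &Root,
                                   std::vector<CombinerPattern> &Patterns) = 0;

  // Materialize the instruction sequence a pattern stands for.  The new
  // instructions are created but not yet inserted into the basic block, so the
  // driver can cost them against the original code (OldMIs) first.
  virtual void genAlternativeSequence(MachineInstr &Root, CombinerPattern P,
                                      std::vector<MachineInstr *> &NewMIs,
                                      std::vector<MachineInstr *> &OldMIs) = 0;

  virtual ~TargetCombinerHooks() = default;
};
```

The AArch64 patch would then implement hooks of this shape for patterns such as mul+add->madd.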
* The patches
1) Target independent
2) Target dependent (AArch64) [ Tim? ]
3) Critical path length dumps I found useful for tuning (nice to have)
* Phabricator
http://reviews.llvm.org/D4367
Perhaps I looked in the wrong place, but I didn’t find people I thought I could send the review to. Please add yourself if you are interested.
* Motivation + Example
Opportunities for this optimization exist across the LLVM test suite and benchmarks.
Specific example: SingleSource/Benchmarks/Shootout/matrix (compiled with -O3 -flto for AArch64), which gives a >20% gain:
Current assembly snippet:
0000000100007d24 mul w6, w23, w6 // Chain of madds
0000000100007d28 madd w5, w7, w5, w6 // All multiplies on critical path!
0000000100007d2c ldp w6, w7, [x4, #8]
0000000100007d30 ldr w23, [x11, x2]
0000000100007d34 madd w5, w23, w6, w5
0000000100007d38 ldr w6, [x12, x2]
0000000100007d3c madd w5, w6, w7, w5
0000000100007d40 ldr w6, [x13, x2]
0000000100007d44 ldp w7, w23, [x4, #16]
0000000100007d48 madd w5, w6, w7, w5
0000000100007d4c ldr w6, [x14, x2]
0000000100007d50 madd w5, w6, w23, w5
0000000100007d54 ldr w6, [x15, x2]
0000000100007d58 ldp w7, w23, [x4, #24]
0000000100007d5c madd w5, w6, w7, w5
0000000100007d60 ldr w6, [x16, x2]
0000000100007d64 madd w5, w6, w23, w5
0000000100007d68 ldr w6, [x17, x2]
0000000100007d6c ldp w7, w23, [x4, #32]
0000000100007d70 ldr w24, [x0, x2]
0000000100007d74 madd w5, w6, w7, w5
…
With the machine combiner the multiplies can execute in parallel, shortening the critical path (>20% gain); a source-level sketch of the kernel follows this listing:
0000000100007cf4 mul w5, w7, w5 // Multiplies can execute in parallel
0000000100007cf8 ldp w7, w23, [x4, #8] // off critical path
0000000100007cfc ldr w24, [x10, x2]
0000000100007d00 mul w6, w24, w6
0000000100007d04 ldr w24, [x11, x2]
0000000100007d08 mul w7, w24, w7
0000000100007d0c ldr w24, [x12, x2]
0000000100007d10 mul w23, w24, w23
0000000100007d14 ldr w24, [x13, x2]
0000000100007d18 add w5, w6, w5
0000000100007d1c ldp w6, w25, [x4, #16]
0000000100007d20 mul w6, w24, w6
0000000100007d24 ldr w24, [x14, x2]
0000000100007d28 mul w24, w24, w25
0000000100007d2c ldr w25, [x15, x2]
0000000100007d30 add w5, w7, w5
0000000100007d34 ldp w7, w26, [x4, #24]
0000000100007d38 mul w7, w25, w7
0000000100007d3c ldr w25, [x16, x2]
0000000100007d40 mul w25, w25, w26
0000000100007d44 ldr w26, [x17, x2]
0000000100007d48 add w5, w23, w5
0000000100007d4c ldp w23, w27, [x4, #32]
0000000100007d50 mul w23, w26, w23
0000000100007d54 ldr w26, [x0, x2]
0000000100007d58 mul w26, w26, w27
0000000100007d5c add w5, w6, w5
….
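For orientation, the hot loop reduces products into an accumulator, roughly as in the sketch below (a simplified stand-in for the benchmark kernel, not its exact source). When each product is folded into a madd, every madd depends on the previous accumulator value, so the multiplies end up serialized on the critical path; leaving the multiplies as separate instructions lets them issue in parallel, and only the adds remain on the accumulator chain.

```cpp
// Simplified stand-in for the kind of reduction in Shootout/matrix (not the
// exact benchmark source).  The update of 'sum' is what the combiner reasons
// about: folded, it becomes a chain of dependent madds; unfolded, the muls
// are independent and only the adds form the dependence chain.
int dot(const int *a, const int *b, int n) {
  int sum = 0;
  for (int i = 0; i < n; ++i)
    sum += a[i] * b[i];
  return sum;
}
```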
* Algorithm in more detail
0. Look for patterns within a basic block.
1. For each instruction, check whether it can be combined with other instructions. For each combination, provide an (opaque) pattern in a list.
2. For each pattern in the list, generate the alternative instruction sequence. This sequence is owned by the machine function, but not hooked up to the basic block etc. This way of creating and evaluating alternative machine instructions results in negligible extra memory use.
3. Evaluate whether the new pattern is more efficient. It is more efficient
a) under Os, when the new pattern has fewer instructions
b) otherwise, when neither critical path nor resource length increases
4. Replace the old instructions when the new sequence is more efficient (see the driver sketch after this list).
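Put together, the driver amounts to roughly the loop below, written against the hypothetical hooks sketched earlier. The helper functions declared here are illustrative stand-ins for the machine trace metric queries (critical path, resource length) and for the basic-block surgery; they are not real LLVM APIs.

```cpp
// Driver sketch built on the hypothetical TargetCombinerHooks, CombinerPattern
// and MachineInstr declarations from the earlier sketch.
#include <vector>

class MachineBasicBlock;

// Illustrative stand-ins for trace metric queries and block surgery.
std::vector<MachineInstr *> instructionsOf(MachineBasicBlock &MBB);
bool increasesCriticalPath(const std::vector<MachineInstr *> &NewMIs,
                           const std::vector<MachineInstr *> &OldMIs);
bool increasesResourceLength(const std::vector<MachineInstr *> &NewMIs,
                             const std::vector<MachineInstr *> &OldMIs);
void spliceBefore(MachineInstr &Root, std::vector<MachineInstr *> &NewMIs);
void eraseAll(std::vector<MachineInstr *> &MIs);

void combineBlock(MachineBasicBlock &MBB, TargetCombinerHooks &TII,
                  bool OptForSize) {
  for (MachineInstr *Root : instructionsOf(MBB)) {          // steps 0 and 1
    std::vector<CombinerPattern> Patterns;
    if (!TII.getCombinerPatterns(*Root, Patterns))
      continue;
    for (CombinerPattern P : Patterns) {                    // step 2
      std::vector<MachineInstr *> NewMIs, OldMIs;
      TII.genAlternativeSequence(*Root, P, NewMIs, OldMIs); // not yet inserted
      bool Better = OptForSize                              // step 3
                        ? NewMIs.size() < OldMIs.size()
                        : !increasesCriticalPath(NewMIs, OldMIs) &&
                          !increasesResourceLength(NewMIs, OldMIs);
      if (Better) {                                         // step 4
        spliceBefore(*Root, NewMIs);  // hook the new sequence into the block
        eraseAll(OldMIs);             // and remove the replaced instructions
        break;
      }
      eraseAll(NewMIs);               // unprofitable: discard the trial code
    }
  }
}
```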
* Side-effects and risks
- Patterns that are evaluated in the machine combiner are no longer combined by the DAG combiner. The logic changes from “always combine” to combining only when beneficial at the machine level. Specifically, any pattern supported by the machine combiner is blocked in the DAG combiner.
- The new code sequence can increase register pressure and result in additional spill/fill code, e.g. at function entry. So it can make a difference whether instruction combining happens at the DAG or the machine level: different optimizations kick in at the DAG level, and the code can look different at the later machine level. One interesting case I spotted is the unit test DivRem, where the madd substitution results in two live ranges for the constant -3. It increases register pressure by one and gives an extra stp/ldp at entry/exit. A possible future remedy is the fusion of equivalent live ranges in special cases.
* Shortcomings
- Patterns are limited to simple combinations like mul+add. But it is possible to enhance the implementation to support combining of more complex sequences like mul, add, add, add, … .
- Local scope: The local basic block scope could be extended to trace scope.
- Modeling accuracy: Dynamic latencies could be significantly larger than modeled (e.g. for loads), so a madd could be generated that should not be. This is a general risk with any machine scheduling technique.
* Acknowledgements
Arnold provided an early inspiring prototype. Yi’s excellent analysis outlined the potential benefits. Andy convinced me of a more general critical path analysis.
Thanks
Gerolf
On Jun 2, 2014, at 3:41 PM, Gerolf Hoflehner <ghoflehner at apple.com> wrote:
> Hi,
>
> we noticed that combining instructions at the ISEL level can result in sub-optimal schedules, since an expensive instruction could be folded and lengthen the critical path. A simple illustrative example of this is folding mul-mul-add into mul-madd (fused multiply-add), which serializes two multiplications instead of allowing them to execute in parallel. Unfortunately ISEL does not have the information necessary to avoid sub-optimal cases in all instances. The alternative is to postpone instruction combining to a later pass and decide there whether the combined code sequence performs better. This combiner pass would run at the machine IR level, use the machine trace model information, and roughly implement the following architecture-independent algorithm, initially targeting mul-add/sub patterns:
>
> * Initial Instruction Combiner Algorithm
> 0. Recognize instruction patterns in a single basic block
> 1. For each add/sub check if it can be combined to madd/msub
> 2. For each pattern found, generate the alternative instruction sequence outside the basic block that contains mul - add/sub
> 3. Evaluate if the new pattern is more efficient:
> a) In Os substitute when the new pattern has fewer instructions
> b) Otherwise substitute when neither critical path nor resource length increases based on the machine trace model
>
> Do you have any concerns? Or is this the pass you always wanted? :-)
>
> Cheers
> Gerolf
>
>
>
>
>
>
>
* Attachments
- MC_TI.patch (91767 bytes): <http://lists.llvm.org/pipermail/llvm-commits/attachments/20140702/01af3d29/attachment.obj>
- MC_AArch64.patch (21051 bytes): <http://lists.llvm.org/pipermail/llvm-commits/attachments/20140702/01af3d29/attachment-0001.obj>
- cpl_dump.patch (1977 bytes): <http://lists.llvm.org/pipermail/llvm-commits/attachments/20140702/01af3d29/attachment-0002.obj>