<html><head><meta http-equiv="Content-Type" content="text/html charset=windows-1252"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div></div><div><br></div><div>Reminder/Background:</div><div><br></div><div><div><br></div><div>I implemented a late machine instruction combiner pass that may replace an instruction sequence with combined instruction(s) when it is beneficial to do so. It provides the infrastructure to evaluate instruction combining patterns like mul+add->madd based on machine trace information. Currently the DAG Combiner greedily generates combined instructions, which is usually a win for code size but can unfortunately cause performance losses. To remedy this, the new pass changes the logic from always generating combined instruction(s) to doing so only when beneficial.</div><div><br></div><div><br></div><div>The design was driven by the desire to make it simple to a) add new patterns and b) add support for machine combining in a target. Consequently the combiner pass comes in 3 patches: First, the target-independent driver, which walks all instructions of a basic block, asks the target for possible combiner patterns, evaluates each pattern by having the target generate the instruction sequence the pattern represents, and finally replaces the old code when the new sequence is more efficient. The patterns and the new code sequences are opaque to the driver. Second, the target-dependent code, which currently supports only AArch64: for a given instruction it records the possible combiner patterns and on demand generates the instruction sequence each one represents. 
Third, an optional dump of the critical path length to support tuning.</div><div><br></div><div><br></div><div style="margin: 0px; font-family: Menlo;"><a href="http://reviews.llvm.org/D4367">http://reviews.llvm.org/D4367</a></div><div style="margin: 0px; font-family: Menlo;"><br></div><div><br></div><div>* Motivation + Example</div><div><br></div><div>Opportunities for this optimization appear across the LLVM test suite and benchmarks. </div><div><br></div><div>Specific example: SingleSource/Benchmarks/Shootout/matrix (compiled with -O3 -flto for AArch64) gains >20%:</div><div><br></div><div>Current assembly snippet:</div><div><span style="font-family: Menlo;"><br></span></div><div><span style="font-family: Menlo;">0000000100007d24</span><span class="Apple-tab-span" style="font-family: Menlo; white-space: pre;">               </span><span style="font-family: Menlo;"><font color="#4f7a28">mul</font>      </span><span style="font-family: Menlo;">w6, w23, w6</span><span class="Apple-tab-span" style="font-family: Menlo; white-space: pre;">             </span><span style="font-family: Menlo;">// Chain of madds</span></div><div><div style="margin: 0px; font-family: Menlo;">0000000100007d28                <font color="#ff6a00">madd</font>    w5, w7, w5, w6<span class="Apple-tab-span" style="white-space: pre;">             </span>// All multiplies on critical path!</div><div style="margin: 0px; font-family: Menlo;">0000000100007d2c                ldp     w6, w7, [x4, #8]<span class="Apple-tab-span" style="white-space: pre;">   </span></div><div style="margin: 0px; font-family: Menlo;">0000000100007d30                ldr     w23, [x11, x2]</div><div style="margin: 0px; font-family: Menlo;">0000000100007d34                <font color="#11053b">madd</font>    w5, w23, w6, w5</div><div style="margin: 0px; font-family: Menlo;">0000000100007d38                ldr     w6, [x12, x2]</div><div style="margin: 0px; font-family: Menlo;">0000000100007d3c                madd    w5, w6, 
w7, w5</div><div style="margin: 0px; font-family: Menlo;">0000000100007d40                ldr     w6, [x13, x2]</div><div style="margin: 0px; font-family: Menlo;">0000000100007d44                ldp     w7, w23, [x4, #16]</div><div style="margin: 0px; font-family: Menlo;">0000000100007d48                madd    w5, w6, w7, w5</div><div style="margin: 0px; font-family: Menlo;"><div style="margin: 0px;">0000000100007d4c                ldr     w6, [x14, x2]</div><div style="margin: 0px;">0000000100007d50                madd    w5, w6, w23, w5</div><div style="margin: 0px;">0000000100007d54                ldr     w6, [x15, x2]</div><div style="margin: 0px;">0000000100007d58                ldp     w7, w23, [x4, #24]</div><div style="margin: 0px;">0000000100007d5c                madd    w5, w6, w7, w5</div><div style="margin: 0px;">0000000100007d60                ldr     w6, [x16, x2]</div><div style="margin: 0px;">0000000100007d64                madd    w5, w6, w23, w5</div><div style="margin: 0px;">0000000100007d68                ldr     w6, [x17, x2]</div><div style="margin: 0px;">0000000100007d6c                ldp     w7, w23, [x4, #32]</div><div style="margin: 0px;">0000000100007d70                ldr     w24, [x0, x2]</div><div style="margin: 0px;">0000000100007d74                madd    w5, w6, w7, w5</div><div><br></div></div><div style="margin: 0px; font-family: Menlo;"><span class="Apple-tab-span" style="white-space: pre;">                             </span>…</div></div><div>With the machine combiner, the multiplies can execute in parallel, shortening the critical path (>20% gain):</div><div><br></div><div><br></div><div style="margin: 0px; font-family: Menlo;">0000000100007cf4                <font color="#4f7a28">mul</font>      w5, w7, w5<span class="Apple-tab-span" style="white-space: pre;">               </span>// Multiplies can execute in parallel</div><div style="margin: 0px; font-family: Menlo;">0000000100007cf8                ldp     w7, w23, [x4, #8]<span 
class="Apple-tab-span" style="white-space: pre;">        </span>// off critical path</div><div style="margin: 0px; font-family: Menlo;">0000000100007cfc                ldr     w24, [x10, x2]</div><div style="margin: 0px; font-family: Menlo;">0000000100007d00                <font color="#ff6a00">mul</font>      w6, w24, w6</div><div style="margin: 0px; font-family: Menlo;">0000000100007d04                ldr     w24, [x11, x2]</div><div style="margin: 0px; font-family: Menlo;">0000000100007d08                mul      w7, w24, w7</div><div style="margin: 0px; font-family: Menlo;">0000000100007d0c                ldr     w24, [x12, x2]</div><div style="margin: 0px; font-family: Menlo;">0000000100007d10                mul      w23, w24, w23</div><div style="margin: 0px; font-family: Menlo;">0000000100007d14                ldr     w24, [x13, x2]</div><div style="margin: 0px; font-family: Menlo;">0000000100007d18                <font color="#d95000">add </font>    w5, w6, w5</div><div style="margin: 0px; font-family: Menlo;">0000000100007d1c                ldp     w6, w25, [x4, #16]</div><div style="margin: 0px; font-family: Menlo;">0000000100007d20                mul      w6, w24, w6</div><div style="margin: 0px; font-family: Menlo;">0000000100007d24                ldr     w24, [x14, x2]</div><div style="margin: 0px; font-family: Menlo;">0000000100007d28                mul      w24, w24, w25</div><div style="margin: 0px; font-family: Menlo;">0000000100007d2c                ldr     w25, [x15, x2]</div><div style="margin: 0px; font-family: Menlo;">0000000100007d30                add     w5, w7, w5</div><div style="margin: 0px; font-family: Menlo;">0000000100007d34                ldp     w7, w26, [x4, #24]</div><div style="margin: 0px; font-family: Menlo;">0000000100007d38                mul      w7, w25, w7</div><div style="margin: 0px; font-family: Menlo;">0000000100007d3c                ldr     w25, [x16, x2]</div><div style="margin: 0px; font-family: 
Menlo;">0000000100007d40                mul      w25, w25, w26</div><div style="margin: 0px; font-family: Menlo;">0000000100007d44                ldr     w26, [x17, x2]</div><div style="margin: 0px; font-family: Menlo;">0000000100007d48                add     w5, w23, w5</div><div style="margin: 0px; font-family: Menlo;">0000000100007d4c                ldp     w23, w27, [x4, #32]</div><div style="margin: 0px; font-family: Menlo;">0000000100007d50                mul      w23, w26, w23</div><div style="margin: 0px; font-family: Menlo;">0000000100007d54                ldr     w26, [x0, x2]</div><div style="margin: 0px; font-family: Menlo;">0000000100007d58                mul      w26, w26, w27</div><div style="margin: 0px; font-family: Menlo;">0000000100007d5c                add     w5, w6, w5</div><div><span class="Apple-tab-span" style="white-space: pre;">                                                              </span>     ….</div><div><br></div></div><div><br><div><div>On Jul 15, 2014, at 6:26 PM, Gerolf Hoflehner <<a href="mailto:ghoflehner@apple.com">ghoflehner@apple.com</a>> wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div style="font-size: 18px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;"><br>Changes:<br>1) Added bool alwaysCombine() (Target/TargetInstrInfo.h) so targets<br>can decide to always replace a given pattern. 
This should be equivalent to<br>the current code in DAGCombine when a given pattern is disabled.<br>2) InstrDepth is now a small vector (MachineCombiner.cpp)<br>3) Added helper function instr2instrSC (MachineCombiner.cpp)<br>4) Improved comments as suggested by reviewers<br><br><a href="http://reviews.llvm.org/D4367">http://reviews.llvm.org/D4367</a><br><br>Files:<br> include/llvm/CodeGen/MachineCombinerPattern.h<br> include/llvm/CodeGen/MachineTraceMetrics.h<br> include/llvm/CodeGen/Passes.h<br> include/llvm/CodeGen/TargetSchedule.h<br> include/llvm/InitializePasses.h<br> include/llvm/Target/TargetInstrInfo.h<br> lib/CodeGen/CMakeLists.txt<br> lib/CodeGen/CodeGen.cpp<br> lib/CodeGen/MachineCombiner.cpp<br> lib/CodeGen/MachineScheduler.cpp<br> lib/CodeGen/MachineTraceMetrics.cpp<br> lib/CodeGen/TargetSchedule.cpp<br> lib/Target/AArch64/AArch64InstrFormats.td<br> lib/Target/AArch64/AArch64InstrInfo.cpp<br> lib/Target/AArch64/AArch64InstrInfo.h<br> lib/Target/AArch64/AArch64TargetMachine.cpp<br> test/CodeGen/AArch64/aarch64-neon-mul-div.ll<br> test/CodeGen/AArch64/early-ifcvt.ll<br><span><D4367.11483.patch></span></div></blockquote></div><br></div></body></html>