<html><body><div style="color:#000; background-color:#fff; font-family:arial, helvetica, sans-serif;font-size:12pt"><div><span>Hi Andy,</span></div><div><br><span></span></div><div><span>Here are more details about the performance difference that we have seen on the gromacs benchmark. We have confirmed that the performance difference is due to the scheduling of the largest basic block (the one with 315 instructions and LLVM DAG ID: inlll30_:bb1) in gromacs's hottest function inlll30_ located in the source file innerf.f. <br></span></div><div><span>We compile using llvm-gfortran with the following command-line options: <br></span></div><div><span>-O3 -march=core2 -mtune=core2</span></div><div><span>I am not sure if the second and third options make any difference on x86, do they?<br></span></div><div>llvm-gfortran is linked to LLVM 2.9.<br></div><div><br></div><div>I am attaching both the assembly code and the schedule  for each of LLVM's ILP

 scheduler and our scheduler (opt). The cycle numbers in the schedules assume that we are scheduling for a 

single-issue machine, which is essentially the trivial machine model 

that we schedule for when we do not have functional-unit information, right? <br></div><div>By looking at the asm code, it is hard for me to identify the cause of the performance difference. However, one observation that I can make is that LLVM's ILP scheduler schedules 8 SQRT instructions next to each other in Cycles 114 to 122, while our scheduler places them apart from each other. When LLVM's ILP scheduler is used, we get a score of 11.2, while when our scheduler is used, we get a score of 12.5, a 12% speed up.<br></div><div><br></div><div>I have a senior student who will be applying a performance analysis tool to this benchmark. At this point, however, he is still learning how to use the performance analysis tool. He is also learning LLVM's itinerary format in order to add an x86 itinerary and measure the performance impact of that. He will be doing that as part of his senior project. However, we will probably need your consultation to fully

 understand the itinerary format and come up with a reasonable machine model for x86. BTW, does LLVM's itinerary format have a way of telling the scheduler that DIV and SQRT instructions are un-pipelined? If yes, how does the scheduler handle un-pipelined instructions?<br><span></span></div><div><br><span></span></div><div><span>Regards</span></div><div><span>-Ghassan</span></div><div><br></div><div style="font-family: arial,helvetica,sans-serif; font-size: 12pt;"><div style="font-family: times new roman,new york,times,serif; font-size: 12pt;"><font face="Arial" size="2"><hr size="1"><b><span style="font-weight: bold;">From:</span></b> Ghassan Shobaki <ghassan_shobaki@yahoo.com><br><b><span style="font-weight: bold;">To:</span></b> Andrew Trick <atrick@apple.com><br><b><span style="font-weight: bold;">Cc:</span></b> "llvmdev@cs.uiuc.edu" <llvmdev@cs.uiuc.edu><br><b><span style="font-weight: bold;">Sent:</span></b> Monday, September 26,

 2011 10:57 AM<br><b><span style="font-weight: bold;">Subject:</span></b> Re: [LLVMdev] Pre-Allocation Schedulers in LLVM<br></font><br>

<div id="yiv1076449574"><div><div style="color: rgb(0, 0, 0); background-color: rgb(255, 255, 255); font-family: arial,helvetica,sans-serif; font-size: 12pt;"><div><span>Hi Andy,</span></div><div><br><span></span></div><div><span>Please see my in-line answers below.</span></div><div><br><span></span></div><div><span>Regards</span></div><div><span><br></span></div><div><span>-Ghassan</span></div><div><br></div><div style="font-family: arial,helvetica,sans-serif; font-size: 12pt;"><div style="font-family: times new roman,new york,times,serif; font-size: 12pt;"><font face="Arial" size="2"><hr size="1"><b><span style="font-weight: bold;">From:</span></b> Andrew Trick <atrick@apple.com><br><b><span style="font-weight: bold;">To:</span></b> Ghassan Shobaki <ghassan_shobaki@yahoo.com><br><b><span style="font-weight: bold;">Cc:</span></b> "llvmdev@cs.uiuc.edu" <llvmdev@cs.uiuc.edu><br><b><span style="font-weight: bold;">Sent:</span></b> Friday,

 September 23, 2011 8:02 PM<br><b><span style="font-weight: bold;">Subject:</span></b> Re: [LLVMdev] Pre-Allocation Schedulers in LLVM<br></font><br><div id="yiv1076449574"><br><div><div>On Sep 23, 2011, at 6:16 AM, Ghassan Shobaki wrote:</div><br class="yiv1076449574Apple-interchange-newline"><blockquote type="cite"><div><div style="color: rgb(0, 0, 0); background-color: rgb(255, 255, 255); font-family: arial,helvetica,sans-serif; font-size: 12pt;"><div><span>Hi Andrew,</span></div><div><br></div><div>What we have is not a patch to any of LLVM's schedulers. We have implemented our own scheduler and integrated it into LLVM 2.9 as yet-another scheduler. Our scheduler uses a combinatorial optimization approach to balance ILP and register pressure. In one experiment, we added more precise latency information for most common x86 instructions to our scheduler and noticed a 10% performance improvement on one FP2006 benchmark, namely gromacs. More precisely, we

 compared:</div><div><br></div><div>(1)

 LLVM2.9+ourScheduler+preciseLatency <br></div><div>against <br></div><div>(2) LLVM2.9+ourScheduler+LLVM's-rough-1-10-latency-model <br></div><div><br></div><div>And (1) was faster than (2) by 10%. We concluded that adding precise latency information may significantly improve the performance of

 programs like gromacs, which have a high degree of ILP. However, we did not do any performance analysis to verify that the performance improvement is indeed due to improving ILP scheduling (reducing pipeline stalls) not due to some other factor, such as memory performance, which may have gotten coincidentally improved by the reordering.</div></div></div></blockquote><div><br></div><div>Thanks for the clear explanation. How much of LLVM's existing scheduler framework do you use? Do you schedule the selection DAG? Are  you using the same scheduling DAG built by the existing scheduler?<br><br>Ghassan: We are using LLVM's DAG but we convert it to our own format. Our scheduler has been originally developed as a standalone scheduler that reads its input DAGs from a text file. We currently have a little software layer that converts LLVM's DAG into our DAG data structure. Note that what we have now is a research prototype. The work may be productized later

 if it gives significantly better performance with an acceptable increase in compile time and if there is mutual interest in productizing it.    <br></div><div><br></div><blockquote type="cite"><div><div style="color: rgb(0, 0, 0); background-color: rgb(255, 255, 255); font-family: arial,helvetica,sans-serif; font-size: 12pt;"><div style="font-family: arial,helvetica,sans-serif; font-size: 12pt;"><div style="font-family: times,serif; font-size: 12pt;"><font face="Arial" size="2"></font></div></div></div></div></blockquote><blockquote type="cite"><div><div style="color: rgb(0, 0, 0); background-color: rgb(255, 255, 255); font-family: arial,helvetica,sans-serif; font-size: 12pt;"><div>We are not currently running any performance analysis tools, but if you are interested, we can identify the code section whose reordering has led to this performance improvement and send you the generated code in both cases. We can probably narrow it down to one

 basic block fairly easily, because our scheduler only operates on the hottest 2 functions in that benchmark and those have a total of 18 basic blocks. BTW, do you guys run SPEC CPU2006? If not, which benchmarks do you use for evaluating LLVM's performance? <br></div></div></div></blockquote><div><br></div>I run all the benchmarks in <a rel="nofollow" target="_blank" href="http://llvm.org/svn/llvm-project/test-suite/trunk">llvm.org/svn/llvm-project/test-suite/trunk</a>. You can see that the "External" tests invoke the major SPEC CPU benchmark suites.</div><div><br><blockquote type="cite"><div><div style="color: rgb(0, 0, 0); background-color: rgb(255, 255, 255); font-family: arial,helvetica,sans-serif; font-size: 12pt;"><div>BTW, our scheduler's performance is about 3%

 faster than LLVM's ILP scheduler on that benchmark when it uses LLVM's 1-10 latency model. So, with the precise latency info, our scheduler is about 13% faster than LLVM's ILP scheduler on that benchmark.</div></div></div></blockquote><div><br></div><div>Great. I'll be interested to see the details in your paper. We have not specifically tuned for SPEC CPU 2006.</div><div><br></div>I'm more interested in any insights you've gained while evaluating your scheduler's performance. It's one thing to measure speedup and another to understand it. And for me, the source of scheduling impact on x86 has been elusive. Are you using a machine-level profiling tool such as VTune?<br><br>Ghassan: We are currently looking into the gromacs case trying to identify the cause of the improved performance. We have been able to narrow it down to one basic block, but it is a relatively large one (~300 instructions). We are not currently using any performance analysis tool, but

 this test case may give us the motivation to start doing that.<br><br>Our work on pre-allocation instruction scheduling with register pressure minimization will be published soon, but if you would like to get an idea about my previous work on scheduling using branch-and-bound enumeration, you can find it on our school's research web page:<br><br>http://www.psut.edu.jo/research/index.html?fs=publications/drpub/ghassanshobakic.htm<br> 

<br>Since the approach is fundamentally different from the heuristic approach used in LLVM's scheduler (and probably all schedulers found in production compilers), it will be hard to gain an interesting insight through a top-down analysis. In a heuristic approach, you are trying to come up with a prioritized set of rules for *guessing* the best choice, while in a branch-and-bound approach, you are trying to come up with a set of pruning techniques to make your exhaustive search as efficient as possible. However, a bottom-up analysis that examines particular test cases (like what we are currently trying to do with gromacs) may give some interesting insights. BTW, there are two other test cases for which our scheduler found significantly better schedules than LLVM's scheduler using LLVM's latencies (lbm on x86-64 (21%) and milc on x86-32 (13%)).  I will keep you updated with any findings that we may come up with.  <br> <br><blockquote

 type="cite"><div><div style="color: rgb(0, 0, 0); background-color: rgb(255, 255, 255); font-family: arial,helvetica,sans-serif; font-size: 12pt;"><div>What may be more interesting for you though is finding out whether adding more precise latency information improves the performance of LLVM's ILP scheduler. We are interested in answering that question too, and are willing to add an x86 machine model to LLVM if that task does not turn out to be too complicated. So, how easy will it be to add an x86 itinerary? Can you point me to the file or files that have to be added or changed? Where can I find an example of an existing itinerary for some other target?</div></div></div></blockquote><div><br></div><div>ARM itineraries are in lib/Target/ARM/ARMScheduleV?.td. A description of the itinerary format is in include/llvm/Target/TargetSchedule.td.</div><div><div><br></div><div>It's easy to add a new itinerary, but hard to specify the behavior of x86

 microarchitecture to put it mildly.The itinerary format is quite detailed and verbose as a result of being target independent. It's reasonably powerful, but you have to play tricks such as defining fake functional units to express constraints. You may run into some difficulty with micro-ops, but let's worry about that later.</div><div><br></div><div>Fortunately, latency is modeled independent of pipeline and functional units and much simpler to deal with. You won't be able to precisely specify the latency of the bypass network, but I wouldn't be concerned with that level of detail.</div><div><br></div></div><blockquote type="cite"><div><div style="color: rgb(0, 0, 0); background-color: rgb(255, 255, 255); font-family: arial,helvetica,sans-serif; font-size: 12pt;"><div>BTW, In the above described experiment, we have only added latency information; we did not add functional-unit information. However, we are interested in adding both, but we would like to

 do it in

 two steps (first latency only then functional unit info) and measure the impact of each step.</div></div></div></blockquote><div><br></div>Good.</div><div><br></div><div>Thanks,</div><div>-Andy</div><div><br><div></div><blockquote type="cite"><div><div style="color: rgb(0, 0, 0); background-color: rgb(255, 255, 255); font-family: arial,helvetica,sans-serif; font-size: 12pt;"><div style="font-family: arial,helvetica,sans-serif; font-size: 12pt;"><div style="font-family: times new roman,new york,times,serif; font-size: 12pt;"><font face="Arial" size="2"><b><span style="font-weight: bold;">From:</span></b> Andrew Trick <<a rel="nofollow" ymailto="mailto:atrick@apple.com" target="_blank" href="mailto:atrick@apple.com">atrick@apple.com</a>><br><b><span style="font-weight: bold;">To:</span></b> Ghassan Shobaki <<a rel="nofollow" ymailto="mailto:ghassan_shobaki@yahoo.com" target="_blank"

 href="mailto:ghassan_shobaki@yahoo.com">ghassan_shobaki@yahoo.com</a>><br><b><span style="font-weight: bold;">Cc:</span></b> "<a rel="nofollow" ymailto="mailto:llvmdev@cs.uiuc.edu" target="_blank" href="mailto:llvmdev@cs.uiuc.edu">llvmdev@cs.uiuc.edu</a>" <<a rel="nofollow" ymailto="mailto:llvmdev@cs.uiuc.edu" target="_blank" href="mailto:llvmdev@cs.uiuc.edu">llvmdev@cs.uiuc.edu</a>><br><b><span style="font-weight: bold;">Sent:</span></b> Wednesday, September 21, 2011 6:54 AM<br><b><span style="font-weight: bold;">Subject:</span></b> Re: [LLVMdev] Pre-Allocation Schedulers in LLVM<br></font><br><div id="yiv1076449574"><div><div>On Sep 17, 2011, at 10:07 AM, Ghassan

 Shobaki wrote:</div><blockquote type="cite"><div><div style="color: rgb(0, 0, 0); background-color: rgb(255, 255, 255); font-family: arial,helvetica,sans-serif; font-size: 12pt;"><div>Hi,</div><div><br></div><div>I am currently writing a paper 

documenting a research project that we have done on pre-allocation 

instruction scheduling to balance ILP and register pressure. In the 

paper we compare the pre-allocation scheduler that we have developed to 

LLVM's default schedulers for two targets: x86-64 and x86-32. We would 

like to include in our paper some brief descriptions of the two LLVM 

schedulers that we are comparing against and some information about the 

machine model that they are scheduling for.  So, it would be great if 

you could confirm or correct the following information and answer my 

questions below:<br></div><div><br> </div><div>The default scheduler for

 the x86-32 target is the bottom-up register-pressure reduction (BURR) 

scheduler, while for the x86-64 target it is the ILP Scheduler. 

According to the

 brief documentation in the 

source file ScheduleDAGRRList, the BURR is a register pressure reduction

 scheduler, while the ILP is a register-pressure aware scheduler that 

tries to balance ILP and register pressure. <br></div></div></div></blockquote><div><br></div><div>Yes. For those wondering how to find out, grep for 'setSchedulingPreference'.</div><div><br></div><blockquote type="cite"><div><div style="color: rgb(0, 0, 0); background-color: rgb(255, 255, 255); font-family: arial,helvetica,sans-serif; font-size: 12pt;"><div>

My questions are:</div><div><br></div><div>-Are there any references (such as published research) that describe each/any of these scheduling algorithms? </div></div></div></blockquote><div><br></div><div>The LLVM ILP scheduler is a mix of heuristics, some of which may be standard practice while others are LLVM inventions. The combination of heuristics can only be described as ad hoc.</div><div><br></div><blockquote type="cite"><div><div style="color: rgb(0, 0, 0); background-color: rgb(255, 255, 255); font-family: arial,helvetica,sans-serif; font-size: 12pt;"><div>-

 By examining the source code, it appears that neither scheduler has a 

machine model describing the functional units and the mapping of 

instructions to functional units on the target x86 machine. Is that 

right?</div></div></div></blockquote><div><br></div><div>The x86 LLVM target does not have a machine pipeline model (instruction itinerary).</div><div><br></div><blockquote type="cite"><div><div style="color: rgb(0, 0, 0); background-color: rgb(255, 255, 255); font-family: arial,helvetica,sans-serif; font-size: 12pt;"><div>- Based on the test cases that I have 

analyzed, it looks that the BURR scheduler sets all latencies to 1, 

which essentially eliminates any scheduling for ILP and makes scheduling

 for register pressure reduction the only objective of this scheduler. 

Can you please confirm or correct this?</div></div></div></blockquote><div><br></div><div>You're obviously wondering why this is called an "ILP" scheduler, which is understandable. I would even argue that the term "scheduler" is not entirely accurate.</div><div><br></div><div>At any rate, this scheduler does not directly increase IPC by avoiding pipeline stalls as you might expect. What we mean by ILP scheduling is increasing the number of "in-flight" instructions. This is often the opposite of register pressure. The key to this is carefully tracking register pressure (locally) and dependence height. The heuristics decide whether an instruction is likely to increase register pressure. If not, then we attempt to overlap chains of dependent instructions.</div><div><br></div><div>Note that LLVM's "hybrid" scheduler works much the same as the ILP scheduler but uses a mix of heuristics that happens to work better for targets with an intruction

 itinerary.</div><div><br></div><blockquote type="cite"><div><div style="color: rgb(0, 0, 0); background-color: rgb(255, 255, 255); font-family: arial,helvetica,sans-serif; font-size: 12pt;"><div>-

 Again based on analyzing test cases, it appears that the ILP scheduler 

sets the latencies of DIV and SQRT (both INT and FP) to 10, while the 

latencies of all other instructions are set to 10. Can you please 

confirm or correct

 this observation?</div></div></div></blockquote><div><br></div><div>The scheduler is aware that certain operations are "very high latency", but frankly that knowledge isn't crucial. 10 is used, because we want a number much greater than 1, and within reasonable expected latency. Without a pipeline model, it just doesn't make much difference.</div><br><blockquote type="cite"><div><div style="color: rgb(0, 0, 0); background-color: rgb(255, 255, 255); font-family: arial,helvetica,sans-serif; font-size: 12pt;"><div>Apparently, the developers 

of the ILP scheduler assumed that this rough latency model would be 

sufficient to do ILP scheduling on the x86 target, because the x86 

hardware has a good dynamic scheduler. Our testing, however, shows that 

this is the case for most but not all programs. For one particular 

benchmark with a high-degree of ILP, using more precise latency info 

significantly improved performance. Will the LLVM developers be 

interested in adding more precise latency info for the x86 target?</div></div></div></blockquote><div><br></div><div>Free performance? Sure. If you're talking about adding an instruction itinerary for X86, that should be no problem. A major change to the scheduler may be harder to swallow. Either way, it would be nice to see the patch and benchmark data or specific examples of code sequences that improve.</div><div><br></div><div>Thanks,</div><div>-Andy</div></div></div><br><br></div></div></div></div></blockquote></div><br></div><br><br></div></div></div></div></div><br><br></div></div></div></body></html>