<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><br><div><div>On Jul 22, 2013, at 11:50 AM, Tom Stellard <<a href="mailto:tom@stellard.net">tom@stellard.net</a>> wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div style="font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">Hi,<br><br>I'm working on defining a SchedMachineModel for the Southern Islands<br>family of GPUs, and I have two questions related to the<br>MachineScheduler.<br><br>1. I have a resource that can process 15 instructions at the same time.<br>In the TableGen definitions, should I do:<br><br>def HWVMEM : ProcResource<15>;</div></blockquote><blockquote type="cite"><div style="font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">or<br><br>let BufferSize = 15 in {<br>def HWVMEM : ProcResource<1>;<br>}<br></div></blockquote><div><br></div><div>For in-order processors, you always want BufferSize=0. In the current generic scheduler (ConvergingScheduler), BufferSize is effectively a boolean that selects in-order vs. out-of-order (OOO) behavior. (I have code that models the buffers in an OOO processor, but I think it’s too heavyweight to go in the scheduler. 
Maybe someday it can be an analysis tool.)</div><div><br></div><div>let BufferSize = 0 in {</div><div>def HWVMEM : ProcResource<15>;</div><div>}</div></div><div><br></div><div>Now, since you’ll want to plug in your own scheduling strategy, how you interpret the machine model is mostly up to you. What the TargetSchedModel interface does for you is normalize the resources to processor cycles. This is exposed through integer scaling factors (so you can avoid division): getResourceFactor, getMicroOpFactor, and getLatencyFactor.</div><div><br></div><div>So if you have<br><div>def HW1 : ProcResource<15>;</div><div><div>def HW2 : ProcResource<3>;</div><div><br></div><div>then you get</div><div>LatencyFactor=15</div><div>ResourceFactor(HW1)=1</div><div>ResourceFactor(HW2)=5</div><div><br></div><div>That is, one cycle corresponds to 15 units; each use of HW1 consumes 1 unit and each use of HW2 consumes 5.</div><div><br></div></div><blockquote type="cite" dir="auto"><div style="font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">2. Southern Islands has 256 registers, but there is a significant<br>performance penalty if you use more than a certain amount. Do any of<br>the MachineSchedulers support switching into an 'optimize for register<br>pressure mode' once it detects register pressure above a certain limit?<br></div></blockquote></div><div><br></div>The code in ConvergingScheduler (I’ll rename it to GenericScheduler soon) is meant to demonstrate most of the features so developers can copy what they need into their own strategy, add heuristics, and change the underlying data structures (which often makes sense). You can decide whether you want bottom-up only, top-down only, or both.<div><br></div><div>For an in-order processor, I think this becomes much simpler. 
You can do away with most of the complexity in ConvergingScheduler::SchedBoundary and implement a straightforward reservation table. If the resources are fully pipelined, you just count resource units for the current cycle until one of them reaches the latency factor. If they’re not fully pipelined, then you need to define ResourceCycles in the machine’s SchedWrite definitions and implement a simple reservation table (for bottom-up scheduling, mark the earliest cycle at which each resource is used). Some of this could be made into a generic utility, but it’s not much to implement.</div><div><div><br></div><div>Since the strategy defines the priority queues, you can do whatever you want for your register pressure heuristics, from scanning the full queue each time with dynamic heuristics, to re-sorting, to dynamically deferring nodes.<br><div><br></div></div><div>Note that register pressure tracking is handled outside the strategy, in ScheduleDAGMI, so you get it for free without duplicating it.</div><div><br></div><div>However, querying the pressure change for a candidate is done by the strategy. The generic interface, getMaxPressureDelta(), is very clunky right now. I’m going to improve it, but if you’re writing a target-specific strategy, it’s probably easier to query a pressure set for a specific register class directly.</div><div><br></div><div>e.g.</div><div>RC = TRI->getRegClass(R)</div><div>PSetIDs = TRI->getRegClassPressureSets(RC)</div><div><br></div><div>Now you have a raw pointer right into the machine model’s sentinel-terminated array of the pressure set IDs affected by a register class (targets often have several overlapping register classes). You can choose one of those sets to track, or track them all.
I’m about to commit a patch that will sort them by the number of registers in each set, so you can easily grab the largest (the end of the list).</div><div><br></div><div>Then you can directly query the pressure for a specific set:</div><div><br></div><div>P = RPTracker.getPressureAfterInst(I)</div><div>diff = P[PSetID] - RPTracker.getRegSetPressureAtPos()[PSetID]</div><div><br></div><div>Note that how you define your target’s registers can make a big difference in how the pressure sets are formed. Yours don’t look too bad, but in general, remember to use isAllocatable=0 for any classes that don’t take part in regalloc.</div></div><div><br></div><div>-Andy</div></body></html>