<html><head><meta http-equiv="Content-Type" content="text/html charset=windows-1252"><base href="x-msg://36/"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><br><div><div>On Apr 25, 2013, at 8:51 AM, "Relph, Richard" <<a href="mailto:Richard.Relph@amd.com">Richard.Relph@amd.com</a>> wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div lang="EN-US" link="blue" vlink="purple" style="font-family: Helvetica; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-align: -webkit-auto; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; "><div class="WordSection1" style="page: WordSection1; "><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; ">We have a function that has 256 loads and 256 fmuladds. This block of operations is bounded at either end by an OpenCL barrier (an AMDIL fence instruction). The loads and multiply/adds are ordinarily interleaved… that is, the IR going in to code generation looks like:<o:p></o:p></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; ">  %39 = load float addrspace(3)* getelementptr inbounds ([16 x [17 x float]] addrspace(3)* @sgemm.b, i32 0, i32 0, i32 0), align 4<o:p></o:p></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; ">  %40 = call float @llvm.fmuladd.f32(float %37, float %39, float %c0.037) nounwind<o:p></o:p></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; ">  %41 = load float addrspace(3)* getelementptr inbounds ([16 x [17 x float]] addrspace(3)* @sgemm.b, i32 0, i32 0, i32 1), align 4<o:p></o:p></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; ">  %42 = call float @llvm.fmuladd.f32(float %37, float %41, float %c1.036) nounwind<o:p></o:p></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; ">… and 254 more of these pairs.<o:p></o:p></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; "><o:p> </o:p></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; ">%39 and %41 (and 254 more loads) are dead after they are used in the immediately following fmuladd.<o:p></o:p></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; "><o:p> </o:p></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; ">RegReductionPQBase::getNodePriority() (in CodeGen/SelectionDAG/ScheduleDAGRRList.cpp) normally returns the SethiUllmanNumber for a node, but there’s a few special cases. ISD::TokenFactor and ISD::CopyToReg return a 0, to push them closer to their uses, and similarly for TargetOpcode::EXTRACT_SUBREG, TargetOpcode::SUBREG_TO_REG, and TargetOpcode::INSERT_SUBREG.<o:p></o:p></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; ">There is also a special case for instructions that are the end of a computational chain, or at the beginning, based on if the instruction has 0 predecessors or 0 successors.</div></div></div></blockquote><div><br></div>The TargetOpcode checks are likely incorrect because they're not checking getMachineOpcode(), it's just that no one wants to change this nearly obsolete code and hunt down regressions. I would be happy to remove those checks altogether though if they cause problems. In your case I think it's unrelated.</div><div><br><blockquote type="cite"><div lang="EN-US" link="blue" vlink="purple" style="font-family: Helvetica; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-align: -webkit-auto; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; "><div class="WordSection1" style="page: WordSection1; "><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; ">Our fence instruction has 2 (constant) predecessors and no successors. This causes getNodePriority() to think it is the end of a computational chain and return 0xffff instead of the normal SethiUllmanNumber for the node, to try and get the instruction closer to where it’s constants are manifested.<o:p></o:p></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; ">The result is coming out code generation the loads and fmuladds are separated… We end up with a block of 256 loads, the fence instruction that was at the end of the block, then the 256 fmuladd operations.<o:p></o:p></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; ">This causes the live range of all 256 loads to GREATLY increase, increasing register pressure so much that we end up with absolutely awful performance.<o:p></o:p></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; "><o:p> </o:p></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; ">We have a local quick fix for this (return the SethiUllmanNumber), but I wanted to get the advice of the list because I’d rather not have local modifications to “target independent” code generation.<o:p></o:p></div><div style="margin: 0in 0in 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif; ">Also, it feels like we must be doing something wrong either in describing our target or in later code generation to get this bad a result.</div></div></div></blockquote><br></div><div>As we discussed off-list, please use -pre-RA-sched=source if possible, and introduce target-specific scheduling in the MachineScheduler pass. There are multiple ways to "plug in" to MachineScheduler.</div><div><br></div><div>-pre-RA-sched=source is currently being fixed to work as advertised. A patch is being worked on and expect to see it posted fairly soon. It's still usable as-is, but doesn't always preserve ordering.</div><div><br></div><div>-Andy</div></body></html>