<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">

<meta name="Generator" content="Microsoft Word 14 (filtered medium)">

<style><!--

/* Font Definitions */

@font-face

        {font-family:Calibri;

        panose-1:2 15 5 2 2 2 4 3 2 4;}

/* Style Definitions */

p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0in;

        margin-bottom:.0001pt;

        font-size:11.0pt;

        font-family:"Calibri","sans-serif";}

a:link, span.MsoHyperlink

        {mso-style-priority:99;

        color:blue;

        text-decoration:underline;}

a:visited, span.MsoHyperlinkFollowed

        {mso-style-priority:99;

        color:purple;

        text-decoration:underline;}

span.EmailStyle17

        {mso-style-type:personal-compose;

        font-family:"Calibri","sans-serif";

        color:windowtext;}

.MsoChpDefault

        {mso-style-type:export-only;}

@page WordSection1

        {size:8.5in 11.0in;

        margin:1.0in 1.0in 1.0in 1.0in;}

div.WordSection1

        {page:WordSection1;}

--></style><!--[if gte mso 9]><xml>

<o:shapedefaults v:ext="edit" spidmax="1026" />

</xml><![endif]--><!--[if gte mso 9]><xml>

<o:shapelayout v:ext="edit">

<o:idmap v:ext="edit" data="1" />

</o:shapelayout></xml><![endif]-->

</head>

<body lang="EN-US" link="blue" vlink="purple">

<div class="WordSection1">

<p class="MsoNormal">We have a function that has 256 loads and 256 fmuladds. This block of operations is bounded at either end by an OpenCL barrier (an AMDIL fence instruction). The loads and multiply/adds are ordinarily interleaved… that is, the IR going in

 to code generation looks like:<o:p></o:p></p>

<p class="MsoNormal">  %39 = load float addrspace(3)* getelementptr inbounds ([16 x [17 x float]] addrspace(3)* @sgemm.b, i32 0, i32 0, i32 0), align 4<o:p></o:p></p>

<p class="MsoNormal">  %40 = call float @llvm.fmuladd.f32(float %37, float %39, float %c0.037) nounwind<o:p></o:p></p>

<p class="MsoNormal">  %41 = load float addrspace(3)* getelementptr inbounds ([16 x [17 x float]] addrspace(3)* @sgemm.b, i32 0, i32 0, i32 1), align 4<o:p></o:p></p>

<p class="MsoNormal">  %42 = call float @llvm.fmuladd.f32(float %37, float %41, float %c1.036) nounwind<o:p></o:p></p>

<p class="MsoNormal">… and 254 more of these pairs.<o:p></o:p></p>

<p class="MsoNormal"><o:p> </o:p></p>

<p class="MsoNormal">%39 and %41 (and 254 more loads) are dead after they are used in the immediately following fmuladd.<o:p></o:p></p>

<p class="MsoNormal"><o:p> </o:p></p>

<p class="MsoNormal">RegReductionPQBase::getNodePriority() (in CodeGen/SelectionDAG/ScheduleDAGRRList.cpp) normally returns the SethiUllmanNumber for a node, but there’s a few special cases. ISD::TokenFactor and ISD::CopyToReg return a 0, to push them closer

 to their uses, and similarly for TargetOpcode::EXTRACT_SUBREG, TargetOpcode::SUBREG_TO_REG, and TargetOpcode::INSERT_SUBREG.<o:p></o:p></p>

<p class="MsoNormal">There is also a special case for instructions that are the end of a computational chain, or at the beginning, based on if the instruction has 0 predecessors or 0 successors.<o:p></o:p></p>

<p class="MsoNormal"><o:p> </o:p></p>

<p class="MsoNormal">Our fence instruction has 2 (constant) predecessors and no successors. This causes getNodePriority() to think it is the end of a computational chain and return 0xffff instead of the normal SethiUllmanNumber for the node, to try and get

 the instruction closer to where it’s constants are manifested.<o:p></o:p></p>

<p class="MsoNormal">The result is coming out code generation the loads and fmuladds are separated… We end up with a block of 256 loads, the fence instruction that was at the end of the block, then the 256 fmuladd operations.<o:p></o:p></p>

<p class="MsoNormal">This causes the live range of all 256 loads to GREATLY increase, increasing register pressure so much that we end up with absolutely awful performance.<o:p></o:p></p>

<p class="MsoNormal"><o:p> </o:p></p>

<p class="MsoNormal">We have a local quick fix for this (return the SethiUllmanNumber), but I wanted to get the advice of the list because I’d rather not have local modifications to “target independent” code generation.<o:p></o:p></p>

<p class="MsoNormal">Also, it feels like we must be doing something wrong either in describing our target or in later code generation to get this bad a result.<o:p></o:p></p>

<p class="MsoNormal"><o:p> </o:p></p>

<p class="MsoNormal">Richard<o:p></o:p></p>

</div>

</body>

</html>