[LLVMdev] getNodePriority()

Sun Apr 28 21:47:34 PDT 2013

On Apr 25, 2013, at 8:51 AM, "Relph, Richard" <Richard.Relph at amd.com> wrote:

> We have a function that has 256 loads and 256 fmuladds. This block of operations is bounded at either end by an OpenCL barrier (an AMDIL fence instruction). The loads and multiply/adds are ordinarily interleaved; that is, the IR going into code generation looks like:
>   %39 = load float addrspace(3)* getelementptr inbounds ([16 x [17 x float]] addrspace(3)* @sgemm.b, i32 0, i32 0, i32 0), align 4
>   %40 = call float @llvm.fmuladd.f32(float %37, float %39, float %c0.037) nounwind
>   %41 = load float addrspace(3)* getelementptr inbounds ([16 x [17 x float]] addrspace(3)* @sgemm.b, i32 0, i32 0, i32 1), align 4
>   %42 = call float @llvm.fmuladd.f32(float %37, float %41, float %c1.036) nounwind
> … and 254 more of these pairs.
>  
> %39 and %41 (and 254 more loads) are dead after they are used in the immediately following fmuladd.
>  
> RegReductionPQBase::getNodePriority() (in CodeGen/SelectionDAG/ScheduleDAGRRList.cpp) normally returns the SethiUllmanNumber for a node, but there are a few special cases. ISD::TokenFactor and ISD::CopyToReg return 0, to push them closer to their uses, and similarly for TargetOpcode::EXTRACT_SUBREG, TargetOpcode::SUBREG_TO_REG, and TargetOpcode::INSERT_SUBREG.
> There is also a special case for instructions that are at the end or at the beginning of a computational chain, based on whether the instruction has 0 successors or 0 predecessors.
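
For readers following along, here is a minimal standalone sketch of the special cases described above. It is a model, not the LLVM source: Kind, Unit, and nodePriority are hypothetical names, and the exact start-of-chain and end-of-chain conditions are an assumption based on the description in this thread.

```cpp
#include <cassert>

// Hypothetical stand-ins for the scheduler's types; not the LLVM definitions.
enum class Kind { Ordinary, TokenFactor, CopyToReg, SubregOp };

struct Unit {
  Kind kind;
  unsigned numPreds;     // number of data predecessors
  unsigned numSuccs;     // number of data successors
  unsigned sethiUllman;  // precomputed Sethi-Ullman number
};

// Model of the priority rules described in the thread.
unsigned nodePriority(const Unit &u) {
  // Nodes that should sit close to their uses get priority 0.
  if (u.kind == Kind::TokenFactor || u.kind == Kind::CopyToReg)
    return 0;
  if (u.kind == Kind::SubregOp)
    return 0;
  // Assumed condition: no successors but some predecessors means the
  // node ends a computational chain, so it gets a very large number.
  if (u.numSuccs == 0 && u.numPreds != 0)
    return 0xffff;
  // Assumed condition: no predecessors but some successors means the
  // node starts a chain, so it is kept close to its uses.
  if (u.numPreds == 0 && u.numSuccs != 0)
    return 0;
  return u.sethiUllman;
}
```

In this model, a node shaped like the fence (two predecessors, no successors) gets 0xffff regardless of its Sethi-Ullman number.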

The TargetOpcode checks are likely incorrect because they don't check getMachineOpcode(); it's just that no one wants to change this nearly obsolete code and hunt down regressions. I would be happy to remove those checks altogether, though, if they cause problems. In your case, I think it's unrelated.
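
A small standalone model of why such a check can miss, assuming a node stores machine opcodes in complemented form in a signed 16-bit field, so that the raw opcode of a machine node never equals a small target opcode value. Node, makeMachineNode, naiveCheck, machineCheck, and the value of kExtractSubreg are all hypothetical and only illustrate the idea:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative opcode value only; not the real TargetOpcode numbering.
const unsigned kExtractSubreg = 6;

// Hypothetical model of a DAG node's opcode storage.
struct Node {
  // >= 0: target-independent opcode; < 0: bitwise complement of a machine opcode.
  int16_t nodeType;

  unsigned getOpcode() const { return static_cast<uint16_t>(nodeType); }
  bool isMachineOpcode() const { return nodeType < 0; }
  unsigned getMachineOpcode() const { return static_cast<unsigned>(~nodeType); }
};

// Build a node that carries a machine opcode (stored complemented).
Node makeMachineNode(unsigned machineOpcode) {
  return Node{static_cast<int16_t>(~static_cast<int>(machineOpcode))};
}

// Compares the raw opcode: never matches a machine node in this model.
bool naiveCheck(const Node &n) { return n.getOpcode() == kExtractSubreg; }

// Compares the decoded machine opcode: matches only machine nodes.
bool machineCheck(const Node &n) {
  return n.isMachineOpcode() && n.getMachineOpcode() == kExtractSubreg;
}
```

Under these assumptions the naive comparison fails for a real machine node and can instead match an unrelated target-independent opcode that happens to share the numeric value.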

> Our fence instruction has 2 (constant) predecessors and no successors. This causes getNodePriority() to think it is the end of a computational chain and return 0xffff instead of the normal SethiUllmanNumber for the node, to try and get the instruction closer to where its constants are manifested.
> The result is that, coming out of code generation, the loads and fmuladds are separated… We end up with a block of 256 loads, the fence instruction that was at the end of the block, then the 256 fmuladd operations.
> This causes the live range of all 256 loads to GREATLY increase, increasing register pressure so much that we end up with absolutely awful performance.
>  
> We have a local quick fix for this (return the SethiUllmanNumber), but I wanted to get the advice of the list because I’d rather not have local modifications to “target independent” code generation.
> Also, it feels like we must be doing something wrong either in describing our target or in later code generation to get this bad a result.
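
To make the live-range effect concrete, here is a rough sketch that counts how many loaded values are live at once under the two orderings described above. It is a toy model with hypothetical names (Op, peakLiveLoads, interleaved, separated); it assumes each loaded value dies at the next fmuladd and ignores the accumulators and the shared multiplicand.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

enum class Op { Load, FMulAdd, Fence };

// Peak number of loaded values live at the same time. Each Load defines
// one value; each FMulAdd consumes (kills) one outstanding loaded value.
std::size_t peakLiveLoads(const std::vector<Op> &order) {
  std::size_t live = 0, peak = 0;
  for (Op op : order) {
    if (op == Op::Load) {
      ++live;
      peak = std::max(peak, live);
    } else if (op == Op::FMulAdd && live > 0) {
      --live;
    }
  }
  return peak;
}

// Source order: load, fmuladd, load, fmuladd, ..., fence.
std::vector<Op> interleaved(std::size_t pairs) {
  std::vector<Op> order;
  for (std::size_t i = 0; i < pairs; ++i) {
    order.push_back(Op::Load);
    order.push_back(Op::FMulAdd);
  }
  order.push_back(Op::Fence);
  return order;
}

// Reported schedule: all loads, then the fence, then all fmuladds.
std::vector<Op> separated(std::size_t pairs) {
  std::vector<Op> order(pairs, Op::Load);
  order.push_back(Op::Fence);
  order.insert(order.end(), pairs, Op::FMulAdd);
  return order;
}
```

With 256 pairs, the interleaved order keeps one loaded value live at a time in this model, while the separated order keeps all 256 live across the fence.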

As we discussed off-list, please use -pre-RA-sched=source if possible, and introduce target-specific scheduling in the MachineScheduler pass. There are multiple ways to "plug in" to MachineScheduler.

-pre-RA-sched=source is currently being fixed to work as advertised. A patch is in the works and should be posted fairly soon. It's still usable as-is, but doesn't always preserve ordering.

-Andy