<html><head><meta http-equiv="Content-Type" content="text/html charset=windows-1252"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><br><div><div>On Feb 25, 2014, at 10:55 AM, Hal Finkel <<a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>> wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div style="font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">----- Original Message -----<br><blockquote type="cite">From: "Stefan Hepp" <<a href="mailto:stefan@stefant.org">stefan@stefant.org</a>><br>To: "Tom Stellard" <<a href="mailto:tom@stellard.net">tom@stellard.net</a>><br>Cc: "<a href="mailto:llvmdev@cs.uiuc.edu">llvmdev@cs.uiuc.edu</a> List" <<a href="mailto:llvmdev@cs.uiuc.edu">llvmdev@cs.uiuc.edu</a>><br>Sent: Tuesday, February 25, 2014 12:22:54 PM<br>Subject: Re: [LLVMdev] ScheduleDAGInstrs/R600 test potential issue with implicit defs<br><br>Hi Tom,<br><br>Thanks a lot for your explanations, now it makes a lot more sense ;)<br><br>I had a slightly closer look at the R600 packetizer, and the issue is<br>that the third LSHL instruction has both an implicit use and<br>*afterwards* an implicit def of T1_XYZW. The latter def causes the<br>current ScheduleDAGInstrs implementation to ignore the implicit use,<br>thus the ScheduleDAG only contains an anti-dependency from the second<br>to<br>the third LSHL and the packetizer can bundle the instructions.<br><br>If the order of the implicit-defs and implicit-use would be different<br>(e.g., like TableGen adds them), or if I apply my patch so that the<br>implicit-use is not ignored, there is a true dependency edge between<br>those two instructions, preventing bundling.<br><br>Removing only the implicit-defs alone has no effect in this example,<br>as<br>then the implicit-use of T1_XYZW is then no longer removed and a true<br>dependency to the def of T1_W is added. Removing all implicit<br>operands<br>however solves that (see attached patch/hack).<br><br>I would argue that the behaviour of<br>ScheduleDAGInstrs::buildSchedGraph<br>for implicit operands is not correct, simply because changing the<br>order<br>of the operands would already prevent the R600 packetizer from<br>bundling<br>in this example.<br></blockquote><br>Unless there is supposed to be an ordering, I agree that this does not make sense.<br></div></blockquote><div><br></div><div>IIRC the implicit operands shouldn’t need to be ordered. Anyway, if the MachineVerifier allows it, I think we should support it. That makes it difficult to optimize liveness tracking routines like this, but so be it.</div><div><br></div><div>I don’t know of a clever way to handle this other than collecting all the Uses in a vector, and inserting them in the Uses set after processing all defs. Patches welcome.</div><div><br></div><div>-Andy</div></div><div><br><blockquote type="cite"><div style="font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;"><br>-Hal<br><br><blockquote type="cite"><br>Cheers,<br> Stefan<br><br>On 2014-02-25 17:35, Tom Stellard wrote:<br><blockquote type="cite">On Tue, Feb 25, 2014 at 04:38:46PM +0100, Stefan Hepp wrote:<br><blockquote type="cite">Hello,<br><br>The ScheduleDAGInstrs::buildSchedGraph() function creates def/uses<br>lists by iterating over all instruction operands and calls<br>addPhysRegDeps() if used post-RA (line ~770 ff.). If an operand<br>is a def, the uses of that registers are cleared<br>(ScheduleDAGInstrs.cpp:333: Uses.eraseAll(Reg); ).<br><br>As a consequence, if an instruction has an explicit use of a<br>register and an implicit def of the same register, the implicit<br>def causes the use to be removed. In our case this leads to<br>missing scheduling dependencies for indirect calls where a return<br>register is used as operand of the indirect call. Therefore I<br>patched the buildSchedGraph() function to iterate over all defs<br>first, then over all uses in a second iteration (maybe not the<br>most efficient solution, but it solves the issue).<br>I am attaching the corresponding patch.<br><br>However, while upgrading our code to LLVM 3.4, I found that the<br>test/CodeGen/R600/store.ll test fails for -mcpu=redwood due to<br>this change. The difference is in the following line in<br>@store_i8: With my patch applied I get:<br><br> LSHL * T0.W, PV.W, literal.x,<br> 3(4.203895e-45), 0(0.000000e+00)<br> LSHL * T1.X, T1.W, PV.W,<br> LSHL * T1.W, literal.x, T0.W,<br> 255(3.573311e-43), 0(0.000000e+00)<br><br>versus the orginal output of LLVM 3.4:<br><br> LSHL * T0.W, PV.W, literal.x,<br> 3(4.203895e-45), 0(0.000000e+00)<br> LSHL T1.X, T1.W, PV.W,<br> LSHL * T1.W, literal.x, PV.W,<br> 255(3.573311e-43), 0(0.000000e+00)<br><br>The difference is the '*' in the second LSHL and the T0.W versus<br>PV.W in the third LSHL.<br><br>Looking at the output of -print-after-all, llc generates the<br>following MC in both plain LLVM 3.4 and LLVM 3.4 with my patch<br>applied right after the "Finalize machine instruction bundles"<br>pass:<br><br> %T0_W<def> = LSHL_eg 0, 0, 1, 0, 0, 0, %T0_W<kill>, 0, 0, 0,<br> -1, %ALU_LITERAL_X, 0, 0, 0, -1, 1, pred:%PRED_SEL_OFF, 3, 0<br> %T1_X<def> = LSHL_eg 0, 0, 1, 0, 0, 0, %T1_W<kill>, 0, 0, 0,<br> -1, %T0_W, 0, 0, 0, -1, 1, pred:%PRED_SEL_OFF, 0, 0,<br> %T1_XYZW<imp-def><br> %T1_W<def> = LSHL_eg 0, 0, 1, 0, 0, 0, %ALU_LITERAL_X, 0, 0,<br> 0, -1, %T0_W<kill>, 0, 0, 0, -1, 1, pred:%PRED_SEL_OFF, 255,<br> 0, %T1_XYZW<imp-use,kill>, %T1_XYZW<imp-def><br><br>Hence, my patch affects either the R600 Packetizer or the R600<br>Control Flow Finalizer pass. A closer look reveals that the 17th<br>operand (the one before the 'pred:$PRED_SEL_OFF' operand)<br>controls the '*' output and is actually set for all three LSHL<br>operands before the Packetizer/Control Flow Finalizer pass, and<br>from a quick glance at the AMD ISA docs I gather that 'PV' versus<br>'T0' seems to be related to this, but the packetizer (?) clears<br>this flag for the second LSHR for redwood/VLIW5 GPUs.<br><br><br>Now, here is my question: Is the R600/store.ll test actually<br>correct (I am not yet familiar enough with the R600 ISA to tell),<br>or does the R600 backend rely on the fact that implicit defs kill<br>the uses of registers, or is there a bug/.. in the R600 backend<br>that counteracts the implicit-defs scheduling issue?<br>And is removing the explicit use by an implicit def at an<br>instruction a bug or a feature?<br><br>If this is an actual bug in the ScheduleDAGInstrs implementation,<br>I would like to adapt my patch so that it can be applied (i.e.,<br>include a fix for the store.ll test), although there might be a<br>better way of doing this (e.g., only erase uses that are not from<br>the same SU, or iterate over defs, implicit-defs, uses,<br>implicit-uses instead of iterating over all operands twice), and<br>I would need someone more familiar with the R600 backend to check<br>if the new output of LLC for store.ll is actually correct or not.<br><br></blockquote><br>Hi Stefan,<br><br>Thanks for the detailed analysis. The '*' marks the end of a<br>packet, so<br>with your patch, the sequence uses 3 packets, and it uses only 2<br>with<br>LLVM 3.4. So, this is a performance regression.<br><br>The root of this problem is that LLVM assumes it is impossible to<br>concurrently access two sub-registers of the same super-register,<br>so<br>it creates a scheduling dependency between the two sub-register<br>defs.<br><br>This is a problem with R600's VLIW architecture, because we want to<br>be<br>able to access all four sub-registers of the same packet, but<br>scheduling<br>dependencies prevent this.<br><br>This is why there are implicit uses and defs of the T*_XYZW<br>registers<br>added to these instructions. Some of these are added by LLVM<br>passes<br>and I think some are added by the R600MachineScheduler to work<br>around<br>the issue described (I'm not that familiar with the scheduler, so I<br>may<br>be wrong about this).<br><br>I think there was an effort recently to try allow concurrent<br>sub-register accesses, but I don't remember who was working on<br>this.<br><br>In this case, it looks like problem is that the last two<br>instructions<br>both define the same super-register, so the packetizer isn't<br>putting<br>them in the same packet.<br><br>So, I think the conclusion is that your patch increases the<br>exposure of<br>the R600 to an already present problem in the register allocator /<br>scheduler. I wonder if we could work around this by stripping all<br>the<br>implicit defs and uses off the instructions once scheduling and<br>register<br>allocation has completed. They are technically incorrect, and I'm<br>not<br>sure if any other passes use them. Do you have any suggestions?<br><br>Thanks,<br>Tom<br><br><br></blockquote><br><br>_______________________________________________<br>LLVM Developers mailing list<br><a href="mailto:LLVMdev@cs.uiuc.edu">LLVMdev@cs.uiuc.edu</a><span class="Apple-converted-space"> </span> <a href="http://llvm.cs.uiuc.edu/">http://llvm.cs.uiuc.edu</a><br><a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a><br><br></blockquote><br>--<span class="Apple-converted-space"> </span><br>Hal Finkel<br>Assistant Computational Scientist<br>Leadership Computing Facility<br>Argonne National Laboratory</div></blockquote></div><br></body></html>