[LLVMdev] ScheduleDAGInstrs/R600 test potential issue with implicit defs

Tue Feb 25 19:39:30 PST 2014

On Feb 25, 2014, at 10:55 AM, Hal Finkel <hfinkel at anl.gov> wrote:

> ----- Original Message -----
>> From: "Stefan Hepp" <stefan at stefant.org>
>> To: "Tom Stellard" <tom at stellard.net>
>> Cc: "llvmdev at cs.uiuc.edu List" <llvmdev at cs.uiuc.edu>
>> Sent: Tuesday, February 25, 2014 12:22:54 PM
>> Subject: Re: [LLVMdev] ScheduleDAGInstrs/R600 test potential issue with implicit defs
>> 
>> Hi Tom,
>> 
>> Thanks a lot for your explanations, now it makes a lot more sense ;)
>> 
>> I had a slightly closer look at the R600 packetizer, and the issue is
>> that the third LSHL instruction has both an implicit use and
>> *afterwards* an implicit def of T1_XYZW. The latter def causes the
>> current ScheduleDAGInstrs implementation to ignore the implicit use,
>> thus the ScheduleDAG only contains an anti-dependency from the second
>> to
>> the third LSHL and the packetizer can bundle the instructions.
>> 
>> If the order of the implicit-defs and implicit-use would be different
>> (e.g., like TableGen adds them), or if I apply my patch so that the
>> implicit-use is not ignored, there is a true dependency edge between
>> those two instructions, preventing bundling.
>> 
>> Removing only the implicit-defs alone has no effect in this example,
>> as
>> then the implicit-use of T1_XYZW is then no longer removed and a true
>> dependency to the def of T1_W is added. Removing all implicit
>> operands
>> however solves that (see attached patch/hack).
>> 
>> I would argue that the behaviour of
>> ScheduleDAGInstrs::buildSchedGraph
>> for implicit operands is not correct, simply because changing the
>> order
>> of the operands would already prevent the R600 packetizer from
>> bundling
>> in this example.
> 
> Unless there is supposed to be an ordering, I agree that this does not make sense.

IIRC the implicit operands shouldn’t need to be ordered. Anyway, if the MachineVerifier allows it, I think we should support it. That makes it difficult to optimize liveness tracking routines like this, but so be it.

I don’t know of a clever way to handle this other than collecting all the Uses in a vector, and inserting them in the Uses set after processing all defs. Patches welcome.

-Andy

> 
> -Hal
> 
>> 
>> Cheers,
>>  Stefan
>> 
>> On 2014-02-25 17:35, Tom Stellard wrote:
>>> On Tue, Feb 25, 2014 at 04:38:46PM +0100, Stefan Hepp wrote:
>>>> Hello,
>>>> 
>>>> The ScheduleDAGInstrs::buildSchedGraph() function creates def/uses
>>>> lists by iterating over all instruction operands and calls
>>>> addPhysRegDeps() if used post-RA (line ~770 ff.). If an operand
>>>> is a def, the uses of that registers are cleared
>>>> (ScheduleDAGInstrs.cpp:333:  Uses.eraseAll(Reg); ).
>>>> 
>>>> As a consequence, if an instruction has an explicit use of a
>>>> register and an implicit def of the same register, the implicit
>>>> def causes the use to be removed. In our case this leads to
>>>> missing scheduling dependencies for indirect calls where a return
>>>> register is used as operand of the indirect call. Therefore I
>>>> patched the buildSchedGraph() function to iterate over all defs
>>>> first, then over all uses in a second iteration (maybe not the
>>>> most efficient solution, but it solves the issue).
>>>> I am attaching the corresponding patch.
>>>> 
>>>> However, while upgrading our code to LLVM 3.4, I found that the
>>>> test/CodeGen/R600/store.ll test fails for -mcpu=redwood due to
>>>> this change. The difference is in the following line in
>>>> @store_i8: With my patch applied I get:
>>>> 
>>>>     LSHL                  * T0.W, PV.W, literal.x,
>>>>   3(4.203895e-45), 0(0.000000e+00)
>>>>     LSHL                  * T1.X, T1.W, PV.W,
>>>>     LSHL                  * T1.W, literal.x, T0.W,
>>>>   255(3.573311e-43), 0(0.000000e+00)
>>>> 
>>>> versus the orginal output of LLVM 3.4:
>>>> 
>>>>     LSHL                  * T0.W, PV.W, literal.x,
>>>>   3(4.203895e-45), 0(0.000000e+00)
>>>>     LSHL                    T1.X, T1.W, PV.W,
>>>>     LSHL                  * T1.W, literal.x, PV.W,
>>>>   255(3.573311e-43), 0(0.000000e+00)
>>>> 
>>>> The difference is the '*' in the second LSHL and the T0.W versus
>>>> PV.W in the third LSHL.
>>>> 
>>>> Looking at the output of -print-after-all, llc generates the
>>>> following MC in both plain LLVM 3.4 and LLVM 3.4 with my patch
>>>> applied right after the "Finalize machine instruction bundles"
>>>> pass:
>>>> 
>>>>     %T0_W<def> = LSHL_eg 0, 0, 1, 0, 0, 0, %T0_W<kill>, 0, 0, 0,
>>>>     -1, %ALU_LITERAL_X, 0, 0, 0, -1, 1, pred:%PRED_SEL_OFF, 3, 0
>>>>     %T1_X<def> = LSHL_eg 0, 0, 1, 0, 0, 0, %T1_W<kill>, 0, 0, 0,
>>>>     -1, %T0_W, 0, 0, 0, -1, 1, pred:%PRED_SEL_OFF, 0, 0,
>>>>     %T1_XYZW<imp-def>
>>>>     %T1_W<def> = LSHL_eg 0, 0, 1, 0, 0, 0, %ALU_LITERAL_X, 0, 0,
>>>>     0, -1, %T0_W<kill>, 0, 0, 0, -1, 1, pred:%PRED_SEL_OFF, 255,
>>>>     0, %T1_XYZW<imp-use,kill>, %T1_XYZW<imp-def>
>>>> 
>>>> Hence, my patch affects either the R600 Packetizer or the R600
>>>> Control Flow Finalizer pass. A closer look reveals that the 17th
>>>> operand (the one before the 'pred:$PRED_SEL_OFF' operand)
>>>> controls the '*' output and is actually set for all three LSHL
>>>> operands before the Packetizer/Control Flow Finalizer pass, and
>>>> from a quick glance at the AMD ISA docs I gather that 'PV' versus
>>>> 'T0' seems to be related to this, but the packetizer (?) clears
>>>> this flag for the second LSHR for redwood/VLIW5 GPUs.
>>>> 
>>>> 
>>>> Now, here is my question: Is the R600/store.ll test actually
>>>> correct (I am not yet familiar enough with the R600 ISA to tell),
>>>> or does the R600 backend rely on the fact that implicit defs kill
>>>> the uses of registers, or is there a bug/.. in the R600 backend
>>>> that counteracts the implicit-defs scheduling issue?
>>>> And is removing the explicit use by an implicit def at an
>>>> instruction a bug or a feature?
>>>> 
>>>> If this is an actual bug in the ScheduleDAGInstrs implementation,
>>>> I would like to adapt my patch so that it can be applied (i.e.,
>>>> include a fix for the store.ll test), although there might be a
>>>> better way of doing this (e.g., only erase uses that are not from
>>>> the same SU, or iterate over defs, implicit-defs, uses,
>>>> implicit-uses instead of iterating over all operands twice), and
>>>> I would need someone more familiar with the R600 backend to check
>>>> if the new output of LLC for store.ll is actually correct or not.
>>>> 
>>> 
>>> Hi Stefan,
>>> 
>>> Thanks for the detailed analysis.  The '*' marks the end of a
>>> packet, so
>>> with your patch, the sequence uses 3 packets, and it uses only 2
>>> with
>>> LLVM 3.4.  So, this is a performance regression.
>>> 
>>> The root of this problem is that LLVM assumes it is impossible to
>>> concurrently access two sub-registers of the same super-register,
>>> so
>>> it creates a scheduling dependency between the two sub-register
>>> defs.
>>> 
>>> This is a problem with R600's VLIW architecture, because we want to
>>> be
>>> able to access all four sub-registers of the same packet, but
>>> scheduling
>>> dependencies prevent this.
>>> 
>>> This is why there are implicit uses and defs of the T*_XYZW
>>> registers
>>> added to these instructions.  Some of these are added by LLVM
>>> passes
>>> and I think some are added by the R600MachineScheduler to work
>>> around
>>> the issue described (I'm not that familiar with the scheduler, so I
>>> may
>>> be wrong about this).
>>> 
>>> I think there was an effort recently to try allow concurrent
>>> sub-register accesses, but I don't remember who was working on
>>> this.
>>> 
>>> In this case, it looks like problem is that the last two
>>> instructions
>>> both define the same super-register, so the packetizer isn't
>>> putting
>>> them in the same packet.
>>> 
>>> So, I think the conclusion is that your patch increases the
>>> exposure of
>>> the R600 to an already present problem in the register allocator /
>>> scheduler.  I wonder if we could work around this by stripping all
>>> the
>>> implicit defs and uses off the instructions once scheduling and
>>> register
>>> allocation has completed.  They are technically incorrect, and I'm
>>> not
>>> sure if any other passes use them.  Do you have any suggestions?
>>> 
>>> Thanks,
>>> Tom
>>> 
>>> 
>> 
>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>> 
> 
> -- 
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140225/320c6081/attachment.html>