[LLVMdev] ScheduleDAGInstrs/R600 test potential issue with implicit defs

Tue Feb 25 10:28:29 PST 2014

I should probably mention that with those two patches all the LLVM tests 
are successful now for me, but it might be a good idea to adapt the 
patch to check if the removed operands are actually super-registers, in 
case the R600 backend uses implicit registers for other purposes as well.

Kind regards,
  Stefan

On 2014-02-25 19:22, Stefan Hepp wrote:
> Hi Tom,
>
> Thanks a lot for your explanations, now it makes a lot more sense ;)
>
> I had a slightly closer look at the R600 packetizer, and the issue is
> that the third LSHL instruction has both an implicit use and
> *afterwards* an implicit def of T1_XYZW. The latter def causes the
> current ScheduleDAGInstrs implementation to ignore the implicit use,
> thus the ScheduleDAG only contains an anti-dependency from the second to
> the third LSHL and the packetizer can bundle the instructions.
>
> If the order of the implicit-defs and implicit-use would be different
> (e.g., like TableGen adds them), or if I apply my patch so that the
> implicit-use is not ignored, there is a true dependency edge between
> those two instructions, preventing bundling.
>
> Removing only the implicit-defs alone has no effect in this example, as
> then the implicit-use of T1_XYZW is then no longer removed and a true
> dependency to the def of T1_W is added. Removing all implicit operands
> however solves that (see attached patch/hack).
>
> I would argue that the behaviour of ScheduleDAGInstrs::buildSchedGraph
> for implicit operands is not correct, simply because changing the order
> of the operands would already prevent the R600 packetizer from bundling
> in this example.
>
> Cheers,
>   Stefan
>
> On 2014-02-25 17:35, Tom Stellard wrote:
>> On Tue, Feb 25, 2014 at 04:38:46PM +0100, Stefan Hepp wrote:
>>> Hello,
>>>
>>> The ScheduleDAGInstrs::buildSchedGraph() function creates def/uses
>>> lists by iterating over all instruction operands and calls
>>> addPhysRegDeps() if used post-RA (line ~770 ff.). If an operand is a
>>> def, the uses of that registers are cleared
>>> (ScheduleDAGInstrs.cpp:333:  Uses.eraseAll(Reg); ).
>>>
>>> As a consequence, if an instruction has an explicit use of a register
>>> and an implicit def of the same register, the implicit def causes the
>>> use to be removed. In our case this leads to missing scheduling
>>> dependencies for indirect calls where a return register is used as
>>> operand of the indirect call. Therefore I patched the
>>> buildSchedGraph() function to iterate over all defs first, then over
>>> all uses in a second iteration (maybe not the most efficient
>>> solution, but it solves the issue).
>>> I am attaching the corresponding patch.
>>>
>>> However, while upgrading our code to LLVM 3.4, I found that the
>>> test/CodeGen/R600/store.ll test fails for -mcpu=redwood due to this
>>> change. The difference is in the following line in @store_i8: With my
>>> patch applied I get:
>>>
>>>      LSHL                  * T0.W, PV.W, literal.x,
>>>    3(4.203895e-45), 0(0.000000e+00)
>>>      LSHL                  * T1.X, T1.W, PV.W,
>>>      LSHL                  * T1.W, literal.x, T0.W,
>>>    255(3.573311e-43), 0(0.000000e+00)
>>>
>>> versus the orginal output of LLVM 3.4:
>>>
>>>      LSHL                  * T0.W, PV.W, literal.x,
>>>    3(4.203895e-45), 0(0.000000e+00)
>>>      LSHL                    T1.X, T1.W, PV.W,
>>>      LSHL                  * T1.W, literal.x, PV.W,
>>>    255(3.573311e-43), 0(0.000000e+00)
>>>
>>> The difference is the '*' in the second LSHL and the T0.W versus PV.W
>>> in the third LSHL.
>>>
>>> Looking at the output of -print-after-all, llc generates the
>>> following MC in both plain LLVM 3.4 and LLVM 3.4 with my patch
>>> applied right after the "Finalize machine instruction bundles" pass:
>>>
>>>      %T0_W<def> = LSHL_eg 0, 0, 1, 0, 0, 0, %T0_W<kill>, 0, 0, 0, -1,
>>> %ALU_LITERAL_X, 0, 0, 0, -1, 1, pred:%PRED_SEL_OFF, 3, 0
>>>      %T1_X<def> = LSHL_eg 0, 0, 1, 0, 0, 0, %T1_W<kill>, 0, 0, 0, -1,
>>> %T0_W, 0, 0, 0, -1, 1, pred:%PRED_SEL_OFF, 0, 0, %T1_XYZW<imp-def>
>>>      %T1_W<def> = LSHL_eg 0, 0, 1, 0, 0, 0, %ALU_LITERAL_X, 0, 0, 0,
>>> -1, %T0_W<kill>, 0, 0, 0, -1, 1, pred:%PRED_SEL_OFF, 255, 0,
>>> %T1_XYZW<imp-use,kill>, %T1_XYZW<imp-def>
>>>
>>> Hence, my patch affects either the R600 Packetizer or the R600
>>> Control Flow Finalizer pass. A closer look reveals that the 17th
>>> operand (the one before the 'pred:$PRED_SEL_OFF' operand) controls
>>> the '*' output and is actually set for all three LSHL operands before
>>> the Packetizer/Control Flow Finalizer pass, and from a quick glance
>>> at the AMD ISA docs I gather that 'PV' versus 'T0' seems to be
>>> related to this, but the packetizer (?) clears this flag for the
>>> second LSHR for redwood/VLIW5 GPUs.
>>>
>>>
>>> Now, here is my question: Is the R600/store.ll test actually correct
>>> (I am not yet familiar enough with the R600 ISA to tell), or does the
>>> R600 backend rely on the fact that implicit defs kill the uses of
>>> registers, or is there a bug/.. in the R600 backend that counteracts
>>> the implicit-defs scheduling issue?
>>> And is removing the explicit use by an implicit def at an instruction
>>> a bug or a feature?
>>>
>>> If this is an actual bug in the ScheduleDAGInstrs implementation, I
>>> would like to adapt my patch so that it can be applied (i.e., include
>>> a fix for the store.ll test), although there might be a better way of
>>> doing this (e.g., only erase uses that are not from the same SU, or
>>> iterate over defs, implicit-defs, uses, implicit-uses instead of
>>> iterating over all operands twice), and I would need someone more
>>> familiar with the R600 backend to check if the new output of LLC for
>>> store.ll is actually correct or not.
>>>
>>
>> Hi Stefan,
>>
>> Thanks for the detailed analysis.  The '*' marks the end of a packet, so
>> with your patch, the sequence uses 3 packets, and it uses only 2 with
>> LLVM 3.4.  So, this is a performance regression.
>>
>> The root of this problem is that LLVM assumes it is impossible to
>> concurrently access two sub-registers of the same super-register, so
>> it creates a scheduling dependency between the two sub-register defs.
>>
>> This is a problem with R600's VLIW architecture, because we want to be
>> able to access all four sub-registers of the same packet, but scheduling
>> dependencies prevent this.
>>
>> This is why there are implicit uses and defs of the T*_XYZW registers
>> added to these instructions.  Some of these are added by LLVM passes
>> and I think some are added by the R600MachineScheduler to work around
>> the issue described (I'm not that familiar with the scheduler, so I may
>> be wrong about this).
>>
>> I think there was an effort recently to try allow concurrent
>> sub-register accesses, but I don't remember who was working on this.
>>
>> In this case, it looks like problem is that the last two instructions
>> both define the same super-register, so the packetizer isn't putting
>> them in the same packet.
>>
>> So, I think the conclusion is that your patch increases the exposure of
>> the R600 to an already present problem in the register allocator /
>> scheduler.  I wonder if we could work around this by stripping all the
>> implicit defs and uses off the instructions once scheduling and register
>> allocation has completed.  They are technically incorrect, and I'm not
>> sure if any other passes use them.  Do you have any suggestions?
>>
>> Thanks,
>> Tom
>>
>>
>