[PATCH] D49671: [SchedModel] Propagate read advance cycles to implicit operands outside instruction descriptor

Mon Aug 20 10:18:02 PDT 2018

MatzeB added a comment.

In https://reviews.llvm.org/D49671#1173217, @jonpa wrote:

> In https://reviews.llvm.org/D49671#1172527, @MatzeB wrote:
>
> > This is tricky. Some comments:
> >
> > - Have you tried enabling subregister liveness tracking? Among other things it gets rid of the implicit-defs/uses for the full registers... (though there may be other factors influencing that decisions)
>
>
> Yes, IIRC I have tried that, but got crashes immediately which was discouraging. So for the moment, that is not something that could be the default for SystemZ, I think.
>
> > - What about just setting the latencies induced by the artifical implicit def-/uses[1] to 0? [1] = in lack of a better way to identify them, that would be all implicit vreg defs/uses that are not part of the MCInstrDesc.
>
> Yes, that was also my idea but as I wrote earlier in some rare cases I noticed instructions where the actual latency was only put on that extra regalloc operand, while the explicit use op had just a unit latency!
>
> I looked into this now a bit more, and it seems that in these cases a multiply or other instruction requires a double word register (128 bit), so a 64 bit register is coalesced into it:
>
>   Before Coalescing:
>  
>   16B       %0:gr64bit = LGFRL @seedi ::
>   128B      undef %5.subreg_l64:gr128bit = COPY %0:gr64bit
>   144B      %6:gr128bit = COPY %5:gr128bit
>   160B      %6:gr128bit = MLGR %6:gr128bit, %3:gr64bit
>  
>   After Coalescing:
>  
>   16B       undef %5.subreg_l64:gr128bit = LGFRL @seedi :: (dereferenceable load 4 from @seedi)
>   ...
>   144B      %6:gr128bit = COPY %5:gr128bit
>  
>   After RA:
>  
>   bb.0.entry:
>   renamable $r1d = LGFRL @seedi, implicit-def $r0q :: (dereferenceable load 4 from @seedi)
>   ...
>   renamable $r4q = COPY renamable $r0q
>  
>   After Post-RA pseudo instruction expansion pass:
>  
>   bb.0.entry:
>   renamable $r1d = LGFRL @seedi, implicit-def $r0q :: (dereferenceable load 4 from @seedi)
>   ...
>   $r4d = LGR $r0d, implicit $r0q
>   $r5d = LGR $r1d, implicit $r0q
>
>
> DAG has the latency on $r0q (superreg), instead of $r0d between SU(0) and SU(3).  ($r0q = $r0d + $r1d):
>
>   SU(0):   renamable $r1d = LGFRL @seedi, implicit-def $r0q :: (dereferenceable load 4 from @seedi)
>   # preds left       : 0
>   # succs left       : 9
>   # rdefs left       : 0
>   Latency            : 5
>   Depth              : 0
>   Height             : 70
>   Successors:
>   SU(4): Data Latency=5 Reg=$r1d
>   SU(4): Data Latency=5 Reg=$r0q
>   SU(3): Data Latency=5 Reg=$r0q
>   SU(3): Data Latency=1 Reg=$r0d
>   ...
>   SU(3):   $r4d = LGR $r0d, implicit $r0q
>   # preds left       : 2
>   # succs left       : 3
>   # rdefs left       : 0
>   Latency            : 1
>   Depth              : 5
>   Height             : 65
>   Predecessors:
>   SU(0): Data Latency=5 Reg=$r0q
>   SU(0): Data Latency=1 Reg=$r0d
>   ...
>   SU(4):   $r5d = LGR $r1d, implicit $r0q
>   # preds left       : 2
>   # succs left       : 3
>   # rdefs left       : 0
>   Latency            : 1
>   Depth              : 5
>   Height             : 65
>   Predecessors:
>   SU(0): Data Latency=5 Reg=$r1d
>   SU(0): Data Latency=5 Reg=$r0q
>   ...
>
>
> Seems like these are (rare) cases then where the defining instruction has an explicit def-op of a subregister, and a RegAlloc-implicit-def of the full register. The using instruction has an explicit use of the *other*
>  subreg, and an implicit use of the full register. The latency value is set only on the super-register (RegAlloc operand).

- The "implicit-def of the full register" is added when materializing the result of the regalloc in VirtRegMap.cpp, it is not present or necessary during regalloc itself.
- In your debug output I see:

  SU(0):   renamable $r1d = LGFRL @seedi, implicit-def $r0q :: (dereferenceable load 4 from @seedi)
  ...
  Successors:
  SU(4): Data Latency=5 Reg=$r1d
  SU(4): Data Latency=5 Reg=$r0q
  SU(3): Data Latency=5 Reg=$r0q
  SU(3): Data Latency=1 Reg=$r0d

Given that the actual instruction only writes to `r1d` I would argue that the latencies on `r0q` and `r0d` are "fake". Hence my proposal to ignore the extra operands during schedule dag construction or force their latency to zero. Or do you actually want a latency between SU(0) and SU(3) here?

> During DAG construction in computeOperandLatency(), when handling SU(0), I saw that
> 
> OperIdx 0: $r1d -> $r0q : Wlat0, Lat:5
>  OperIdx 2: $r0q -> $r0d : DefIdx = 1, but there is only one WriteLatencyEntry with the correct Value of 5! So the latency here becomes '1' instead. This is another example of how difficult these extra RA operands are to deal with (and it would be really ugly to have extra SchedWrites in the tablegen file just "in case regalloc decides to add one or a few more").
> 
> So, in short: we can't just set the latency on those regalloc operands to 1 whenever we want, because in these cases that would break the SchedModel. That said, this is extremely rare (35 cases out of 1.3 million) currently on SystemZ on SPEC, arising just in this scenario with a coalesced 128 bit register required by a particular instruction. So at least currently, that wouldn't probably matter if ignored... Still, maybe it would on other targets... Of course, this would currently be much better to do on SystemZ instead of the currently missing ReadAdvances...
> 
> I guess I wish that since the SchedModel has the quite intricate mapping of SchedWrites to operands (by means of ordering), it would hopefully end there, and not get disrupted with these extra operands... Defining a SchedModel to match the instruction definition operands is hard enough, and it doesn't work well to have to deal with extra implicit regalloc operands as well...
> 
> If we have to live with them on the MIs, perhaps we could make some decision to to not give them any latency values on the edges somehow, but to keep the latency values as defined by the tablegen files for the operands found there only?
> 
> That would probably be trading a "look-up" (this current patch), for another one, where an implicit operand not part of the MCInstrDesc would have to check for a subreg on the MI and get the latency from it... Not sure if that's even a good idea...

https://reviews.llvm.org/D49671