[PATCH] Hexagon Register Cleanup

Mon May 13 09:07:21 PDT 2013

Hi,
This is the Hexagon pass that was meant to address the complications 
related to implicit uses and defs of super- and sub-registers.

To clarify the situation for everybody:
Hexagon has 32 registers R0..R31, each 32 bits wide.  Certain 
instructions can do 64-bit calculations, and their operands are 64-bit 
register pairs (even-odd).  These pairs are usually written as D0..D15, 
but they are in fact the pairs R1:0, R3:2, R5:4, etc. (Hexagon is 
little-endian, hence the "reversed" notation).  It is not unusual to 
have the individual registers in a register pair be defined separately, 
and then used as a pair in another instruction, for example:
   R0 = ...
   R1 = ...
   ... = D0
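
The pair numbering above can be sketched in a few lines of Python.  This 
is only an illustration of the naming scheme, not code from the patch; 
the helper names (pair_subregs, pair_of, pair_notation) are made up for 
this example.

```python
def pair_subregs(d):
    """Return the (low, high) 32-bit register names that make up pair D<d>."""
    if not 0 <= d <= 15:
        raise ValueError("pairs are numbered D0..D15")
    return (f"R{2 * d}", f"R{2 * d + 1}")


def pair_of(r):
    """Return the name of the pair that contains 32-bit register R<r>."""
    if not 0 <= r <= 31:
        raise ValueError("registers are numbered R0..R31")
    return f"D{r // 2}"


def pair_notation(d):
    """Return the 'reversed' high:low spelling of pair D<d>, e.g. R1:0."""
    if not 0 <= d <= 15:
        raise ValueError("pairs are numbered D0..D15")
    return f"R{2 * d + 1}:{2 * d}"
```

With these helpers, pair_subregs(1) gives ("R2", "R3"), and 
pair_notation(2) gives "R5:4".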

This introduces certain complications with the current register 
allocation.  The problem is that the register rewriter will add implicit 
uses and implicit defs of super-registers when a sub-register is used or 
defined.  For example:
   %vreg1:subreg_loreg = COPY %vreg2:subreg_loreg
   %vreg1:subreg_hireg = COPY %vreg2:subreg_hireg
assuming that vreg1 becomes D0, and vreg2 becomes D1, would become
   %R0<def> = COPY %R2<use>, %D0<imp-def>, %D1<imp-use>
   %R1<def> = COPY %R3<use>, %D0<imp-def>, %D1<imp-use>

Hexagon is a VLIW machine: instructions are grouped into packets, and 
each packet is executed as a unit (all instructions within a packet are 
executed in parallel, subject to certain limitations).  For performance 
it is much better to pack as many instructions into a packet as possible 
(the architecture limit is 4), instead of having more packets with fewer 
instructions.

One restriction is that there cannot be any dependencies between 
instructions in a packet.  For the example above, the packetizer would 
be unable to put the two COPY instructions in the same packet, even 
though, from the architecture point of view, there are no dependencies 
and they can execute in parallel.  The reason is that D0 appears to be 
defined in both instructions (hence they cannot be parallelized).
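
To make the false dependence concrete, here is a minimal sketch in 
Python.  It is a toy model, not the packetizer: the Operand tuple and 
the functions units, depends and strip_implicit_pairs are hypothetical, 
and "dependence" is reduced to "two operands overlap and at least one 
of them is a def".

```python
from collections import namedtuple

# Hypothetical operand model: register name, def-or-use, implicit-or-explicit.
Operand = namedtuple("Operand", "reg is_def implicit")


def units(reg):
    """Expand a register name into the set of 32-bit register numbers it covers."""
    n = int(reg[1:])
    return {2 * n, 2 * n + 1} if reg[0] == "D" else {n}


def depends(a, b):
    """True if some operand of a overlaps some operand of b and one of them is a def."""
    return any(
        (x.is_def or y.is_def) and units(x.reg) & units(y.reg)
        for x in a
        for y in b
    )


def strip_implicit_pairs(instr):
    """Drop the implicit pair operands (the ones the rewriter added)."""
    return [op for op in instr if not (op.implicit and op.reg[0] == "D")]


# The two COPY instructions as they come out of the rewriter.
copy_lo = [Operand("R0", True, False), Operand("R2", False, False),
           Operand("D0", True, True), Operand("D1", False, True)]
copy_hi = [Operand("R1", True, False), Operand("R3", False, False),
           Operand("D0", True, True), Operand("D1", False, True)]
```

In this model, depends(copy_lo, copy_hi) is true because both 
instructions carry an implicit def of D0.  Once the implicit pair 
operands are stripped, the two instructions no longer conflict.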

This pass tries to solve this problem (and related issues) by shifting 
the liveness tracking from super-registers to sub-registers.  It does so 
by marking all explicit uses and defs of register pairs as "undef", and 
adding implicit uses and defs of the 32-bit components.  In addition, it 
removes the "extra" implicit uses and defs of super-registers (i.e. 
register pairs) that were added by the rewriter.  So, the above example 
would become
   %R0<def> = COPY %R2<use>
   %R1<def> = COPY %R3<use>
If we had an instruction that actually uses register pairs, such as
   %D0<def> = ADD64_rr %D1<use>, %D2<use>
it would be processed to look like this:
   %D0<def,undef> = ADD64_rr %D1<use,undef>, %D2<use,undef>,
                             %R0<imp-def>, %R1<imp-def>,  // D0 = ...
                             %R2<imp-use>, %R3<imp-use>,  // ... = D1
                             %R4<imp-use>, %R5<imp-use>   // ... = D2
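
The operand rewriting described above can be sketched as follows.  This 
is a simplified Python model, not the code in the patch: the Operand 
class and the cleanup function are hypothetical, and the sketch assumes 
that every implicit pair operand on an instruction was added by the 
rewriter and can be dropped.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Operand:
    reg: str                # "R0".."R31" or "D0".."D15"
    is_def: bool            # True for a def, False for a use
    implicit: bool = False
    undef: bool = False


def subregs(pair):
    """Names of the two 32-bit halves of a pair, low half first."""
    n = int(pair[1:])
    return [f"R{2 * n}", f"R{2 * n + 1}"]


def cleanup(operands):
    """Shift tracking from pairs to their 32-bit halves for one instruction."""
    kept, added = [], []
    for op in operands:
        is_pair = op.reg.startswith("D")
        if is_pair and op.implicit:
            continue  # assumed to be a rewriter-added super-register operand
        if is_pair:
            # Explicit pair: flag it "undef" and track the halves implicitly.
            kept.append(replace(op, undef=True))
            added.extend(Operand(r, op.is_def, implicit=True)
                         for r in subregs(op.reg))
        else:
            kept.append(op)
    return kept + added
```

Applied to the ADD64_rr example, cleanup keeps the three pair operands 
(now flagged "undef") and appends the six implicit 32-bit operands in 
the order shown above.  Applied to one of the COPY instructions, it 
leaves only the two explicit 32-bit operands.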

The intent here is to mark the pairs as "undef" and thus remove them 
from dependence analysis.  The catch is that dependence analysis does 
not skip "undef" operands by default, so when this transformation is 
enabled, it also forces the dependence analysis to ignore "undef" 
registers.  This is done using debug flags so that other targets are 
unaffected.

After this transformation, a former anti-dependence on a single 
register (a register pair) becomes an anti-dependence on two 32-bit 
registers, so the existing anti-dependence breaking algorithm no longer 
works in such cases.  The problem is that both sub-registers need to be 
renamed in such a way that they remain in a "pair" relationship, e.g. 
R1:0 could become R5:4, but not just any two 32-bit registers.  To 
address this problem, there is an "anti-dependence" part in the HRC 
pass.
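
The pairing constraint can be illustrated with a small Python sketch.  
It is not the algorithm from the patch: find_free_pair and rename_pair 
are hypothetical names, and availability is reduced to a plain set of 
busy 32-bit register numbers, ignoring reserved registers and live 
ranges.

```python
def find_free_pair(busy, num_pairs=16):
    """Return the index d of the first pair D<d> whose two halves are both free."""
    for d in range(num_pairs):
        if 2 * d not in busy and 2 * d + 1 not in busy:
            return d
    return None  # no whole pair is available


def rename_pair(old_pair, busy):
    """Map both halves of D<old_pair> onto a free pair, keeping them paired.

    Returns {old_reg_number: new_reg_number}, or None if no pair is free.
    """
    new_pair = find_free_pair(busy)
    if new_pair is None:
        return None
    return {
        2 * old_pair: 2 * new_pair,          # low half stays the low half
        2 * old_pair + 1: 2 * new_pair + 1,  # high half stays the high half
    }
```

For example, with R0, R1, R2 and R5 busy, R4 on its own is free but 
R5:4 is not usable as a pair, so rename_pair(0, {0, 1, 2, 5}) maps R1:0 
onto R7:6.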

The whole transformation is divided into 3 stages:
1. "Finalize RA", where corrective actions are taken to address some 
undesirable outputs from the rewriter (see below).
2. "Anti-dep HRC", where the bulk of the work happens, i.e. putting the 
"undef" flag, and rewriting anti-dependencies on register pairs.
3. "Finalize", where the hijacking of "undef" ends, and the explicit 
register pairs become "legitimate def/use" again.

The issue with the rewriter mentioned above is that it will spill an 
entire 64-bit register, even when only a part of it was explicitly 
defined.  Normally, the whole 64-bit register would be "implicitly 
defined", as per the usual rewriter treatment, but since we are trying 
to track the sub-registers, we may end up with a store of R1:0, where 
only R0 was actually defined.  To address this, we simply add a 
definition of R1 to "complete" the definition of R1:0, so that it can be 
spilled as a whole.  This injects a bit of inefficiency, since we 
actually add an extra instruction, but overall this is still profitable 
for us.
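
The check behind this fix-up can be sketched in Python as below.  The 
function name halves_to_define is hypothetical, and the sketch only 
reports which 32-bit halves lack a definition; what instruction is 
actually inserted to define them is not modeled here.

```python
def halves_to_define(defined, pair):
    """Return the 32-bit register numbers of D<pair> that still need a def.

    `defined` is the set of 32-bit register numbers that have a definition
    reaching the spill of the pair.
    """
    halves = (2 * pair, 2 * pair + 1)
    return [r for r in halves if r not in defined]
```

For a spill of R1:0 where only R0 was defined, 
halves_to_define({0}, 0) returns [1], i.e. R1 needs an added 
definition.  When both halves are already defined the list is empty and 
nothing is inserted.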

This pass is written to be transparent to any other targets.  The only 
globally-visible change would be printing of the "undef" flag on 
MachineInstr operands.  The ignoring of the "undef" registers in 
dependence analysis should only happen on Hexagon, and only when HRC is 
enabled.

Please let me know if you have any comments.

Thanks,
-Krzysztof

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, 
hosted by The Linux Foundation
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Hexagon-Register-Cleanup.patch
Type: text/x-patch
Size: 78972 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20130513/fc50e793/attachment.bin>