[LLVMdev] Register based vector insert/extract

Mon Apr 23 14:22:27 PDT 2007

On Mon, 23 Apr 2007, Christopher Lamb wrote:
>>> The issue I'm having is that there is no extract/insert
>>> instruction in the ISA, it's simply based on using subregister
>>> operands in subsequent/preliminary instructions. At the pointer of
>>> custom lowering register allocation has not yet been done, so I
>>> don't have a way to communicate the dependency.

Ok.

>> If I have a register v4r0 with subregisters {r0, r1, r2, r3} and a
>> DAG that looks like
>>
>> load v4si <- extract_element 2 <- add -> load i32
>>
>> I'd like to be able to generate
>>
>> load v4r0
>> load r10
>> add r11, r10, r2 <== subregister 2 of v4r0

Nice ISA.  That is entirely too logical. :)

We have a similar problem on X86.  In particular, an integer truncate or 
an extend (e.g. i16 -> i8) wants to make use of subregisters.  Consider 
code like this:

   t1 = load i16
   t2 = truncate i16 t1 to i8
   t3 = add i8 t2, 42

What we would really want to generate is something like this at the 
machine instr level:

   r1024 = X86_LOADi16 ...     ;; r1024 is i16
   r1026 = ADDi8 r1024[subreg #0], 42

More specifically, we want to be able to define, for each register class, 
a set of subregister classes.  In the X86 world, the 64-bit register 
classes could have subregclass0 = i8 parts, subregclass1 = i16 parts, 
subregclass2 = i32 parts.  Each <physreg, subreg#> pair should map to 
another physreg (e.g. <RAX,1> -> AX).

The idea of this is that the register allocator allocates registers like 
normal, but when it does the rewriting pass, when it replaces vregs with 
pregs (e.g. r1024 with CX in this example), it rewrites r1024[subreg0] 
with CL instead of CX.  This would give us this code:

   CX = X86_LOADi16 ...
   DL = ADDi8 CL, 42

In your case, you'd define your vector register class with 4 subregs, one 
for each piece.

Unfortunately, none of this exists yet :(.  To handle truncates and 
extends on X86, we currently emulate this by generating machineinstrs 
like:

   r1024 = X86_LOADi16 ...
   r1025 = TRUNCATE_i16_to_i8 r1024
   r1026 = ADDi8 r1025, 42

In the asmprinter, we print TRUNCATE_i16_to_i8 as a commented out noop if 
the register allocator happens to allocate 1024 and 1025 to the same 
register.  If not, it uses an asmprinter hack to print this as a copy 
instruction.  This is horrible, and doesn't produce good code.  OTOH, 
before Evan improved this, we always copied into AX and out of AL for each 
i16->i8 truncate, which was much worse :)

> I see that Evan has added getSubRegisters()/getSuperRegisters() to
> MRegisterInfo. This is what's needed in order to implement the
> register allocation constraint, but there's no way yet to pass the
> constraint through the operands from the DAG. There would need to be
> some way to specify that the SDOperand is referencing a subvalue of
> the produced value (perhaps a subclass of SDOperand?). This would
> allow the register allocator to try to use the sub/super register
> sets to perform the instert/extract.

Right.  Evan is currently focusing on getting the late stages of the code 
generator (e.g. livevars) to be able to understand arbitrary machine 
instrs in the face of physreg subregs.  This lays the groundwork for 
handling vreg subregs, but won't solve it directly.

> Is any of this kind of work planned? The addition of those
> MRegisterInfo functions has me curious...

This is on our mid-term plan, which means we'll probably tackle it over 
the next year or so, but we don't have any concrete plans in the immediate 
future.  If you are interested, this should be a pretty reasonable project 
that will give you a chance to become more familiar with various pieces of 
the early code generator. :)

-Chris

-- 
http://nondot.org/sabre/
http://llvm.org/