[LLVMdev] Modeling GPU vector registers, again (with my implementation)

Fri Feb 13 09:47:52 PST 2009

It seems to me that LLVM's sub-register support is not designed for the
following hardware architecture.

All instructions on this hardware are vector instructions. Each register
contains four 32-bit FP sub-registers. For r0 they are called r0.x, r0.y,
r0.z and r0.w.

Most instructions write more than one element, like this:

  mul r0.xyw, r1, r2
  add r0.z, r3, r4
  sub r5, r0, r1

Notice that the four elements of r0 are written by two different
instructions.
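
To make the write semantics concrete, here is a small simulation of the three
instructions above. It is only a sketch: the masked_write helper and the input
values are made up for illustration, and it assumes a writemask leaves the
unnamed lanes of the destination untouched.

```python
LANES = "xyzw"

def masked_write(old, new, writemask):
    """Return the register value after writing only the lanes in writemask."""
    return [n if lane in writemask else o
            for lane, o, n in zip(LANES, old, new)]

# Arbitrary example inputs (not taken from any real shader).
r1 = [1.0, 2.0, 3.0, 4.0]
r2 = [2.0, 2.0, 2.0, 2.0]
r3 = [10.0, 20.0, 30.0, 40.0]
r4 = [1.0, 1.0, 1.0, 1.0]

r0 = [0.0, 0.0, 0.0, 0.0]
r0 = masked_write(r0, [a * b for a, b in zip(r1, r2)], "xyw")  # mul r0.xyw, r1, r2
r0 = masked_write(r0, [a + b for a, b in zip(r3, r4)], "z")    # add r0.z, r3, r4
r5 = [a - b for a, b in zip(r0, r1)]                           # sub r5, r0, r1

print(r0)  # [2.0, 4.0, 31.0, 8.0]
print(r5)  # [1.0, 2.0, 28.0, 4.0]
```

The value that sub reads is only complete after both writes, which is why the
two defining instructions have to end up in the same physical register.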

My question is how I should model these sub-registers. If I treat each
component as a register and do the register allocation individually, it
seems very difficult to merge the scalar operations back into one vector
operation.

  // each %reg is a sub-register
  // r1, r2, r3, r4 here are virtual register numbers

  mul %reg1024, r1, r2  // x
  mul %reg1025, r1, r2  // y
  mul %reg1026, r1, r2  // w

  add %reg1027, r3, r4  // z

  sub %reg1028, %reg1024, r1
  sub %reg1029, %reg1025, r1
  sub %reg1030, %reg1026, r1
  sub %reg1031, %reg1027, r1

So I decided to model each 4-element register as one Register in the *.td
file.

Here are the details.

Since all four elements of a vector register occupy the same 'alloca',
during the conversion of shader assembly to LLVM IR, I check if a vector
register is written (to different elements) by different instructions. When
the second write happens, I generate a shufflevector to multiplex the
existing value and the new value, and store the result of shufflevector.

Input assembly language:
  mul r0.xy, r1, r2
  add r0.zw, r3, r4
  sub r5, r0, r1

is converted to LLVM IR:

  %r0 = alloca <4 x float>
  %mul_1 = mul <4 x float> %r1, %r2
  store <4 x float> %mul_1, <4 x float>* %r0
  ...
  %add_1 = add <4 x float> %r3, %r4
  ; a store does not immediately happen here
  %load_1 = load <4 x float>* %r0

  ; select the first two elements from the existing value,
  ; the last two elements from the newly generated value
  %merge_1 = shufflevector <4 x float> %load_1,
                           <4 x float> %add_1,
                           <4 x i32> < i32 0, i32 1, i32 6, i32 7 >

  ; store the multiplexed value
  store <4 x float> %merge_1, <4 x float>* %r0
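
The shufflevector mask follows mechanically from the writemask of the second
write. A minimal sketch of that computation, assuming the writemask is kept as
a string such as "zw" (the shuffle_mask helper is hypothetical, not part of
LLVM):

```python
LANES = "xyzw"

def shuffle_mask(new_writemask):
    """Build shufflevector indices merging an existing value (operand 0)
    with a newly computed value (operand 1).

    Indices 0..3 select lanes of the existing value, 4..7 select lanes of
    the new value. Lanes named in new_writemask come from the new value.
    """
    n = len(LANES)
    return [n + i if lane in new_writemask else i
            for i, lane in enumerate(LANES)]

print(shuffle_mask("zw"))  # [0, 1, 6, 7]
```

For the example above, "zw" yields the indices 0, 1, 6, 7 used in %merge_1.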

After mem2reg:

  %mul_1 = mul <4 x float> %r1, %r2
  %add_1 = add <4 x float> %r3, %r4
  %merge_1 = shufflevector <4 x float> %mul_1,
                           <4 x float> %add_1,
                           <4 x i32> < i32 0, i32 1, i32 6, i32 7 >

After instruction selection:

  MUL   %reg1024, %reg1025, %reg1026
  ADD   %reg1027, %reg1028, %reg1029
  MERGE %reg1030, %reg1024, "xy", %reg1027, "zw"

The 'shufflevector' is selected to a MERGE instruction by the default LLVM
instruction selector. The hardware doesn't have this instruction, so I have
a *pre*-register-allocation FunctionPass that records the following:

  The physical register allocated to the destination register of MERGE
  (%reg1030) should replace the physical registers allocated to the
  destination registers of MUL (%reg1024) and ADD (%reg1027).

In this way I ensure MUL and ADD write to the same physical register. The
replacement itself is done in a second FunctionPass that runs *after*
register allocation.
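
What the pre-register-allocation pass has to remember can be sketched as a
simple table. This is not LLVM code: the tuple encoding of instructions and
the collect_merge_hints name are assumptions made for illustration only.

```python
# Hypothetical encoding: (opcode, destination, operand, operand, ...).
# MERGE operands are (source register, writemask) pairs.
def collect_merge_hints(instrs):
    """Map each MERGE source vreg to (MERGE destination vreg, writemask)."""
    hints = {}
    for op, dst, *ops in instrs:
        if op == "MERGE":
            for src, mask in zip(ops[0::2], ops[1::2]):
                hints[src] = (dst, mask)
    return hints

code = [
    ("MUL", "%reg1024", "%reg1025", "%reg1026"),
    ("ADD", "%reg1027", "%reg1028", "%reg1029"),
    ("MERGE", "%reg1030", "%reg1024", "xy", "%reg1027", "zw"),
]
hints = collect_merge_hints(code)
print(hints["%reg1024"])  # ('%reg1030', 'xy')
print(hints["%reg1027"])  # ('%reg1030', 'zw')
```

After allocation, the second pass can look up each recorded virtual register
and substitute whatever physical register %reg1030 received.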

MUL and ADD have an 'OptionalDefOperand' writemask. By default the writemask
is "xyzw" (all elements are written).

  // 0xF == all elements are written by default
  def WRITEMASK : OptionalDefOperand<OtherVT, (ops i32imm), (ops (i32 0xF))>

  {...}

  def MUL : MyInst<(outs REG4X32:$dst),
                   (ins  REG4X32:$src0, REG4X32:$src1, WRITEMASK:$wm),
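
For reference, one possible mapping between the "xyzw" strings and the 4-bit
immediate is sketched below. The bit order (x in bit 0, w in bit 3) is an
assumption; the only value fixed by the definition above is 0xF for "all
elements". The helper names are hypothetical.

```python
LANES = "xyzw"  # assumed order: x = bit 0, y = bit 1, z = bit 2, w = bit 3

def encode_writemask(mask):
    """Turn a string such as "xy" into a 4-bit immediate."""
    bits = 0
    for lane in mask:
        bits |= 1 << LANES.index(lane)
    return bits

def decode_writemask(bits):
    """Turn a 4-bit immediate back into its "xyzw" string form."""
    return "".join(lane for i, lane in enumerate(LANES) if bits & (1 << i))

print(hex(encode_writemask("xyzw")))  # 0xf
print(decode_writemask(0xB))          # xyw
```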

In the post-register-allocation FunctionPass, in addition to replacing the
destination registers as described above, the writemask ($wm) of each
instruction is also replaced with the corresponding writemask operand of
MERGE. So:

  MUL   %R0, %R1, %R2, "xyzw"
  ADD   %R5, %R3, %R4, "xyzw"
  MERGE %R6, %R0, "xy", %R5, "zw"

==>

  MUL   %R6, %R1, %R2, "xy"  // "xy" comes from MERGE operand 2
  ADD   %R6, %R3, %R4, "zw"
  // MERGE %R6, %R0, "xy", %R5, "zw" <== REMOVED
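
The rewrite above can be sketched on a toy instruction list. Again this is
not LLVM code: the tuple encoding and the fold_merges name are hypothetical,
and the sketch assumes each register feeding a MERGE is defined exactly once
in the list and has no other readers.

```python
# Hypothetical encoding: (opcode, destination, operands..., writemask).
# MERGE operands are (source register, writemask) pairs.
def fold_merges(instrs):
    """Retarget the defs feeding a MERGE to its destination, then drop it."""
    rewrite = {}  # old destination -> (MERGE destination, writemask)
    for op, dst, *ops in instrs:
        if op == "MERGE":
            for src, mask in zip(ops[0::2], ops[1::2]):
                rewrite[src] = (dst, mask)

    out = []
    for op, dst, *ops in instrs:
        if op == "MERGE":
            continue  # the hardware has no such instruction
        if dst in rewrite:
            dst, mask = rewrite[dst]
            ops = ops[:-1] + [mask]  # the last operand is the writemask
        out.append((op, dst, *ops))
    return out

before = [
    ("MUL", "%R0", "%R1", "%R2", "xyzw"),
    ("ADD", "%R5", "%R3", "%R4", "xyzw"),
    ("MERGE", "%R6", "%R0", "xy", "%R5", "zw"),
]
after = fold_merges(before)
for inst in after:
    print(inst)
```

A real pass would also have to check that nothing reads %R0 or %R5 between
their definition and the MERGE, and that %R6 is not live across the
retargeted instructions.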

Final machine code:

  MUL r6.xy, r1, r2
  ADD r6.zw, r3, r4
  SUB r8, r6, r1

I don't feel very comfortable with these two very ad-hoc FunctionPasses.

Alex.