[LLVMdev] Modeling GPU vector registers, again (with my implementation)

Fri Feb 13 15:05:09 PST 2009

On Feb 13, 2009, at 9:47 AM, Alex wrote:

> It seems to me that LLVM's sub-register support does not fit the
> following hardware architecture.
>
> All instructions of the hardware are vector instructions. All
> registers contain four 32-bit FP sub-registers. They are called
> r0.x, r0.y, r0.z, r0.w.
>
> Most instructions write more than one element, in this way:
>
>   mul r0.xyw, r1, r2
>   add r0.z, r3, r4
>   sub r5, r0, r1
>
> Notice that the four elements of r0 are written by two different  
> instructions.
>
> My question is how I should model these sub-registers. If I treat
> each component as a register and do the register allocation
> individually, it seems very difficult to merge the scalar operations
> back into one vector operation.

Well, how many possible permutations are there? Is it possible to  
model each case as a separate physical register?

Evan
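
One rough way to size that question is the minimal Python sketch below. It
assumes that a "case" means one non-empty write mask over the x, y, z, w
components with no swizzling; that is an assumption, not something stated in
the thread. The helper name all_writemasks is hypothetical.

```python
from itertools import combinations

COMPONENTS = "xyzw"


def all_writemasks(components=COMPONENTS):
    """Return every non-empty write mask, with components kept in xyzw order."""
    masks = []
    for size in range(1, len(components) + 1):
        # combinations() preserves input order, so "xyw" is produced, never "wyx".
        for combo in combinations(components, size):
            masks.append("".join(combo))
    return masks


masks = all_writemasks()
print(len(masks))
print(masks)
```

Under that assumption there are 2^4 - 1 = 15 masks per vector register, so
modeling each case as its own physical register would take 15 definitions
per 4-element register.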

>   // each %reg is a sub-register
>   // r1, r2, r3, r4 here are virtual register numbers
>
>   mul %reg1024, r1, r2  // x
>   mul %reg1025, r1, r2  // y
>   mul %reg1026, r1, r2  // w
>
>   add %reg1027, r3, r4  // z
>
>   sub %reg1028, %reg1024, r1
>   sub %reg1029, %reg1025, r1
>   sub %reg1030, %reg1026, r1
>   sub %reg1031, %reg1027, r1
>
> So I decided to model each 4-element register as one Register in the
> *.td file.
>
> Here are the details.
>
> Since all 4 elements of a vector register occupy the same 'alloca',
> during the conversion of shader assembly to LLVM IR I check whether a
> vector register is written (to different elements) by different
> instructions. When the second write happens, I generate a
> shufflevector to multiplex the existing value and the new value, and
> store the result of the shufflevector.
>
> Input assembly language:
>   mul r0.xy, r1, r2
>   add r0.zw, r3, r4
>   sub r5, r0, r1
>
> is converted to LLVM IR:
>
>   %r0 = alloca <4 x float>
>   %mul_1 = mul <4 x float> %r1, %r2
>   store <4 x float> %mul_1, <4 x float>* %r0
>   ...
>   %add_1 = add <4 x float> %r3, %r4
>   ; a store does not immediately happen here
>   %load_1 = load <4 x float>* %r0
>
>   ; select the first two elements from the existing value,
>   ; the last two elements from the newly generated value
>   %merge_1 = shufflevector <4 x float> %load_1,
>                            <4 x float> %add_1,
>                            <4 x i32> < i32 0, i32 1, i32 6, i32 7 >
>
>   ; store the multiplexed value
>   store <4 x float> %merge_1, <4 x float>* %r0
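
The mask used in the shufflevector above can be derived mechanically from the
write mask of the second instruction. The following is a minimal sketch in
Python, not the converter described in the post; the function names
merge_shuffle_mask and shufflevector are hypothetical, and vectors are
modeled as plain lists.

```python
COMPONENTS = "xyzw"


def merge_shuffle_mask(writemask):
    """Build a shuffle mask that keeps the old value in unwritten lanes.

    Lane i selects index i (old value) when its component is not in the
    write mask, and index i + 4 (same lane of the new value) when it is.
    """
    width = len(COMPONENTS)
    return [lane + width if comp in writemask else lane
            for lane, comp in enumerate(COMPONENTS)]


def shufflevector(a, b, mask):
    """Model shufflevector: index i < len(a) picks a[i], else b[i - len(a)]."""
    n = len(a)
    return [a[i] if i < n else b[i - n] for i in mask]


# "add r0.zw, r3, r4" writes lanes z and w on top of the existing r0.
mask = merge_shuffle_mask("zw")
print(mask)
print(shufflevector(["mul.x", "mul.y", "mul.z", "mul.w"],
                    ["add.x", "add.y", "add.z", "add.w"], mask))
```

For the write mask "zw" this yields the indices 0, 1, 6, 7, matching the
shufflevector operand in the IR above.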
>
>
> After mem2reg:
>
>   %mul_1 = mul <4 x float> %r1, %r2
>   %add_1 = add <4 x float> %r3, %r4
>   %merge_1 = shufflevector <4 x float> %mul_1,
>                            <4 x float> %add_1,
>                            <4 x i32> < i32 0, i32 1, i32 6, i32 7 >
>
>
> After instruction selection:
>
>   MUL   %reg1024, %reg1025, %reg1026
>   ADD   %reg1027, %reg1028, %reg1029
>   MERGE %reg1030, %reg1024, "xy", %reg1027, "zw"
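
Going the other way, the two write-mask strings carried by MERGE can be
recovered from the shuffle mask, as long as every lane stays in place. This
is a hedged sketch in Python; the name mask_to_writemasks is hypothetical and
it is not the selection pattern used in the post.

```python
COMPONENTS = "xyzw"


def mask_to_writemasks(mask):
    """Split a lane-preserving shuffle mask into one write mask per source.

    Returns (mask_for_first_source, mask_for_second_source). Raises
    ValueError if any lane is moved, since a plain write mask cannot
    express a swizzle.
    """
    width = len(COMPONENTS)
    from_first, from_second = "", ""
    for lane, idx in enumerate(mask):
        if idx == lane:
            from_first += COMPONENTS[lane]
        elif idx == lane + width:
            from_second += COMPONENTS[lane]
        else:
            raise ValueError(f"lane {lane} is swizzled (index {idx})")
    return from_first, from_second


print(mask_to_writemasks([0, 1, 6, 7]))
```

A mask such as 1, 0, 6, 7 is rejected here because it swaps x and y; a real
target would need a separate way to handle that case.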
>
> The 'shufflevector' is selected to a MERGE instruction by the default
> LLVM instruction selector. The hardware doesn't have this instruction,
> so I have a *pre*-register-allocation FunctionPass to remember:
>
>   The physical register allocated to the destination register of MERGE
>   (%reg1030) should replace the physical registers allocated to the
>   destination registers of MUL (%reg1024) and ADD (%reg1027).
>
> In this way I ensure MUL and ADD write to the same physical register.
> This replacement is done in another FunctionPass *after* register
> allocation.
>
> MUL and ADD have an 'OptionalDefOperand' writemask. By default the
> writemask is "xyzw" (all elements are written).
>
>   // 0xF == all elements are written by default
>   def WRITEMASK : OptionalDefOperand<OtherVT, (ops i32imm), (ops (i32 0xF))>
>   {...}
>
>   def MUL : MyInst<(outs REG4X32:$dst),
>                    (ins  REG4X32:$src0, REG4X32:$src1, WRITEMASK:$wm),
>
> In the said post-register-allocation FunctionPass, in addition to
> replacing the destination registers as described before, the writemask
> ($wm) of each instruction is also replaced with the corresponding
> writemask operand of MERGE. So:
>
>   MUL   %R0, %R1, %R2, "xyzw"
>   ADD   %R5, %R3, %R4, "xyzw"
>   MERGE %R6, %R0, "xy", %R5, "zw"
>
> ==>
>
>   MUL   %R6, %R1, %R2, "xy"  // "xy" comes from MERGE operand 2
>   ADD   %R6, %R3, %R4, "zw"
>   // MERGE %R6, %R0, "xy", %R5, "zw" <== REMOVED
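
The rewrite above can be modeled on a toy instruction list. The Python sketch
below is not LLVM code and uses no LLVM API; the Inst class and fold_merges
are hypothetical names. It assumes each MERGE source register has exactly one
defining instruction and no other uses.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Inst:
    opcode: str
    dst: str
    srcs: tuple
    writemask: str = "xyzw"  # default: all elements are written


def fold_merges(insts):
    """Fold each MERGE into the instructions that define its two inputs.

    A MERGE is modeled as Inst("MERGE", dst, (src_a, mask_a, src_b, mask_b)).
    """
    # Record, for each merged source register, its new destination and mask.
    rewrites = {}
    for inst in insts:
        if inst.opcode == "MERGE":
            src_a, mask_a, src_b, mask_b = inst.srcs
            rewrites[src_a] = (inst.dst, mask_a)
            rewrites[src_b] = (inst.dst, mask_b)

    out = []
    for inst in insts:
        if inst.opcode == "MERGE":
            continue  # the hardware has no MERGE, so drop it
        if inst.dst in rewrites:
            new_dst, new_mask = rewrites[inst.dst]
            inst = replace(inst, dst=new_dst, writemask=new_mask)
        out.append(inst)
    return out


program = [
    Inst("MUL", "R0", ("R1", "R2")),
    Inst("ADD", "R5", ("R3", "R4")),
    Inst("MERGE", "R6", ("R0", "xy", "R5", "zw")),
]
for inst in fold_merges(program):
    print(f"{inst.opcode} {inst.dst}.{inst.writemask}, {', '.join(inst.srcs)}")
```

The sketch does not check that the two defining instructions may safely share
one physical register (for example, that nothing reads R0 or R5 in between);
a real pass would have to establish that.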
>
> Final machine code:
>
>   MUL r6.xy, r1, r2
>   ADD r6.zw, r3, r4
>   SUB r8, r6, r1
>
> I don't feel very comfortable with these two very ad-hoc FunctionPasses.
>
> Alex.
