[LLVMdev] Modeling GPU vector registers, again (with my implementation)
Evan Cheng
evan.cheng at apple.com
Fri Feb 13 15:05:09 PST 2009
On Feb 13, 2009, at 9:47 AM, Alex wrote:
> It seems to me that LLVM sub-register is not for the following
> hardware architecture.
>
> All instructions of the hardware are vector instructions. Each
> register contains four 32-bit FP sub-registers, called r0.x, r0.y,
> r0.z, and r0.w.
>
> Most instructions write more than one element, like this:
>
> mul r0.xyw, r1, r2
> add r0.z, r3, r4
> sub r5, r0, r1
>
> Notice that the four elements of r0 are written by two different
> instructions.
>
> My question is how I should model these sub-registers. If I treat
> each component as a register and do the register allocation
> individually, it seems very difficult to merge the scalar
> operations back into one vector operation.
Well, how many possible permutations are there? Is it possible to
model each case as a separate physical register?
Evan
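For a sense of scale on the question above, here is a minimal sketch that enumerates the cases, assuming "permutations" means destination write masks only (no source swizzles) and that components always appear in xyzw order. The helper name `write_masks` is hypothetical and not part of LLVM.

```python
from itertools import combinations

COMPONENTS = "xyzw"


def write_masks(components=COMPONENTS):
    """Return every non-empty write mask, with components kept in xyzw order."""
    masks = []
    for size in range(1, len(components) + 1):
        # combinations() preserves the input order, so "xyw" is produced
        # but "wyx" is not.
        for combo in combinations(components, size):
            masks.append("".join(combo))
    return masks
```

Under those assumptions there are 15 masks per 4-component register, so modeling each case separately would mean on the order of 15 register names per hardware vector register.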
> // each %reg is a sub-register
> // r1, r2, r3, r4 here are virtual register number
>
> mul %reg1024, r1, r2 // x
> mul %reg1025, r1, r2 // y
> mul %reg1026, r1, r2 // w
>
> add %reg1027, r3, r4 // z
>
> sub %reg1028, %reg1024, r1
> sub %reg1029, %reg1025, r1
> sub %reg1030, %reg1026, r1
> sub %reg1031, %reg1027, r1
>
> So I decided to model each 4-element register as one Register in
> the *.td file.
>
> Here are the details.
>
> Since all four elements of a vector register occupy the same
> 'alloca', during the conversion of shader assembly to LLVM IR I
> check whether a vector register is written (to different elements)
> by different instructions. When the second write happens, I
> generate a shufflevector to multiplex the existing value and the
> new value, and store the result of the shufflevector.
>
> Input assembly language:
> mul r0.xy, r1, r2
> add r0.zw, r3, r4
> sub r5, r0, r1
>
> is converted to LLVM IR:
>
> %r0 = alloca <4 x float>
> %mul_1 = mul <4 x float> %r1, %r2
> store <4 x float> %mul_1, <4 x float>* %r0
> ...
> %add_1 = add <4 x float> %r3, %r4
> ; a store does not immediately happen here
> %load_1 = load <4 x float>* %r0
>
> ; select the first two elements from the existing value,
> ; the last two elements from the newly generated value
> %merge_1 = shufflevector <4 x float> %load_1,
>                          <4 x float> %add_1,
>                          <4 x i32> < i32 0, i32 1, i32 6, i32 7 >
>
> ; store the multiplexed value
> store <4 x float> %merge_1, <4 x float>* %r0
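The mask used in the shufflevector above follows from the write mask of the second write. As a minimal sketch in plain Python (not LLVM API; the names `merge_mask` and `shufflevector` are hypothetical helpers for illustration): shuffle indices 0-3 select lanes of the first operand and indices 4-7 select lanes of the second, so a lane written by the new instruction gets its index plus 4.

```python
COMPONENTS = "xyzw"


def merge_mask(new_write_mask):
    """Build a shuffle mask that keeps the old value's lanes, except for the
    lanes named in new_write_mask, which are taken from the second operand."""
    width = len(COMPONENTS)
    return [i + width if c in new_write_mask else i
            for i, c in enumerate(COMPONENTS)]


def shufflevector(a, b, mask):
    """Model the lane selection: index into the concatenation of a and b."""
    concat = list(a) + list(b)
    return [concat[i] for i in mask]
```

For the example in this message, `merge_mask("zw")` gives `[0, 1, 6, 7]`, matching the `<4 x i32>` constant in the IR.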
>
>
> After mem2reg:
>
> %mul_1 = mul <4 x float> %r1, %r2
> %add_1 = add <4 x float> %r3, %r4
> %merge_1 = shufflevector <4 x float> %mul_1,
>                          <4 x float> %add_1,
>                          <4 x i32> < i32 0, i32 1, i32 6, i32 7 >
>
>
> After instruction selection:
>
> MUL %reg1024, %reg1025, %reg1026
> ADD %reg1027, %reg1028, %reg1029
> MERGE %reg1030, %reg1024, "xy", %reg1027, "zw"
>
> The 'shufflevector' is selected to a MERGE instruction by the
> default LLVM instruction selector. The hardware doesn't have this
> instruction, so I have a *pre*-register allocation FunctionPass
> to remember the following:
>
> The physical register allocated to the destination register of MERGE
> (%reg1030) should replace the physical registers allocated to the
> destination registers of MUL (%reg1024) and ADD (%reg1027).
>
> In this way I ensure MUL and ADD write to the same physical
> register. This replacement is done in another FunctionPass *after*
> register allocation.
>
> MUL and ADD have an 'OptionalDefOperand' writemask. By default the
> writemask is "xyzw" (all elements are written).
>
> // 0xF == all elements are written by default
> def WRITEMASK : OptionalDefOperand<OtherVT, (ops i32imm),
>                                    (ops (i32 0xF))>
> {...}
>
> def MUL : MyInst<(outs REG4X32:$dst),
> (ins REG4X32:$src0, REG4X32:$src1, WRITEMASK:$wm),
>
> In the said post-register-allocation FunctionPass, in addition to
> replacing the destination registers as described before, the
> writemask ($wm) of each instruction is also replaced with the
> corresponding writemask operand of MERGE. So:
>
> MUL %R0, %R1, %R2, "xyzw"
> ADD %R5, %R3, %R4, "xyzw"
> MERGE %R6, %R0, "xy", %R5, "zw"
>
> ==>
>
> MUL %R6, %R1, %R2, "xy" // "xy" comes from MERGE operand 2
> ADD %R6, %R3, %R4, "zw"
> // MERGE %R6, %R0, "xy", %R5, "zw" <== REMOVED
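The rewrite shown above can be sketched on a toy instruction list. This is only an illustration in plain Python, not the pass from the message and not LLVM's MachineInstr API; the `Inst` class and `fold_merges` function are hypothetical. It assumes each MERGE source has exactly one defining instruction and is used only by that MERGE.

```python
from dataclasses import dataclass


@dataclass
class Inst:
    opcode: str
    dst: str
    srcs: list
    writemask: str = "xyzw"  # default: all elements are written


def fold_merges(insts):
    """Fold each MERGE into the instructions that define its sources.

    A MERGE is modeled as Inst("MERGE", dst, [src0, mask0, src1, mask1]).
    """
    # Step 1: for every MERGE source, remember the final destination
    # register and the write mask it should end up with.
    rewrite = {}
    for inst in insts:
        if inst.opcode == "MERGE":
            src0, mask0, src1, mask1 = inst.srcs
            rewrite[src0] = (inst.dst, mask0)
            rewrite[src1] = (inst.dst, mask1)

    # Step 2: retarget the defining instructions and drop the MERGEs.
    out = []
    for inst in insts:
        if inst.opcode == "MERGE":
            continue
        if inst.dst in rewrite:
            dst, mask = rewrite[inst.dst]
            inst = Inst(inst.opcode, dst, list(inst.srcs), mask)
        out.append(inst)
    return out
```

Running it on the three instructions above (MUL to R0, ADD to R5, MERGE into R6) leaves a MUL writing R6 with mask "xy" and an ADD writing R6 with mask "zw". A real pass would also have to handle other uses of R0 and R5, which this sketch ignores.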
>
> Final machine code:
>
> MUL r6.xy, r1, r2
> ADD r6.zw, r3, r4
> SUB r8, r6, r1
>
> I don't feel very comfortable with these two very ad-hoc FunctionPasses.
>
> Alex.