It seems to me that LLVM's sub-register support is not designed for the following hardware architecture.<br><br>All instructions of the hardware are vector instructions. Every register contains<br>four 32-bit FP sub-registers. For r0 they are called r0.x, r0.y, r0.z, r0.w.<br>
<br>Most instructions write more than one element, in this way:<br><br> mul r0.xyw, r1, r2<br> add r0.z, r3, r4<br> sub r5, r0, r1<br> <br>Notice that the four elements of r0 are written by two different instructions. <br>
<br>My question is how I should model these sub-registers. If I treat each component<br>as a register and do the register allocation individually, it seems very <br>difficult to merge the scalar operations back into one vector operation. <br>
<br> // each %reg is a sub-register<br> // r1, r2, r3, r4 here are virtual register numbers<br> <br> mul %reg1024, r1, r2 // x<br> mul %reg1025, r1, r2 // y<br> mul %reg1026, r1, r2 // w<br> <br> add %reg1027, r3, r4 // z<br>
<br> sub %reg1028, %reg1024, r1 <br> sub %reg1029, %reg1025, r1<br> sub %reg1030, %reg1026, r1<br> sub %reg1031, %reg1027, r1<br><br>So I decided to model each 4-element register as one Register in the *.td file.<br><br>
Here are the details.<br><br>Since all four elements of a vector register occupy the same 'alloca', <br>during the conversion of shader assembly to LLVM IR I check whether a vector <br>register is written (to different elements) by different instructions. When <br>
the second write happens, I generate a shufflevector to multiplex the <br>existing value and the new value, and store the result of shufflevector.<br><br>Input assembly language:<br> mul r0.xy, r1, r2<br> add r0.zw, r3, r4<br>
sub r5, r0, r1<br> <br>is converted to LLVM IR: <br><br> %r0 = alloca <4 x float><br> %mul_1 = mul <4 x float> %r1, %r2<br> store <4 x float> %mul_1, <4 x float>* %r0<br> ...<br> %add_1 = add <4 x float> %r3, %r4<br>
; a store does not immediately happen here<br> %load_1 = load <4 x float>* %r0 <br> <br> ; select the first two elements from the existing value,<br> ; the last two elements from the newly generated value <br>
%merge_1 = shufflevector <4 x float> %load_1, <br> <4 x float> %add_1, <br> <4 x i32> < i32 0, i32 1, i32 6, i32 7 ><br> <br>
; store the multiplexed value <br> store <4 x float> %merge_1, <4 x float>* %r0<br> <br><br>After mem2reg:<br><br> %mul_1 = mul <4 x float> %r1, %r2<br> %add_1 = add <4 x float> %r3, %r4<br>
%merge_1 = shufflevector <4 x float> %mul_1, <br> <4 x float> %add_1, <br> <4 x i32> < i32 0, i32 1, i32 6, i32 7 ><br> <br>
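To make the shuffle mask concrete, here is a minimal Python model of the merge step. It is only a sketch: the helper names `shuffle_mask_for_write` and `shufflevector` are made up for illustration, plain lists stand in for `<4 x float>` values, and the second function only mimics the lane-selection rule of the IR instruction (indices 0..3 pick from the first operand, 4..7 from the second).<br>

```python
LANES = "xyzw"

def shuffle_mask_for_write(writemask):
    """Build the shuffle mask for a partial write.

    Lanes named in `writemask` are taken from the new value (second
    operand, indices 4..7); all other lanes keep the existing value
    (first operand, indices 0..3).
    """
    n = len(LANES)
    return [i + n if LANES[i] in writemask else i for i in range(n)]

def shufflevector(a, b, mask):
    """Model of the lane selection: index i picks from a ++ b."""
    concat = list(a) + list(b)
    return [concat[i] for i in mask]

old = [1.0, 2.0, 3.0, 4.0]        # stands in for %mul_1
new = [10.0, 20.0, 30.0, 40.0]    # stands in for %add_1

mask = shuffle_mask_for_write("zw")   # the write from "add r0.zw, ..."
merged = shufflevector(old, new, mask)
print(mask)    # [0, 1, 6, 7]
print(merged)  # [1.0, 2.0, 30.0, 40.0]
```

The mask for "zw" comes out as 0, 1, 6, 7, which matches the constant used in the IR above: x and y keep the value from the mul, z and w take the value from the add.<br>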
<br>After instruction selection:<br><br> MUL %reg1024, %reg1025, %reg1026 <br> ADD %reg1027, %reg1028, %reg1029<br> MERGE %reg1030, %reg1024, "xy", %reg1027, "zw"<br> <br>
The 'shufflevector' is selected to a MERGE instruction by the default LLVM<br>instruction selector. The hardware doesn't have this instruction, so I have a <br>*pre*-register-allocation FunctionPass that records the following:<br>
<br> The physical register allocated to the destination register of MERGE <br> (%reg1030) should replace the physical registers allocated to the <br> destination registers of MUL (%reg1024) and ADD (%reg1027).<br> <br>
In this way I ensure MUL and ADD write to the same physical register. This <br>replacement is done in a second FunctionPass *after* register allocation.<br><br>MUL and ADD have an 'OptionalDefOperand' writemask. By default the writemask is<br>
"xyzw" (all elements are written). <br><br> // 0xF == all elements are written by default<br> def WRITEMASK : OptionalDefOperand<OtherVT, (ops i32imm), (ops (i32 0xF))> <br> {...}<br> <br> def MUL : MyInst<(outs REG4X32:$dst),<br>
(ins REG4X32:$src0, REG4X32:$src1, WRITEMASK:$wm),<br> <br>In the said post-register-allocation FunctionPass, in addition to replacing the <br>destination registers as described before, the writemask ($wm) of each <br>
instruction is also replaced with the matching writemask operand of MERGE. So:<br><br> MUL %R0, %R1, %R2, "xyzw" <br> ADD %R5, %R3, %R4, "xyzw"<br> MERGE %R6, %R0, "xy", %R5, "zw"<br>
<br>==><br><br> MUL %R6, %R1, %R2, "xy" // "xy" comes from MERGE operand 2<br> ADD %R6, %R3, %R4, "zw"<br> // MERGE %R6, %R0, "xy", %R5, "zw" <== REMOVED<br>
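The rewrite above can be sketched in a few lines of Python. This is a toy model, not LLVM code: the `Inst` class and `fold_merges` function are hypothetical names, registers and writemasks are plain strings, and it assumes each register feeding a MERGE has a single definition and no other use.<br>

```python
from dataclasses import dataclass

@dataclass
class Inst:
    op: str
    dst: str
    srcs: list
    wm: str = "xyzw"   # writemask; all lanes written by default

def fold_merges(insts):
    """Fold each MERGE into the instructions that define its inputs.

    A MERGE's srcs are laid out as [reg_a, mask_a, reg_b, mask_b].
    """
    # Pass 1: record, per merged input register, the MERGE destination
    # and the lanes that input contributes.
    rewrites = {}
    for inst in insts:
        if inst.op == "MERGE":
            for i in range(0, len(inst.srcs), 2):
                reg, mask = inst.srcs[i], inst.srcs[i + 1]
                rewrites[reg] = (inst.dst, mask)

    # Pass 2: retarget the defining instructions and drop the MERGE.
    out = []
    for inst in insts:
        if inst.op == "MERGE":
            continue
        if inst.dst in rewrites:
            new_dst, mask = rewrites[inst.dst]
            inst = Inst(inst.op, new_dst, list(inst.srcs), mask)
        out.append(inst)
    return out

prog = [
    Inst("MUL", "R0", ["R1", "R2"]),
    Inst("ADD", "R5", ["R3", "R4"]),
    Inst("MERGE", "R6", ["R0", "xy", "R5", "zw"]),
]
result = fold_merges(prog)
for inst in result:
    print(f"{inst.op} %{inst.dst}, {', '.join('%' + s for s in inst.srcs)}, \"{inst.wm}\"")
```

Running it on the three-instruction example leaves a MUL and an ADD that both write %R6, with writemasks "xy" and "zw". The single-definition assumption is what makes the blind retargeting safe here; any other use of %R0 or %R5 would have to be rewritten as well.<br>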
<br>Final machine code:<br><br> MUL r6.xy, r1, r2<br> ADD r6.zw, r3, r4<br> SUB r8, r6, r1<br><br>I don't feel very comfortable with these two very ad-hoc FunctionPasses.<br><br>Alex.<br>