It seems to me that LLVM sub-registers are not a good fit for the following hardware architecture.<br><br>All instructions of the hardware are vector instructions. Every register contains<br>four 32-bit FP sub-registers. For register r0 they are called r0.x, r0.y, r0.z and r0.w.<br>

<br>Most instructions write more than one element, like this:<br><br>  mul r0.xyw, r1, r2<br>  add r0.z, r3, r4<br>  sub r5, r0, r1<br>  <br>Notice that the four elements of r0 are written by two different instructions. <br>
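To pin down the semantics assumed in the rest of this post, here is a minimal Python model of a register file in which every vector instruction writes only the lanes named in its writemask. The helper names (`write_masked`, `vmul`, `vadd`, `vsub`) and the input values are made up for illustration; the model simply replays the three instructions above.

```python
LANES = "xyzw"


def write_masked(regs, dst, writemask, value):
    """Write value[i] into regs[dst][i] only for the lanes named in writemask."""
    old = regs.get(dst, [0.0] * 4)
    regs[dst] = [value[i] if LANES[i] in writemask else old[i] for i in range(4)]


def vmul(a, b):
    return [x * y for x, y in zip(a, b)]


def vadd(a, b):
    return [x + y for x, y in zip(a, b)]


def vsub(a, b):
    return [x - y for x, y in zip(a, b)]


# Made-up input values for the source registers.
regs = {
    "r1": [1.0, 2.0, 3.0, 4.0],
    "r2": [2.0, 2.0, 2.0, 2.0],
    "r3": [10.0, 20.0, 30.0, 40.0],
    "r4": [1.0, 1.0, 1.0, 1.0],
}

write_masked(regs, "r0", "xyw", vmul(regs["r1"], regs["r2"]))   # mul r0.xyw, r1, r2
write_masked(regs, "r0", "z", vadd(regs["r3"], regs["r4"]))     # add r0.z, r3, r4
write_masked(regs, "r5", "xyzw", vsub(regs["r0"], regs["r1"]))  # sub r5, r0, r1

print(regs["r0"])  # → [2.0, 4.0, 31.0, 8.0]  (x, y, w from mul; z from add)
```

The `sub` only sees a fully defined r0 because both partial writes landed in the same register, which is exactly the property the register allocator has to preserve.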

  <br>My question is how I should model these sub-registers. If I treat each component<br>as a separate register and allocate each one individually, it seems very <br>difficult to merge the scalar operations back into one vector operation. <br>

<br>  // each %reg is a sub-register<br>  // r1, r2, r3, r4 here are virtual register numbers<br>  <br>  mul %reg1024, r1, r2  // x<br>  mul %reg1025, r1, r2  // y<br>  mul %reg1026, r1, r2  // w<br>  <br>  add %reg1027, r3, r4  // z<br>

  <br>  sub %reg1028, %reg1024, r1  <br>  sub %reg1029, %reg1025, r1<br>  sub %reg1030, %reg1026, r1<br>  sub %reg1031, %reg1027, r1<br><br>So I decided to model each 4-element register as one Register in the *.td file.<br><br>
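The per-component alternative can be sketched in Python as follows (all names are hypothetical). Each writemasked vector instruction is split into one scalar op per written lane, each with a fresh virtual register numbered from 1024 as in the listing above. The only link back to the original vector register is a (register, lane) tag, and `regroup` shows the naive way to rebuild vector ops from those tags. That regrouping is only valid if the allocator kept every lane of one vector in the same physical register, which allocating each component independently does not guarantee.

```python
from itertools import count

LANES = "xyzw"
_vreg = count(1024)  # hypothetical vreg numbering, matching %reg1024... above


def scalarize(opcode, dst, writemask, srcs):
    """Split 'opcode dst.writemask, srcs' into one scalar op per written lane."""
    ops = []
    for lane in writemask:
        ops.append({
            "op": opcode,
            "def": f"%reg{next(_vreg)}",   # fresh virtual register per lane
            "vector": dst,                 # vector register it came from
            "lane": lane,
            "uses": [f"{s}.{lane}" for s in srcs],
        })
    return ops


def regroup(ops):
    """Naively rebuild masked vector ops by grouping scalar ops that share an
    opcode and a destination vector (lane order follows the scalar op order)."""
    groups = {}
    for op in ops:
        key = (op["op"], op["vector"])
        groups[key] = groups.get(key, "") + op["lane"]
    return [f"{opc} {vec}.{mask}" for (opc, vec), mask in groups.items()]


# mul r0.xyw, r1, r2 ; add r0.z, r3, r4
program = scalarize("mul", "r0", "xyw", ["r1", "r2"]) + \
          scalarize("add", "r0", "z", ["r3", "r4"])

for op in program:
    print(op["op"], op["def"] + ",", ", ".join(op["uses"]),
          "  //", op["vector"] + "." + op["lane"])
print(regroup(program))
```

Grouping purely by opcode and vector name would also wrongly fuse two unrelated `mul r0.x` at different program points, which is part of why recovering the vector form after the fact is fragile.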

Here are the details.<br><br>Since all 4 elements of a vector register share the same 'alloca', <br>during the conversion of shader assembly to LLVM IR I check whether a vector <br>register is written (to different elements) by different instructions. When <br>the second write happens, I generate a shufflevector that multiplexes the <br>existing value and the new value, and store the result of the shufflevector.<br><br>Input assembly language:<br>  mul r0.xy, r1, r2<br>  add r0.zw, r3, r4<br>

  sub r5, r0, r1<br>  <br>is converted to LLVM IR:  <br><br>  %r0 = alloca <4 x float><br>  %mul_1 = mul <4 x float> %r1, %r2<br>  store <4 x float> %mul_1, <4 x float>* %r0<br>  ...<br>  %add_1 = add <4 x float> %r3, %r4<br>

  ; a store does not immediately happen here<br>  %load_1 = load <4 x float>* %r0         <br>  <br>  ; select the first two elements from the existing value,<br>  ; the last two elements from the newly generated value  <br>

  %merge_1 = shufflevector <4 x float> %load_1, <br>                           <4 x float> %add_1, <br>                           <4 x i32> < i32 0, i32 1, i32 6, i32 7 ><br>                           <br>

  ; store the multiplexed value                           <br>  store <4 x float> %merge_1, <4 x float>* %r0<br>  <br><br>After mem2reg:<br><br>  %mul_1 = mul <4 x float> %r1, %r2<br>  %add_1 = add <4 x float> %r3, %r4<br>

  %merge_1 = shufflevector <4 x float> %mul_1, <br>                           <4 x float> %add_1, <br>                           <4 x i32> < i32 0, i32 1, i32 6, i32 7 ><br>                           <br>
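The front-end step described above boils down to building the shufflevector mask from the writemask of the second write: lane i keeps the old value (index i) unless it is written, in which case it takes index i + 4 from the new value. Below is a minimal Python sketch. The toy `Translator` class and its helpers are hypothetical (not an LLVM API), and it emits IR text in the same older `load <4 x float>* %r0` syntax used in this post.

```python
LANES = "xyzw"


def merge_mask(writemask):
    """shufflevector mask for merging old and new: lane i takes new[i]
    (index i + 4) if named in writemask, otherwise keeps old[i] (index i)."""
    return [i + 4 if LANES[i] in writemask else i for i in range(4)]


def shufflevector(a, b, mask):
    """Reference semantics of a 4-lane shufflevector: indices 0-3 select
    from a, indices 4-7 select from b."""
    both = list(a) + list(b)
    return [both[i] for i in mask]


def format_mask(mask):
    return "<4 x i32> < " + ", ".join(f"i32 {i}" for i in mask) + " >"


class Translator:
    """Toy emitter: the first write to a register stores the whole vector;
    every later write loads the old value, merges, and stores the result."""

    def __init__(self):
        self.written = set()  # registers whose alloca already holds a value
        self.n = 0            # numbering for %load_N / %merge_N
        self.out = []

    def write(self, reg, writemask, value):
        if reg not in self.written:
            self.written.add(reg)
            self.out.append(f"store <4 x float> {value}, <4 x float>* %{reg}")
            return
        self.n += 1
        load, merge = f"%load_{self.n}", f"%merge_{self.n}"
        self.out.append(f"{load} = load <4 x float>* %{reg}")
        self.out.append(f"{merge} = shufflevector <4 x float> {load}, "
                        f"<4 x float> {value}, {format_mask(merge_mask(writemask))}")
        self.out.append(f"store <4 x float> {merge}, <4 x float>* %{reg}")


t = Translator()
t.write("r0", "xy", "%mul_1")   # mul r0.xy, r1, r2
t.write("r0", "zw", "%add_1")   # add r0.zw, r3, r4
print("\n".join(t.out))
```

Running it prints the same store / load / shufflevector / store sequence as the hand-written IR above. After mem2reg, %load_1 is replaced by %mul_1.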

<br>After instruction selection:<br><br>  MUL   %reg1024, %reg1025, %reg1026                           <br>  ADD   %reg1027, %reg1028, %reg1029<br>  MERGE %reg1030, %reg1024, "xy", %reg1027, "zw"<br>   <br>
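A MERGE with two writemask operands can only express a shufflevector in which every lane stays in its own position. Here is a small Python sketch of that check. The helper names are hypothetical, and MERGE is assumed to keep lanes in place, as in the example. It turns a mask such as < 0, 1, 6, 7 > into the operands "xy" and "zw", and rejects swizzling masks, which would need a different lowering.

```python
LANES = "xyzw"


def shuffle_to_merge(mask):
    """Split a 4-lane shufflevector mask into MERGE's two writemasks, or
    return None if some lane would move to a different position."""
    from_first, from_second = "", ""
    for i, m in enumerate(mask):
        if m == i:
            from_first += LANES[i]     # lane i kept from the first operand
        elif m == i + 4:
            from_second += LANES[i]    # lane i taken from the second operand
        else:
            return None                # a swizzle: not expressible as MERGE
    return from_first, from_second


def merge_text(dst, first, second, mask):
    masks = shuffle_to_merge(mask)
    if masks is None:
        return None
    return f'MERGE {dst}, {first}, "{masks[0]}", {second}, "{masks[1]}"'


print(merge_text("%reg1030", "%reg1024", "%reg1027", [0, 1, 6, 7]))
# → MERGE %reg1030, %reg1024, "xy", %reg1027, "zw"
```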

The 'shufflevector' is selected into a MERGE instruction by the default LLVM<br>instruction selector. The hardware doesn't have this instruction. I have a <br>*pre*-register-allocation FunctionPass to remember:<br>
  <br>  The physical register allocated to the destination of MERGE <br>  (%reg1030) should replace the physical registers allocated to the <br>  destinations of MUL (%reg1024) and ADD (%reg1027).<br>  <br>

In this way I ensure that MUL and ADD write to the same physical register. The <br>replacement itself is done in a second FunctionPass that runs *after* register allocation.<br><br>MUL and ADD have an 'OptionalDefOperand' writemask. By default the writemask is<br>"xyzw" (all elements are written). <br><br>  // 0xF == all elements are written by default<br>  def WRITEMASK : OptionalDefOperand<OtherVT, (ops i32imm), (ops (i32 0xF))> <br>  {...}<br>  <br>  def MUL : MyInst<(outs REG4X32:$dst),<br>

                   (ins  REG4X32:$src0, REG4X32:$src1, WRITEMASK:$wm),<br>                    <br>In that post-register-allocation FunctionPass, in addition to replacing the <br>destination registers as described before, the writemask ($wm) of each <br>instruction is also replaced with the corresponding writemask operand of MERGE. So:<br><br>  MUL   %R0, %R1, %R2, "xyzw"                           <br>  ADD   %R5, %R3, %R4, "xyzw"<br>  MERGE %R6, %R0, "xy", %R5, "zw"<br>

  <br>==><br><br>  MUL   %R6, %R1, %R2, "xy"  // "xy" comes from MERGE operand 2<br>  ADD   %R6, %R3, %R4, "zw"<br>  // MERGE %R6, %R0, "xy", %R5, "zw" <== REMOVED<br>
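The post-register-allocation rewrite can be sketched on a toy instruction list in Python. `MI`, `try_fold` and `fold_merges` are hypothetical names, not LLVM's MachineInstr API; in LLVM the real pass would be a MachineFunctionPass. One hazard deserves an explicit check: the allocator only sees %R6 become live at the MERGE, so nothing stops it from giving %R6 to another value that is live between the MUL and the MERGE. The sketch therefore refuses to fold when any other instruction in that range reads or writes the MERGE destination, or when a MERGE source is read by anything other than the MERGE. Both checks are conservative stand-ins for real liveness queries and only look at a single basic block.

```python
from dataclasses import dataclass, field


@dataclass
class MI:
    """Toy post-RA machine instruction (not LLVM's MachineInstr)."""
    op: str
    dst: str
    srcs: list
    wm: str = "xyzw"                            # default writemask 0xF
    masks: list = field(default_factory=list)  # MERGE only: one mask per source

    def __str__(self):
        if self.op == "MERGE":
            pairs = ", ".join(f'{s}, "{m}"' for s, m in zip(self.srcs, self.masks))
            return f"MERGE {self.dst}, {pairs}"
        return f'{self.op} {self.dst}, {", ".join(self.srcs)}, "{self.wm}"'


def try_fold(block, i):
    """Fold the MERGE at block[i] into its source definitions if it looks safe."""
    merge = block[i]
    if len(set(merge.srcs)) != len(merge.srcs):
        return False
    defs = {}
    for src in merge.srcs:
        earlier = [j for j in range(i) if block[j].dst == src]
        other_readers = [mi for mi in block if mi is not merge and src in mi.srcs]
        if not earlier or other_readers:
            return False
        defs[src] = earlier[-1]                 # nearest preceding definition
    first = min(defs.values())
    for j in range(first, i):
        writes_dst = block[j].dst == merge.dst and j not in defs.values()
        reads_dst = merge.dst in block[j].srcs and j != first
        if writes_dst or reads_dst:
            return False                        # MERGE destination is busy
    for src, mask in zip(merge.srcs, merge.masks):
        block[defs[src]].dst = merge.dst        # write MERGE's register ...
        block[defs[src]].wm = mask              # ... but only MERGE's lanes
    del block[i]                                # the MERGE itself disappears
    return True


def fold_merges(block):
    i = 0
    while i < len(block):
        if block[i].op == "MERGE" and try_fold(block, i):
            continue                            # block[i] is now the next instruction
        i += 1
    return block


block = [
    MI("MUL", "%R0", ["%R1", "%R2"]),
    MI("ADD", "%R5", ["%R3", "%R4"]),
    MI("MERGE", "%R6", ["%R0", "%R5"], masks=["xy", "zw"]),
    MI("SUB", "%R8", ["%R6", "%R1"]),
]
for mi in fold_merges(block):
    print(mi)
```

On the example this reproduces the rewritten sequence above. If, say, an instruction reading %R6 sat between the ADD and the MERGE, the MERGE would be left in place instead of silently clobbering %R6.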

 <br>Final machine code:<br><br>  MUL r6.xy, r1, r2<br>  ADD r6.zw, r3, r4<br>  SUB r8, r6, r1<br><br>I don't feel very comfortable with these two very ad-hoc FunctionPasses.<br><br>Alex.<br>