[PATCH] D78091: [AMDGPU] Enable carry out ADD/SUB operations divergence driven instruction selection.

Fri Apr 24 05:23:25 PDT 2020

alex-t added inline comments.

================
Comment at: llvm/lib/Target/AMDGPU/SIISelLowering.cpp:3698
+                                            ? MRI.getRegClass(Src1.getReg())
+                                            : &AMDGPU::VGPR_32RegClass;
+    const TargetRegisterClass *Src0SubRC =
----------------
rampitec wrote:
> alex-t wrote:
> > rampitec wrote:
> > > alex-t wrote:
> > > > rampitec wrote:
> > > > > VReg_64? Since it did not fail anywhere this case must be not covered by any tests.
> > > > I maybe misunderstand the documentation, but it says that the size we only can have 32bit immediate aa operand.
> > > > I also did some experiments with different targets (gfx600,900,1010) and always have seen that 64bit size constant was split into 2 32bit parts for addition.
> > > > Please correct me if I understand it in a wrong way.
> > > Two lines below you are asking for sub0 of that RC. VGPR_32 does not have sub0.
> > Yes. And it is exactly what is expected :) 
> > getSubRegClass   returns VGPR_32RegClass itself in this case. In fact id does not matter what it is.
> > buildExtractSubRegOrImm does not use SubRC argument if operand is immediate.
> > 
> > 
> > ```
> >  if (Op.isImm()) {
> >     if (SubIdx == AMDGPU::sub0)
> >       return MachineOperand::CreateImm(static_cast<int32_t>(Op.getImm()));
> >     if (SubIdx == AMDGPU::sub1)
> >       return MachineOperand::CreateImm(static_cast<int32_t>(Op.getImm() >> 32));
> > 
> >     llvm_unreachable("Unhandled register index for immediate");
> >   }
> > ```
> > Once again, I maybe don't understand what your objection is about.
> > 
> > For the simple i64 immediate addition like this:
> > 
> > ```
> >   %add = add i64 20015998343286, %a
> > ```
> > we generate carry out:
> > 
> > ```
> > 	s_add_u32 s0, s2, 0x56789876
> > 	s_addc_u32 s1, s3, 0x1234
> > ```
> > for uniform
> > 
> > or 
> > ```
> > 	v_add_co_u32_e32 v0, vcc, 0x56789876, v0
> > 	v_mov_b32_e32 v1, 0x1234
> > 	v_addc_co_u32_e32 v1, vcc, 0, v1, vcc
> > 
> > ```
> > for divergent.
> > 
> > So, why do we need VReg_64?
> > 
> > 
> You are calling TRI->getSubRegClass(Src1RC, AMDGPU::sub1); on this RC. You want to have VGPR_32 as an answer. Even though getSubRegClass() may return RC itself if it does not have a requested subreg this sounds like a bug. It would be more natural for it to assert. To be on a safe side pass there VReg_64 to get the same VGPR_32 or just do not call it if it is an immediate.
Fine, we have one weird piece - SIRegisterInfo::getSubRegClass that returns input argument back if does not succeed.
Then we have one more weird piece - SIInstrInfo::buildExtractSubRegOrImm that is in fact 2 separate functions. It does completely different things for register and immediate, and just ignores register class arguments for immediate. Unfortunately all arguments required. So, I have to pass to it register class and sub-register class despite of the fact they're not used.
Refactoring these 2 pieces should be separate change.
If you insist that using VGPR_32RegClass is misleading I have no choice but VReg_64RegClass, that is misleading either IMHO.
I'll have to add a FIXME comment to explain why we use VGPR_32RegClass for immediate that can be 32bit only.

================
Comment at: llvm/lib/Target/AMDGPU/SIISelLowering.cpp:3688
+    Register CarryReg = MRI.createVirtualRegister(CarryRC);
+    Register DeadCarryReg = MRI.createVirtualRegister(CarryRC);
+
----------------
rampitec wrote:
> You can probably use SIInstrInfo::getAddNoCarry() and extend it to produce sub as well or create e new helper. You are always using I32 version even if a no-carry U32 version is available.
Not sure which line this comment belong...
For LoOpc I indeed need carry out opcode.
getAddNoCarry returnc the addition that does not write the carry flag.

Anyway, the exact opcode is selected later on by the SIInstrInfo::pseudoToMCOpcode

```
  renamable $vgpr0 = V_ADD_I32_e32 1450743926, killed $vgpr0, implicit-def $vcc, implicit $exec
  renamable $vgpr1 = V_MOV_B32_e32 4660, implicit $exec
  renamable $vgpr1 = V_ADDC_U32_e32 0, killed $vgpr1, implicit-def $vcc, implicit killed $vcc, implicit $exec
```
 turns to the 

```
	v_add_co_u32_e32 v0, vcc, 0x56789876, v0
	v_mov_b32_e32 v1, 0x1234
	v_addc_co_u32_e32 v1, vcc, 0, v1, vcc

```
for gfx9

but to the

```
	v_add_i32_e32 v0, vcc, 0x56789876, v0
	v_mov_b32_e32 v1, 0x1234
	v_addc_u32_e32 v1, vcc, 0, v1, vcc

```
for gfx6

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D78091/new/

https://reviews.llvm.org/D78091