[PATCH] D78091: [AMDGPU] Enable carry out ADD/SUB operations divergence driven instruction selection.

Alexander via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Apr 23 11:21:37 PDT 2020


alex-t added inline comments.


================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp:1086
+  if (!IsVALU) {
+    for (auto Use : N->uses()) {
+      if (Use->isMachineOpcode() && TII->isVALU(Use->getMachineOpcode())) {
----------------
arsenm wrote:
> alex-t wrote:
> > arsenm wrote:
> > > Shouldn't need to scan all uses?
> > I assume that one user selected to VALU is enough to make selecting the carry-out to SALU impractical. Also, how would you suggest deciding between VALU and SALU based on the number of VALU users? A VALU/SALU ratio? Some other heuristic?
> There's already  a isVGPRImm function, which should be functionally the same
Not really. isVGPRImm walks the use list and checks whether each use can accept an SGPR. It also tries to commute operands where possible to make an SGPR acceptable. Its main goal is to select an SGPR instead of a VGPR in order to keep the immediate whenever possible. It works in tandem with SIFoldOperands.
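
For reference, here is roughly what the scan in the snippet above does. This is a hedged completion of the truncated diff context: the `IsVALU` flag and the loop header are from the patch, the loop body is assumed from the surrounding discussion.

```
// Completion of the truncated context above; the loop body is an
// assumption based on this discussion, not the exact patch text.
if (!IsVALU) {
  for (auto Use : N->uses()) {
    // One VALU user is enough: the SALU carry-out (SCC) would have to be
    // copied into a 64-bit condition register for that user anyway.
    if (Use->isMachineOpcode() && TII->isVALU(Use->getMachineOpcode())) {
      IsVALU = true;
      break;
    }
  }
}
```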

Here we try to work around the following sub-optimal selection pattern:
```
	s_sub_i32 s0, s0, s1
	s_cselect_b64 s[0:1], 1, 0
	v_cndmask_b32_e64 v0, 0, 1, s[0:1]
```

which appears instead of:

```
	v_sub_co_u32_e32 v4, vcc, s0, v4
	v_cndmask_b32_e64 v5, 0, 1, vcc
```
We need s_cselect to copy SCC into an SReg_64.
In this sense, any explicit use of the condition register pair costs one extra instruction in the uniform UADDO/USUBO case. At the same time, it saves one VGPR.
The only case that does not produce an extra instruction is a true carry-out operation, where the only user of the UADDO/USUBO carry result is the ADDCARRY/SUBCARRY node.
So I decided to change this to an explicit check. It is over-conservative, but it is okay for now.
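
A minimal sketch of what such an explicit check could look like, assuming the standard SelectionDAG use-iteration API (an illustration of the idea, not the actual patch):

```
// Sketch only: treat the operation as a true carry out iff the sole user
// of the carry result (result #1 of UADDO/USUBO) is ADDCARRY/SUBCARRY.
bool IsTrueCarryOut = true;
for (SDNode::use_iterator UI = N->use_begin(), E = N->use_end(); UI != E;
     ++UI) {
  if (UI.getUse().getResNo() != 1) // Only look at users of the carry.
    continue;
  unsigned Opc = UI->getOpcode();
  if (Opc != ISD::ADDCARRY && Opc != ISD::SUBCARRY) {
    IsTrueCarryOut = false;
    break;
  }
}
```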


================
Comment at: llvm/lib/Target/AMDGPU/SIISelLowering.cpp:3698
+                                            ? MRI.getRegClass(Src1.getReg())
+                                            : &AMDGPU::VGPR_32RegClass;
+    const TargetRegisterClass *Src0SubRC =
----------------
rampitec wrote:
> alex-t wrote:
> > rampitec wrote:
> > > VReg_64? Since it did not fail anywhere, this case must not be covered by any tests.
> > Maybe I misunderstand the documentation, but it says we can only have a 32-bit immediate as an operand.
> > I also did some experiments with different targets (gfx600, gfx900, gfx1010) and always saw that a 64-bit constant was split into two 32-bit parts for the addition.
> > Please correct me if I am understanding this wrongly.
> Two lines below you are asking for sub0 of that RC. VGPR_32 does not have sub0.
Yes. And that is exactly what is expected :)
getSubRegClass returns VGPR_32RegClass itself in this case. In fact, it does not matter what it is:
buildExtractSubRegOrImm does not use the SubRC argument if the operand is an immediate.


```
if (Op.isImm()) {
  if (SubIdx == AMDGPU::sub0)
    return MachineOperand::CreateImm(static_cast<int32_t>(Op.getImm()));
  if (SubIdx == AMDGPU::sub1)
    return MachineOperand::CreateImm(static_cast<int32_t>(Op.getImm() >> 32));

  llvm_unreachable("Unhandled register index for immediate");
}
```
Once again, maybe I don't understand what your objection is about.

For a simple i64 immediate addition like this:

```
  %add = add i64 20015998343286, %a
```
we generate a carry-out sequence:

```
	s_add_u32 s0, s2, 0x56789876
	s_addc_u32 s1, s3, 0x1234
```
for the uniform case (20015998343286 = 0x123456789876, so the low half is 0x56789876 and the high half is 0x1234),

or 
```
	v_add_co_u32_e32 v0, vcc, 0x56789876, v0
	v_mov_b32_e32 v1, 0x1234
	v_addc_co_u32_e32 v1, vcc, 0, v1, vcc
```
for the divergent case.

So, why do we need VReg_64?
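
To illustrate, a hedged sketch of the immediate path (Src1RC and Src1SubRC follow the naming pattern visible in the diff; the exact call site is an assumption): buildExtractSubRegOrImm splits the 64-bit immediate itself and never consults the sub-register class, which is why VGPR_32 is fine here.

```
// Sketch: for an immediate Src1, SubRC is never consulted, and both
// halves come back as plain 32-bit immediate operands.
MachineOperand Src1Sub0 = TII->buildExtractSubRegOrImm(
    MI, MRI, Src1, Src1RC, AMDGPU::sub0, Src1SubRC); // low:  0x56789876
MachineOperand Src1Sub1 = TII->buildExtractSubRegOrImm(
    MI, MRI, Src1, Src1RC, AMDGPU::sub1, Src1SubRC); // high: 0x1234
```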




CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D78091/new/

https://reviews.llvm.org/D78091




