[PATCH] D25998: AMDGPU/SI: Don't use non-0 waitcnt values when waiting on Flat instructions

Tony Tye via llvm-commits llvm-commits at lists.llvm.org
Wed Nov 2 00:01:35 PDT 2016


tony-tye added a comment.

Feedback on overall pass.



================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:54
 
 typedef Counters RegCounters[512];
 typedef std::pair<unsigned, unsigned> RegInterval;
----------------
Is there a named constant for the maximum register number rather than using 512?


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:88
 
   /// \brief Different export instruction types seen since last wait.
   unsigned ExpInstrTypesSeen;
----------------
It would be helpful to state what the bits mean. It seems 1 is EXPORT and 2 is MEM-WRITE; perhaps define an enumeration and use it.

Would need to add 4 for GDS when supported.
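A minimal sketch of such an enumeration; the EXPORT and MEM-WRITE bit values follow what the existing code appears to use, and the GDS bit is an assumption for future support:

```cpp
#include <cassert>

// Hypothetical named bits for ExpInstrTypesSeen. EXPORT and VMEM-write
// mirror the values implied by the current "== 3" check; GDS is speculative.
enum ExpInstrType : unsigned {
  EXP_TYPE_EXPORT     = 1u << 0, // EXP instructions
  EXP_TYPE_VMEM_WRITE = 1u << 1, // VMEM stores
  EXP_TYPE_GDS        = 1u << 2  // GDS, once supported
};
```

With this, `ExpInstrTypesSeen |= EXP_TYPE_EXPORT` reads more clearly than `|= 1`, and the "both seen" state is spelled `EXP_TYPE_EXPORT | EXP_TYPE_VMEM_WRITE` instead of the magic number 3.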


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:137
 
   /// Return true if there are LGKM instrucitons that haven't been waited on
   /// yet.
----------------
instrucitons -> instructions 


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:197
   Result.Named.EXP = !!(TSFlags & SIInstrFlags::EXP_CNT &&
       (MI.getOpcode() == AMDGPU::EXP || MI.getDesc().mayStore()));
 
----------------
Only GFX6 uses exp_cnt for stores.

Later targets do not increment this count, but stores of more than 2 dwords have a hardware hazard that requires at least one instruction between the store and the next write of the register.

So the EXP_CNT property should only be put on M*BUF instructions for GFX6 and not for later targets.


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:300-301
 
+  if (TII->isFLAT(*I))
+    IsFlatOutstanding = true;
+
----------------
If the flat operation is known not to access LDS then it cannot return out of order. For example, flat is used on 64-bit targets to access the global address space. So I wonder if this should also check the address space of the operation, and only do this when the address space is FLAT (and not when it is GLOBAL)?
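As a standalone sketch of that check; the address-space numbering and helper name here are illustrative only, not the actual AMDGPU definitions:

```cpp
#include <cassert>

// Illustrative address-space tags; the real values live in the backend's
// address-space definitions.
enum AddrSpace : unsigned { AS_GLOBAL = 1, AS_FLAT = 4 };

// A FLAT operation can only complete out of order if it might actually hit
// LDS, i.e. its address is in the generic FLAT space rather than a known
// GLOBAL one.
static bool flatMayCompleteEarly(bool IsFlat, unsigned AS) {
  return IsFlat && AS == AS_FLAT;
}
```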


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:316
 
   if (ST->getGeneration() >= SISubtarget::VOLCANIC_ISLANDS) {
     // Any occurrence of consecutive VMEM or SMEM instructions forms a VMEM
----------------
Should this be querying a subtarget feature instead of a specific target generation? The feature here seems to be that soft clauses are supported.


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:320-325
     // The temporary workaround is to break the clauses with S_NOP.
     //
     // The proper solution would be to allocate registers such that all source
     // and destination registers don't overlap, e.g. this is illegal:
     //   r0 = load r2
     //   r2 = load r0
----------------
This comment indicates that both SMEM and VMEM clauses must be broken, but the following code only handles VMEM as SMEM is handled elsewhere.

The rules for VMEM only have to be followed when XNACK is supported. However, the rules for SMEM need to be followed regardless of whether XNACK is enabled as SMEM operations can complete out of order.


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:326-331
     if (LastOpcodeType == VMEM && Increment.Named.VM) {
       // Insert a NOP to break the clause.
       BuildMI(MBB, I, DebugLoc(), TII->get(AMDGPU::S_NOP))
           .addImm(0);
       LastInstWritesM0 = false;
     }
----------------
Don't VMEM clauses only have to ensure input registers are not modified inside the clause when XNACK is supported? We now have a subtarget feature to indicate that, so should it be used here instead of checking the generation? That is, should this NOP insertion only be done when the XNACK feature is enabled?


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:377-378
 
-  // VM_CNT is always ordered
-  Ordered[0] = true;
+  // VM_CNT is always ordered except when there are flat instructions, which
+  // can return out of order.
+  Ordered[0] = !IsFlatOutstanding;
----------------
This is conservatively correct. But a better approach would be to not increment the vmcnt for flat instructions that access the generic address space, and to record maxint in DefinedRegs for the destination. That would allow a non-0 vmcnt for registers produced by non-flat instructions (still conservative, since the value assumes the flat operation may have completed early), and force 0 for the result of the flat instruction.


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:381
 
   // EXP_CNT is unordered if we have both EXP & VM-writes
   Ordered[1] = ExpInstrTypesSeen == 3;
----------------
Is this still true with current hardware? Pre-SI I think this was the case, but I thought SI onwards no longer used the export counter for VMEM instructions?


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:97
+  /// Whether or not we have flat operations outstanding.
+  bool IsFlatOutstanding;
+
----------------
Given that this is only for flat instructions that can complete early, not all flat, should this be renamed?


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:193
 
   Result.Named.VM = !!(TSFlags & SIInstrFlags::VM_CNT);
 
----------------
Are BUFFER_CACHE_INV* marked as updating vmcnt?
Are FLAT* marked as updating vmcnt?
Are GDS instructions marked as lgkmcnt and expcnt?
GDS needs waitcnt 0 before EXEC can be updated.


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:199
 
   // LGKM may uses larger values
   if (TSFlags & SIInstrFlags::LGKM_CNT) {
----------------
// LGKM counters may be incremented by more than 1.


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:200
   // LGKM may uses larger values
   if (TSFlags & SIInstrFlags::LGKM_CNT) {
 
----------------
Are S_DCACHE_* marked as updating lgkmcnt?
Are FLAT* marked as updating lgkmcnt?


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:202
 
     if (TII->isSMRD(MI)) {
 
----------------
This check should also apply to scalar writes.


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:216
         // XXX - What is the right value?
         Result.Named.LGKM = 1;
       }
----------------
The scalar data cache invalidate and writeback instructions do not affect the lgkm counter so should not be marked in the td file as affecting the counter.


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:224
   } else {
     Result.Named.LGKM = 0;
   }
----------------
Why is this needed as Result is initialized to 0?


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:241
   MachineInstr &MI = *Op.getParent();
   if (MI.getOpcode() == AMDGPU::EXP)
     return true;
----------------
also GDS


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:246
   if (!MI.getDesc().mayStore())
     return false;
 
----------------
Are there any source operands that are registers with a counter value that has not yet been satisfied (i.e. counter < value already waited on)?


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:268-275
   // NOTE: This assumes that the value operand is before the
   // address operand, and that there is only one value operand.
   for (MachineInstr::mop_iterator I = MI.operands_begin(),
        E = MI.operands_end(); I != E; ++I) {
 
     if (I->isReg() && I->isUse())
       return Op.isIdenticalTo(*I);
----------------
Only GFX6 requires using the expcnt to determine whether the input value is in use.
There is also a hardware hazard if the input is larger than N dwords, which requires M instructions before the register can be used as a destination (is that hazard checked?).


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:333
 
     if (TII->isSMRD(*I))
       LastOpcodeType = SMEM;
----------------
Should this also include scalar writes?


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:333-337
     if (TII->isSMRD(*I))
       LastOpcodeType = SMEM;
     else if (Increment.Named.VM)
       LastOpcodeType = VMEM;
   }
----------------
tony-tye wrote:
> Should this also include scalar writes?
Why is this if nested inside the enclosing if? It seems tracking LastOpcodeType should be done regardless of breaking the soft clauses, for consistency.


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:341
   if (Increment.Named.EXP) {
     ExpInstrTypesSeen |= I->getOpcode() == AMDGPU::EXP ? 1 : 2;
   }
----------------
Add another bit for GDS.
Exports are kept in order only within each export type (color/null, position, parameter cache) so need separate bits.


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:359
       if (Op.isUse())
         UsedRegs[j] = Limit;
     }
----------------
For UsedRegs only the expcnt needs to be waited on before the register is available. For a VMEM store on GFX6 both vmcnt and expcnt will be present in Limit; for GDS both expcnt and lgkmcnt will be present in Limit. So should this just update the expcnt?


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:382
   // EXP_CNT is unordered if we have both EXP & VM-writes
   Ordered[1] = ExpInstrTypesSeen == 3;
 
----------------
Should this be != 3? If both are seen then the value will be 3, so 3 means it is NOT ordered, not that it IS ordered.
If other bits are added for GDS and the export types, it would be better to use BitCount(ExpInstrTypesSeen) == 1.
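For example, using a population count (std::bitset stands in here for the backend's bit-count utility):

```cpp
#include <bitset>
#include <cassert>

// EXP_CNT is ordered only when exactly one class of EXP-counted instruction
// has been seen since the last wait; with multiple classes outstanding the
// completion order across classes is not guaranteed.
static bool isExpCntOrdered(unsigned ExpInstrTypesSeen) {
  return std::bitset<32>(ExpInstrTypesSeen).count() == 1;
}
```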


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:385
   // LGKM_CNT is handled as always unordered. TODO: Handle LDS and GDS
   Ordered[2] = false;
 
----------------
Currently the LGKM counter is always assumed unordered, but this could be improved by tracking the classes of instruction that update it (as is done for EXP_CNT); a non-0 waitcnt could then be used when only a single class of instruction has been seen since the last LGKM waitcnt. This would potentially benefit the DS_* instructions greatly.


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:401
     if (Ordered[i]) {
       unsigned Value = LastIssued.Array[i] - Required.Array[i];
 
----------------
If Required is 0 then no wait is needed on this counter, so Value should be set to the hardware limit. Only if Required is non-0 is there an instruction in this BB that we must wait on.
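A sketch of that suggestion, with HardwareLimit standing in for the per-counter hardware maximum (the helper name is hypothetical):

```cpp
#include <cassert>

// If nothing in this block must be waited on (Required == 0), the usable
// wait value is the hardware maximum for the counter, not a distance from
// LastIssued. Otherwise the distance gives how far behind we may lag.
static unsigned usableWaitValue(unsigned Required, unsigned LastIssued,
                                unsigned HardwareLimit) {
  return Required == 0 ? HardwareLimit : LastIssued - Required;
}
```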


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:459
   for (unsigned i = 0; i < 3; ++i) {
     if (Counts.Array[i] <= LastIssued.Array[i])
       WaitOn.Array[i] = LastIssued.Array[i] - Counts.Array[i];
----------------
A delayed waitcnt is needed if Counts is trying to wait on an instruction issued after the WaitedOn point. So this should be:

if (Counts.Array[i] < LastIssued.Array[i] - WaitedOn.Array[i])
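Modelled standalone per counter; this is only a sketch of the suggested condition, with a hypothetical helper name:

```cpp
#include <cassert>

// True when Count refers to an instruction issued after the point WaitedOn
// already covers, so a (possibly delayed) wait is still required.
static bool needsDelayedWait(unsigned Count, unsigned LastIssued,
                             unsigned WaitedOn) {
  return Count < LastIssued - WaitedOn;
}
```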


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:487
       if (Op.isDef()) {
         increaseCounters(Result, UsedRegs[j]);
         increaseCounters(Result, DefinedRegs[j]);
----------------
Only the expcnt should be considered. When UsedRegs is set it includes all counters, and we do not need to wait for a GFX6 store to complete before being able to use the source register; that only has to be waited for before using the destination register.


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:501
                                   MachineBasicBlock::iterator I) {
   if (ST->getGeneration() < SISubtarget::VOLCANIC_ISLANDS)
     return;
----------------
Should this be a target feature?


================
Comment at: lib/Target/AMDGPU/SIInsertWaits.cpp:558
 
       if (ST->getGeneration() <= SISubtarget::SEA_ISLANDS) {
         // There is a hardware bug on CI/SI where SMRD instruction may corrupt
----------------
It would seem better if this were a target feature that is tested.


Repository:
  rL LLVM

https://reviews.llvm.org/D25998




