[PATCH] AMDGPU/SI: Better handle s_wait insertion

Sat Jul 25 14:50:00 PDT 2015

> On Jul 25, 2015, at 11:13 AM, Axel Davy <axel.davy at ens.fr> wrote:
> 
> We can wait on either VM, EXP or LGKM.
> The waits are independent.
> 
> Without this patch, a wait inserted because of one of them
> would also wait for all the previous others.
> This patch makes s_wait only wait for the ones we need for the next instruction.
> 
> Here's an example of subtle perf reduction this patch solves:
> 
> This is without the patch:
> 
> buffer_load_format_xyzw v[8:11], v0, s[44:47], 0 idxen
> buffer_load_format_xyzw v[12:15], v0, s[48:51], 0 idxen
> s_load_dwordx4 s[44:47], s[8:9], 0xc
> s_waitcnt lgkmcnt(0)
> buffer_load_format_xyzw v[16:19], v0, s[52:55], 0 idxen
> s_load_dwordx4 s[48:51], s[8:9], 0x10
> s_waitcnt vmcnt(1)
> buffer_load_format_xyzw v[20:23], v0, s[44:47], 0 idxen

Can you come up with a testcase that looks like this? Liberal use of volatile will probably help construct one that isn’t too fragile

> 
> The s_waitcnt vmcnt(1) is useless.
> The reason it is added is because the last
> buffer_load_format_xyzw needs s[44:47], which was issued
> by the first s_load_dwordx4. It waits for all VM
> before that call to have finished.
> 
> Internally after every instruction, 3 counters (for VM, EXP and LGTM)
> are updated after every instruction. For example buffer_load_format_xyzw will
> increase the VM counter, and s_load_dwordx4 the LGKM one.
> 
> Without the patch, for every defined register,
> the current 3 counters are stored, and are used to know
> how long to wait when an instruction needs the register.
> 
> Because of that, the s[44:47] counter includes that to use the register
> you need to wait for the previous buffer_load_format_xyzw.
> 
> Instead this patch stores only the counters that matter for the register,
> and puts zero for the other ones, since we don't need any wait for them.

This is a useful description of the problem. Can you put some of this in a comment?

> ---
> lib/Target/AMDGPU/SIInsertWaits.cpp | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/Target/AMDGPU/SIInsertWaits.cpp b/lib/Target/AMDGPU/SIInsertWaits.cpp
> index 90a37f1..b08851f 100644
> --- a/lib/Target/AMDGPU/SIInsertWaits.cpp
> +++ b/lib/Target/AMDGPU/SIInsertWaits.cpp
> @@ -246,10 +246,13 @@ void SIInsertWaits::pushInstruction(MachineBasicBlock &MBB,
> 
>   // Get the hardware counter increments and sum them up
>   Counters Increment = getHwCounts(*I);
> +  Counters Limit = ZeroCounts;
>   unsigned Sum = 0;
> 
>   for (unsigned i = 0; i < 3; ++i) {
>     LastIssued.Array[i] += Increment.Array[i];
> +    if (Increment.Array[i])
> +      Limit.Array[i] = LastIssued.Array[i];
>     Sum += Increment.Array[i];
>   }

I’m not particularly familiar with how this pass works, but why is this LastIssued += Increment? The comment for LastIssued says
" Counter values for last instruction issued.”, but adding the new instructions counters seems to imply this is a global counter of the currently outstanding waits. Should this be renamed to something like CurrentIssued?

> 
> @@ -300,11 +303,11 @@ void SIInsertWaits::pushInstruction(MachineBasicBlock &MBB,
> 
>       // Remember which registers we define
>       if (Op.isDef())
> -        DefinedRegs[j] = LastIssued;
> +        DefinedRegs[j] = Limit;
> 
>       // and which one we are using
>       if (Op.isUse())
> -        UsedRegs[j] = LastIssued;
> +        UsedRegs[j] = Limit;
>     }
>   }
> }