[PATCH] AMDGPU/SI: Better handle s_wait insertion
Matt Arsenault
arsenm2 at gmail.com
Sat Jul 25 14:50:00 PDT 2015
> On Jul 25, 2015, at 11:13 AM, Axel Davy <axel.davy at ens.fr> wrote:
>
> We can wait on either VM, EXP or LGKM.
> The waits are independent.
>
> Without this patch, a wait inserted because of one of them
> would also wait for all the previous others.
> This patch makes s_wait only wait for the ones we need for the next instruction.
>
> Here's an example of subtle perf reduction this patch solves:
>
> This is without the patch:
>
> buffer_load_format_xyzw v[8:11], v0, s[44:47], 0 idxen
> buffer_load_format_xyzw v[12:15], v0, s[48:51], 0 idxen
> s_load_dwordx4 s[44:47], s[8:9], 0xc
> s_waitcnt lgkmcnt(0)
> buffer_load_format_xyzw v[16:19], v0, s[52:55], 0 idxen
> s_load_dwordx4 s[48:51], s[8:9], 0x10
> s_waitcnt vmcnt(1)
> buffer_load_format_xyzw v[20:23], v0, s[44:47], 0 idxen
Can you come up with a testcase that looks like this? Liberal use of volatile will probably help construct one that isn’t too fragile
>
> The s_waitcnt vmcnt(1) is useless.
> The reason it is added is because the last
> buffer_load_format_xyzw needs s[44:47], which was issued
> by the first s_load_dwordx4. It waits for all VM
> before that call to have finished.
>
> Internally after every instruction, 3 counters (for VM, EXP and LGTM)
> are updated after every instruction. For example buffer_load_format_xyzw will
> increase the VM counter, and s_load_dwordx4 the LGKM one.
>
> Without the patch, for every defined register,
> the current 3 counters are stored, and are used to know
> how long to wait when an instruction needs the register.
>
> Because of that, the s[44:47] counter includes that to use the register
> you need to wait for the previous buffer_load_format_xyzw.
>
> Instead this patch stores only the counters that matter for the register,
> and puts zero for the other ones, since we don't need any wait for them.
This is a useful description of the problem. Can you put some of this in a comment?
> ---
> lib/Target/AMDGPU/SIInsertWaits.cpp | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/lib/Target/AMDGPU/SIInsertWaits.cpp b/lib/Target/AMDGPU/SIInsertWaits.cpp
> index 90a37f1..b08851f 100644
> --- a/lib/Target/AMDGPU/SIInsertWaits.cpp
> +++ b/lib/Target/AMDGPU/SIInsertWaits.cpp
> @@ -246,10 +246,13 @@ void SIInsertWaits::pushInstruction(MachineBasicBlock &MBB,
>
> // Get the hardware counter increments and sum them up
> Counters Increment = getHwCounts(*I);
> + Counters Limit = ZeroCounts;
> unsigned Sum = 0;
>
> for (unsigned i = 0; i < 3; ++i) {
> LastIssued.Array[i] += Increment.Array[i];
> + if (Increment.Array[i])
> + Limit.Array[i] = LastIssued.Array[i];
> Sum += Increment.Array[i];
> }
I’m not particularly familiar with how this pass works, but why is this LastIssued += Increment? The comment for LastIssued says
" Counter values for last instruction issued.”, but adding the new instructions counters seems to imply this is a global counter of the currently outstanding waits. Should this be renamed to something like CurrentIssued?
>
> @@ -300,11 +303,11 @@ void SIInsertWaits::pushInstruction(MachineBasicBlock &MBB,
>
> // Remember which registers we define
> if (Op.isDef())
> - DefinedRegs[j] = LastIssued;
> + DefinedRegs[j] = Limit;
>
> // and which one we are using
> if (Op.isUse())
> - UsedRegs[j] = LastIssued;
> + UsedRegs[j] = Limit;
> }
> }
> }
More information about the llvm-commits
mailing list