[PATCH] D57737: [AMDGPU] Fix DPP sequence in atomic optimizer.

Mon Feb 11 03:49:51 PST 2019

sheredom marked an inline comment as done.
sheredom added a comment.

In D57737#1392667 <https://reviews.llvm.org/D57737#1392667>, @tpr wrote:

> But I still don't understand it:
>
> 1. Why do you want an exclusive scan? Surely what you're trying to do is just "sum" up all lanes into lane 63, which is an inclusive scan.

No we need both - you need an exclusive scan to know the individual lane's position in the single atomic operation, but you also need the reduction (sum) to know how much we are adding in the atomic operation in the first place. So I do an exclusive scan to get our position, then add on our own lane's value to get the inclusive scan, and readlane 63 to get the total reduction across the wavefront.

> 2. Can't you do an exclusive scan with powers of 2 shifts like an inclusive scan, but just with the wf_sr1 on the front? (Although I think that gives the wrong answer due to (1)).

No you need to do the shift 1,2,3 - can see the reasoning why in here https://gpuopen.com/amd-gcn-assembly-cross-lane-operations/

> 3. Isn't the only thing wrong with this code before this fix that you forgot to put the bank masks on steps 2, 3 and 4? (Although you're correct to remove the unnecessary intermediate wwm intrinsic calls.)

There was two bugs:

1. The exclusive scan component returned the wrong result in some cases because the shift-by-3 wasn't there.
2. We need to do a ballot instead of read_register because read_register was sometimes being misidentified as requiring WWM and thus giving us garbage results for the atomc load.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D57737/new/

https://reviews.llvm.org/D57737