[PATCH] D55314: AMDGPU: Turn on the DPP combiner by default

Connor Abbott via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Dec 6 06:08:22 PST 2018


cwabbott added a comment.

In D55314#1321222 <https://reviews.llvm.org/D55314#1321222>, @vpykhtin wrote:

> In D55314#1321167 <https://reviews.llvm.org/D55314#1321167>, @hakzsam wrote:
>
> > Hi,
> >
> > This change breaks most of the subgroups tests with RADV (ie. dEQP-VK.subgroups.arithmetic.*).
> >
> > Any reasons why you enabled it by default? Looks like it now triggers a new bug in the AMDGPU backend.
> >
> > Thanks!
>
>
> Hi,
>
> this is an awaited feature, and the other reason is to detect situations such as yours. I think it's better to turn it off and reproduce the failures: how can I do this?


You'll need to build Mesa with radv enabled against your LLVM, then run the Vulkan CTS against it, specifically dEQP-VK.subgroups.arithmetic.* . AMDVLK might also work, since it emits similar code. I would suggest using the new atomic optimizer code in LLVM to reproduce this, but it seems it's already broken: it uses llvm.amdgcn.mov.dpp instead of llvm.amdgcn.update.dpp, which means the lanes that aren't written are undefined, so if you get unlucky in register allocation it won't work, regardless of whether your pass runs or not.
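For reference, a sketch of the difference between the two intrinsics (operand order simplified, not the exact intrinsic signatures):

  %tmp = V_MOV_B32_dpp %a, ...        ; llvm.amdgcn.mov.dpp: unwritten lanes are undefined
  %tmp = V_MOV_B32_dpp %old, %a, ...  ; llvm.amdgcn.update.dpp: unwritten lanes keep %old

With update.dpp the frontend can pass the identity of the reduction operation as %old, which is what makes the reduction sequence correct for masked-off lanes.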

I looked at your pass, and there are a few problems with it. I think we should revert this commit until the pass has been properly rewritten.

- It doesn't check for the case where the DPP mov and its use are in different basic blocks with potentially different EXEC masks. This should be easy to fix.
- It doesn't handle the non-full row_mask or bank_mask case. If either one doesn't have the default value of 0xf, then some lanes are not written, regardless of what bound_ctrl is set to. So if such a mov feeds into, say, an addition, then the old value needs to be added back for the disabled lanes, which blindly copying over the row_mask and bank_mask and setting old to undef won't do.
- This is fundamentally not doing what frontends implementing reductions actually want. In practice, when you're using DPP to implement a wavefront reduction correctly, every DPP operation looks like this:

  %tmp = V_MOV_B32_dpp identity, %a, ... ; from llvm.amdgcn.update.dpp
  %out = op %tmp, %b

where "identity" is the identity element for the operation (0 for add, -1 for unsigned min, etc.), which is the old value for the move. This can be optimized to

  %out = op_dpp %b, %a, %b, ... ; old = %b, DPP'd source = %a, other source = %b

regardless of what the DPP flags are, as long as bound_ctrl:0 is not set: on any lane the DPP mov doesn't write, op(identity, %b) is just %b, so making %b the old value of the combined op gives the same result. This is what we should actually be optimizing.
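To make the equivalence concrete, here is a small lane-level model in Python (all names here are illustrative, not LLVM APIs) showing that the two-instruction sequence with old = identity matches the single DPP op with old = %b, even when some lanes are masked off:

```python
# Hypothetical lane-level model of the transformation described above.
# dpp_mov, two_step, and combined are made-up names for illustration.

IDENTITY = 0  # identity element for integer add

def dpp_mov(old, src, perm, write_mask):
    """Model of V_MOV_B32_dpp: lane i reads src[perm[i]] if written,
    otherwise keeps old[i] (the update.dpp 'old' operand)."""
    return [src[perm[i]] if write_mask[i] else old[i]
            for i in range(len(src))]

def two_step(a, b, perm, write_mask):
    """%tmp = V_MOV_B32_dpp identity, %a, ... ; %out = add %tmp, %b"""
    tmp = dpp_mov([IDENTITY] * len(a), a, perm, write_mask)
    return [t + x for t, x in zip(tmp, b)]

def combined(a, b, perm, write_mask):
    """%out = add_dpp %b, %a, %b, ...: enabled lanes compute
    a[perm[i]] + b[i]; disabled lanes take the old value, %b."""
    return [a[perm[i]] + b[i] if write_mask[i] else b[i]
            for i in range(len(a))]

# 8-lane example: a row_shr:1-style permutation with lanes 0 and 5
# masked off, as a non-full row_mask/bank_mask would do.
a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10, 20, 30, 40, 50, 60, 70, 80]
perm = [0, 0, 1, 2, 3, 4, 5, 6]
write_mask = [False, True, True, True, True, False, True, True]

assert two_step(a, b, perm, write_mask) == combined(a, b, perm, write_mask)
```

The equivalence holds because op(identity, b) == b on the masked-off lanes; it would break if old were undef instead of the identity, which is exactly the llvm.amdgcn.mov.dpp problem mentioned above.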


Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D55314/new/

https://reviews.llvm.org/D55314




