[PATCH] D116270: [AMDGPU] Enable divergence-driven XNOR selection
Alexander via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Jan 10 07:37:13 PST 2022
alex-t added a comment.
In D116270#3230834 <https://reviews.llvm.org/D116270#3230834>, @foad wrote:
> In D116270#3209307 <https://reviews.llvm.org/D116270#3209307>, @alex-t wrote:
>
>> This looks like a regression in xnor.ll :
>>
>>   s_not_b32 s0, s0            v_not_b32_e32 v0, v0
>>   v_xor_b32_e32 v0, s0, v0    v_xor_b32_e32 v0, s4, v0
>>
>> but it is not really. All the nodes in the example are divergent, and the divergent (xor x, -1) is selected to V_NOT_B32 since https://reviews.llvm.org/D115884 was committed.
>> S_NOT_B32 appears on the left because of the custom optimization that converts S_XNOR_B32 back to NOT (XOR) for targets that have no V_XNOR. This optimization relies on the NOT operand being an SGPR and on V_XOR_B32_e32 accepting an SGPR as its first source operand.
>> I am not sure it is always safe: VALU instruction execution is controlled by the EXEC mask, but SALU execution is not.
>
> To repeat what I have already said elsewhere: this is **not** a correctness issue. This is just an optimization, where you can choose to calculate either `~s0 ^ v0` or `s0 ^ ~v0` (or `~(s0 ^ v0)`) and get exactly the same result. The optimization is to prefer the first form, because the intermediate result `~s0` is uniform, so you can keep it in an sgpr and not waste vgprs and valu instructions.
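To make the equivalence concrete: all three forms produce identical bits for any 32-bit inputs. A minimal standalone check (plain C++; the values are arbitrary stand-ins for an SGPR and a VGPR lane value, not LLVM code):

  #include <cassert>
  #include <cstdint>

  int main() {
    std::uint32_t s0 = 0xDEADBEEFu; // stand-in for the uniform (SGPR) input
    std::uint32_t v0 = 0x12345678u; // stand-in for the divergent (VGPR) input

    std::uint32_t a = ~(s0 ^ v0); // xnor as currently selected: xor, then not
    std::uint32_t b = ~s0 ^ v0;   // preferred form: only the uniform half is negated
    std::uint32_t c = s0 ^ ~v0;   // negating the divergent half instead

    assert(a == b && b == c); // all three forms agree for every input
    return 0;
  }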
In D116270#3230743 <https://reviews.llvm.org/D116270#3230743>, @foad wrote:
>> SITargetLowering::reassociateScalarOps exists to fix instruction selection that was done the wrong way.
>
> No! It's not trying to fix anything, it's just trying to reassociate expressions to keep more of the intermediate results uniform, so we use fewer vgprs and fewer valu instructions. For example:
>
> // v1 = (v0 + s0) + s1
> v_add v1, v0, s0
> v_add v1, v1, s1
> -->
> // v1 = (s0 + s1) + v0 ; reassociated
> s_add s2, s0, s1
> v_add v1, s2, v0
>
> This is exactly the same kind of thing you need to do to restore the missed optimization in xnor.ll:
>
> // v1 = ~(s0 ^ v0)
> v_xor v1, s0, v0
> v_not v1, v1
> -->
> // v1 = ~s0 ^ v0
> s_not s1, s0
> v_xor v1, s1, v0
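The add reassociation is likewise a pure value-preserving rewrite, since 32-bit wrapping addition is associative and commutative; a minimal standalone check in the same spirit (plain C++, arbitrary values):

  #include <cassert>
  #include <cstdint>

  int main() {
    std::uint32_t v0 = 0x9E3779B9u; // divergent (VGPR) input
    std::uint32_t s0 = 0x0000FFFFu; // uniform (SGPR) inputs
    std::uint32_t s1 = 0x7FFF0001u;

    std::uint32_t lhs = (v0 + s0) + s1; // two VALU adds
    std::uint32_t rhs = (s0 + s1) + v0; // one SALU add feeding one VALU add

    assert(lhs == rhs); // unsigned wrap-around addition makes the rewrite exact
    return 0;
  }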
I am sorry. I really misled you by confusing SITargetLowering::reassociateScalarOps with SIInstrInfo::lowerScalarXnor, which @rampitec and I discussed recently.
SITargetLowering::reassociateScalarOps does nothing in our case because both the XOR and the NOT nodes are divergent.
The difference between what we had before my change to the selection and what we have now is the following:
Before:
We selected both the XOR and the NOT to S_XOR_B32 and S_NOT_B32, then optimized the whole pattern in SIInstrInfo::lowerScalarXnor and passed it to moveToVALU.
Now:
We select the divergent NOT to V_NOT_B32_e32 and the divergent XOR to V_XOR_B32_e64. The selection is correct, but we miss the opportunity to exploit the fact that even a divergent NOT may be selected to S_NOT_B32 without losing correctness.
During selection we cannot predict whether the divergent NOT's operands will be SGPRs, because a node is selected before its operands, so they have not been selected yet.
So we can only apply the optimization AFTER the selection is done.
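To illustrate the shape of such a post-selection fold, here is a toy model (plain C++; the names and data structures are made up for illustration and are not the actual LLVM code). Once every operand's register bank is known, a ~(a ^ b) pattern with one uniform operand can put the NOT on the SALU and keep only the XOR on the VALU:

  #include <cassert>
  #include <string>
  #include <vector>

  enum class Bank { SGPR, VGPR };

  // How to lower ~(a ^ b) on a target without V_XNOR, given the operand banks.
  // Returns the chosen instruction sequence, innermost first.
  std::vector<std::string> lowerXnor(Bank a, Bank b) {
    if (a == Bank::SGPR || b == Bank::SGPR) {
      // One operand is uniform: negate it on the SALU (safe, since SALU results
      // do not depend on EXEC) and feed the inverted SGPR into the VALU XOR.
      return {"S_NOT_B32", "V_XOR_B32_e32"};
    }
    // Both operands are divergent: keep the whole pattern on the VALU.
    return {"V_XOR_B32_e64", "V_NOT_B32_e32"};
  }

  int main() {
    // xnor.ll case: one SGPR and one VGPR operand -> s_not + v_xor.
    assert(lowerXnor(Bank::SGPR, Bank::VGPR) ==
           (std::vector<std::string>{"S_NOT_B32", "V_XOR_B32_e32"}));
    // Fully divergent case -> v_xor + v_not, as currently selected.
    assert(lowerXnor(Bank::VGPR, Bank::VGPR) ==
           (std::vector<std::string>{"V_XOR_B32_e64", "V_NOT_B32_e32"}));
    return 0;
  }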
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D116270/new/
https://reviews.llvm.org/D116270