[PATCH] D79925: [SystemZ] Eliminate the need to create a zero vector by reusing the mask.

Mon May 18 08:34:41 PDT 2020

uweigand added a comment.

> It seems that now MachineCSE can remove a few VPERMs also. At least one case was because instead of using an undef source operand, the mask was used (it doesn't help any to replace an undef Ops[1] with Op2 on the last line of getGeneralPermuteNode(), which I first thought). The test case I have reduced (not included) shows that MCSE on trunk fails to remove the second vperm if even though the instructions are near identical. The only difference to the success with this patch is that instead of the reused mask, there is an IMPLICIT_DEF. It looks to me that this is a minor deficiency of MCSE (Two different vregs defined by IMPLICIT_DEF should not stop CSE, I would think).
> 
> isZeroOrUndefVector(): The check for the undef vector used to be beneficial for the zero_extend_vector_inreg patch, but the way it looks now (just using a single unpack), it is not needed for anymore (NFC). I think it could be removed or we can keep it here to get the few less vperms...
> 
> Two tests - one for each case handled. Not quite sure what happened with vector-constrained-fp-intrinsics.ll - I don't understand why v0 is no longer used, but it seems that this patch begins to change things when the second operand of the VPERM is undefined.

It seems to me these are related.  Currently, for a VPERM that only needs a single input, the second input is set to UNDEF, which gets lowered to an IMPLICIT_DEF at the MI level, which the register allocator just replaced by some (random?) register.  In the test case, it chooses %v0, which has nothing to do with the VPERM, and is in fact already used for an unrelated operation.  This is not really a good idea, because that creates a register dependency on the instruction that is really unnecessary ...

I think for VPERMs that only need a single input, we should simply use the same register for both input slots.  That way, the contents are well-defined, and there are no spurious dependencies.  Also, it should fix the CSE problem, I think.  (Using the other input seems preferable to using the index vector ...)

If we do that, it follows that a completely undefined input should *not* be treated as a zero vector for this optimization.  (However, a not fully undefined vector that is a mixture of zero elements and undefined elements can be treated as zero vector.  Not sure if this makes much of a difference in practice.)

Note that fixing the handling of an UNDEF second input can (and probably should) be done as a separate patch, before this optimization is merged.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D79925/new/

https://reviews.llvm.org/D79925