[llvm] r233689 - [SystemZ] Use POPCNT instruction on z196

Tue Mar 31 07:12:48 PDT 2015

On 31 March 2015 at 13:56, Ulrich Weigand <ulrich.weigand at de.ibm.com> wrote:
>
> Author: uweigand
> Date: Tue Mar 31 07:56:33 2015
> New Revision: 233689
>
> URL: http://llvm.org/viewvc/llvm-project?rev=233689&view=rev
> Log:
> [SystemZ] Use POPCNT instruction on z196
>
> We already exploit a number of instructions specific to z196,
> but not yet POPCNT.  Add support for the population-count
> facility, MC support for the POPCNT instruction, CodeGen
> support for using POPCNT, and implement the getPopcntSupport
> TargetTransformInfo hook.
>
>
> Added:
>     llvm/trunk/test/CodeGen/SystemZ/ctpop-01.ll
> Modified:
>     llvm/trunk/lib/Target/SystemZ/SystemZISelLowering.cpp
>     llvm/trunk/lib/Target/SystemZ/SystemZISelLowering.h
>     llvm/trunk/lib/Target/SystemZ/SystemZInstrInfo.td
>     llvm/trunk/lib/Target/SystemZ/SystemZOperators.td
>     llvm/trunk/lib/Target/SystemZ/SystemZProcessors.td
>     llvm/trunk/lib/Target/SystemZ/SystemZSubtarget.cpp
>     llvm/trunk/lib/Target/SystemZ/SystemZSubtarget.h
>     llvm/trunk/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
>     llvm/trunk/lib/Target/SystemZ/SystemZTargetTransformInfo.h
>     llvm/trunk/test/MC/Disassembler/SystemZ/insns.txt
>     llvm/trunk/test/MC/SystemZ/insn-bad.s
>     llvm/trunk/test/MC/SystemZ/insn-good-z196.s
>
> Modified: llvm/trunk/lib/Target/SystemZ/SystemZISelLowering.cpp
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/SystemZ/SystemZISelLowering.cpp?rev=233689&r1=233688&r2=233689&view=diff
> ==============================================================================
> --- llvm/trunk/lib/Target/SystemZ/SystemZISelLowering.cpp (original)
> +++ llvm/trunk/lib/Target/SystemZ/SystemZISelLowering.cpp Tue Mar 31 07:56:33 2015
> @@ -163,8 +163,13 @@ SystemZTargetLowering::SystemZTargetLowe
>        // available, or if the operand is constant.
>        setOperationAction(ISD::ATOMIC_LOAD_SUB, VT, Custom);
>
> +      // Use POPCNT on z196 and above.
> +      if (Subtarget.hasPopulationCount())
> +        setOperationAction(ISD::CTPOP, VT, Custom);
> +      else
> +        setOperationAction(ISD::CTPOP, VT, Expand);
> +
>        // No special instructions for these.
> -      setOperationAction(ISD::CTPOP,           VT, Expand);
>        setOperationAction(ISD::CTTZ,            VT, Expand);
>        setOperationAction(ISD::CTTZ_ZERO_UNDEF, VT, Expand);
>        setOperationAction(ISD::CTLZ_ZERO_UNDEF, VT, Expand);
> @@ -2304,6 +2309,45 @@ SDValue SystemZTargetLowering::lowerOR(S
>                                     MVT::i64, HighOp, Low32);
>  }
>
> +SDValue SystemZTargetLowering::lowerCTPOP(SDValue Op,
> +                                          SelectionDAG &DAG) const {
> +  EVT VT = Op.getValueType();
> +  int64_t OrigBitSize = VT.getSizeInBits();
> +  SDLoc DL(Op);
> +
> +  // Get the known-zero mask for the operand.
> +  Op = Op.getOperand(0);
> +  APInt KnownZero, KnownOne;
> +  DAG.computeKnownBits(Op, KnownZero, KnownOne);
> +  uint64_t Mask = ~KnownZero.getZExtValue();
> +
> +  // Skip known-zero high parts of the operand.
> +  int64_t BitSize = OrigBitSize;
> +  while ((Mask & ((((uint64_t)1 << (BitSize / 2)) - 1) << (BitSize / 2))) == 0)
> +    BitSize = BitSize / 2;

This will loop forever if all bits are known to be zero, won't it?

To avoid looping, how about:

unsigned NumSignificantBits = (~KnownZero).getActiveBits();
unsigned BitSize = 1U << Log2_32_Ceil(NumSignificantBits);

(But you still need to guard against all bits being known zero, in
which case NumSignificantBits is 0.)
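
A rough standalone model of both approaches, on a plain uint64_t mask rather than an APInt. The function names, the step cap and the "return 0" conventions are made up for illustration, and activeBits is a hand-written stand-in for getActiveBits, not the LLVM call. In this model the halving loop also fails to terminate when only bit 0 may be set, since BitSize reaches 0 and stays there.

```cpp
#include <cassert>
#include <cstdint>

// Model of the patch's halving loop on a plain 64-bit mask. A step cap
// (an addition for this sketch) makes a non-terminating case observable:
// the function returns 0 when the cap is hit, otherwise the BitSize the
// loop settles on.
inline unsigned halvingLoopBitSize(uint64_t mask, unsigned origBitSize = 64,
                                   unsigned maxSteps = 100) {
  assert(origBitSize <= 64 && "sketch only models types up to 64 bits");
  unsigned bitSize = origBitSize;
  for (unsigned step = 0; step < maxSteps; ++step) {
    unsigned half = bitSize / 2;
    uint64_t highHalf = ((uint64_t(1) << half) - 1) << half;
    if ((mask & highHalf) != 0)
      return bitSize;
    bitSize = half; // Reaches 0 and stays there if no bit above bit 0 is set.
  }
  return 0; // Cap hit: the uncapped loop would not have terminated.
}

// Hand-written stand-in for getActiveBits: number of bits needed to
// represent mask (0 for mask == 0).
inline unsigned activeBits(uint64_t mask) {
  unsigned n = 0;
  while (mask != 0) {
    ++n;
    mask >>= 1;
  }
  return n;
}

// Closed-form alternative: smallest power of two >= activeBits(mask).
// The all-zero case is handled explicitly; returning 0 is a hypothetical
// convention meaning "nothing to count".
inline unsigned closedFormBitSize(uint64_t mask) {
  unsigned bits = activeBits(mask);
  if (bits == 0)
    return 0;
  unsigned bitSize = 1;
  while (bitSize < bits)
    bitSize <<= 1;
  return bitSize;
}
```

For masks with any bit above bit 0 set, the two functions should agree; they differ only in the cases where the halving loop never finishes.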

> +
> +  // The POPCNT instruction counts the number of bits in each byte.
> +  Op = DAG.getNode(ISD::ANY_EXTEND, DL, MVT::i64, Op);
> +  Op = DAG.getNode(SystemZISD::POPCNT, DL, MVT::i64, Op);
> +  Op = DAG.getNode(ISD::TRUNCATE, DL, VT, Op);
> +
> +  // Add up per-byte counts in a binary tree.  All bits of Op at
> +  // position larger than BitSize remain zero throughout.
> +  for (int64_t I = BitSize / 2; I >= 8; I = I / 2) {
> +    SDValue Tmp = DAG.getNode(ISD::SHL, DL, VT, Op, DAG.getConstant(I, VT));
> +    if (BitSize != OrigBitSize)
> +      Tmp = DAG.getNode(ISD::AND, DL, VT, Tmp,
> +                        DAG.getConstant(((uint64_t)1 << BitSize) - 1, VT));
> +    Op = DAG.getNode(ISD::ADD, DL, VT, Op, Tmp);
> +  }
> +
> +  // Extract overall result from high byte.
> +  if (BitSize > 8)
> +    Op = DAG.getNode(ISD::SRL, DL, VT, Op, DAG.getConstant(BitSize - 8, VT));

For a 64-bit value where the high 32 bits are known to be zero, you'll generate:

Op = POPCNT(Op);
Tmp = Op << 16;
Tmp &= 0xFFFFFFFF;
Op += Tmp;
Tmp = Op << 8;
Tmp &= 0xFFFFFFFF;
Op += Tmp;
Op >>= 24;

Instead of doing an AND at every loop iteration, how about generating:

Op = POPCNT(Op);
Tmp = Op >> 16;
Op += Tmp;
Tmp = Op >> 8;
Op += Tmp;
Op &= 0xFF;

I.e. SRL instead of SHL inside the loop, and AND instead of SRL to
extract the overall result.
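
To check that the two schemes give the same answer, here is a small standalone model on uint64_t. It is only a sketch: perBytePopcount is a software stand-in for the per-byte counting the patch comment attributes to POPCNT, and the function names are invented for this example. Both functions assume the input has no bits set at or above bitSize.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Software stand-in for POPCNT as described in the patch comment: each
// byte of the result holds the bit count of the matching input byte.
inline uint64_t perBytePopcount(uint64_t x) {
  uint64_t result = 0;
  for (unsigned byte = 0; byte < 8; ++byte) {
    uint64_t count = std::bitset<8>((x >> (8 * byte)) & 0xFF).count();
    result |= count << (8 * byte);
  }
  return result;
}

// Scheme from the patch: shift left, mask each step when the significant
// width is narrower than the type, then read the total from the high byte.
inline uint64_t sumViaShl(uint64_t x, unsigned bitSize,
                          unsigned origBitSize = 64) {
  assert(bitSize >= 8 && bitSize <= origBitSize && origBitSize <= 64);
  assert(bitSize == 64 || (x >> bitSize) == 0);
  uint64_t op = perBytePopcount(x);
  for (unsigned i = bitSize / 2; i >= 8; i /= 2) {
    uint64_t tmp = op << i;
    if (bitSize != origBitSize)
      tmp &= (uint64_t(1) << bitSize) - 1;
    op += tmp;
  }
  if (bitSize > 8)
    op >>= bitSize - 8;
  return op;
}

// Suggested scheme: shift right inside the loop, then one AND at the end
// to read the total from the low byte.
inline uint64_t sumViaSrl(uint64_t x, unsigned bitSize) {
  assert(bitSize >= 8 && bitSize <= 64);
  assert(bitSize == 64 || (x >> bitSize) == 0);
  uint64_t op = perBytePopcount(x);
  for (unsigned i = bitSize / 2; i >= 8; i /= 2)
    op += op >> i;
  return op & 0xFF;
}
```

The low byte never receives a carry from another byte, and the total is at most 64, so masking with 0xFF at the end is enough in this model.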

Jay.