[llvm] [AArch64] Generate rev16 for certain uses of __builtin_bswap16 (PR #105375)

Wed Sep 4 06:18:11 PDT 2024

================
@@ -22137,6 +22137,22 @@ static SDValue performExtendCombine(SDNode *N,
       N->getOperand(0)->getOpcode() == ISD::SETCC)
     return performSignExtendSetCCCombine(N, DCI, DAG);
 
+  // If we see (any_extend (bswap ...)) with bswap returning an i16, we know
+  // that the top half of the result register must be unused, due to the
+  // any_extend. This means that we can replace this pattern with (rev16
+  // (any_extend ...)). This saves a machine instruction compared to (lsr (rev
+  // ...)), which is what this pattern would otherwise be lowered to.
+  if (N->getOpcode() == ISD::ANY_EXTEND &&
+      N->getOperand(0).getOpcode() == ISD::BSWAP &&
+      N->getOperand(0).getValueType().isScalarInteger() &&
+      N->getOperand(0).getValueType().getFixedSizeInBits() == 16) {
----------------
adprasad-nvidia wrote:

To explain further: if the source code uses the result of `__builtin_bswap16` as an i32 or an i64, a `zext i16 to i32/i64` is inserted in the IR before instruction selection. This patch's optimisation doesn't trigger in that case, because in the DAG the `zext` becomes a `zero_extend`, not an `any_extend`.
The `zext` is omitted only when the result is used as an i16. In that case an `any_extend` from i16 to i32 seems to be inserted during the initial building of the DAG from IR, presumably to help with type legalisation later on. Since it is inserted only by the code generator, the extended type should always be i32.
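
To make the distinction concrete, here is a rough sketch in plain C++ that models the two instruction sequences on a 32-bit register. The helper names (`rev32`, `rev16`, `rev_then_lsr`, `store_swapped`, `swap_widened`) are made up for illustration and are not part of the patch, and I haven't checked exactly which DAG nodes each source function produces. The point is only that `rev16` agrees with `rev` + `lsr #16` in the low 16 bits and may differ in the high 16 bits, which is why the combine is only safe under `any_extend`.

```cpp
#include <cstdint>

// Model of `rev w`: reverse all four bytes of a 32-bit register.
constexpr std::uint32_t rev32(std::uint32_t x) {
  return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
         ((x << 8) & 0x00ff0000u) | (x << 24);
}

// Model of `rev16 w`: swap the two bytes inside each 16-bit halfword.
constexpr std::uint32_t rev16(std::uint32_t x) {
  return ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
}

// Model of the existing lowering: `rev` followed by `lsr #16`.
constexpr std::uint32_t rev_then_lsr(std::uint32_t x) {
  return rev32(x) >> 16;
}

// The low halves agree, so a consumer that only reads 16 bits
// (the any_extend case) cannot tell the two sequences apart.
static_assert((rev16(0xAABBCCDDu) & 0xffffu) == rev_then_lsr(0xAABBCCDDu),
              "low 16 bits match");
// The high halves differ, so a zero_extend consumer could observe it.
static_assert((rev16(0xAABBCCDDu) >> 16) != (rev_then_lsr(0xAABBCCDDu) >> 16),
              "high 16 bits need not match");

// Hypothetical source-level shapes of the two cases described above.
// Result used as an i16 (assumed to be the any_extend case):
void store_swapped(std::uint16_t *p, std::uint16_t v) {
  *p = __builtin_bswap16(v);
}

// Result used as an i32 (assumed to get a `zext`, so no rev16 from this patch):
std::uint32_t swap_widened(std::uint16_t v) {
  return __builtin_bswap16(v);
}
```

With `0xAABBCCDD` as input, the `rev16` model yields `0xBBAADDCC` and the `rev` + `lsr` model yields `0x0000DDCC`: the same low half, with different high halves.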

https://github.com/llvm/llvm-project/pull/105375