[PATCH] D133850: [AArch64] Improve codegen for "trunc <4 x i64> to <4 x i8>" for all cases

Mon Sep 26 11:47:28 PDT 2022

mingmingl added a comment.

In D133850#3814014 <https://reviews.llvm.org/D133850#3814014>, @0x59616e wrote:

> bitcast is handled in this diff.
>
> To handle bitcast, we need this observation: `uzp1` is just a `xtn` that operates on two registers simultaneously.
>
> For example, given the following register with type `v2i64`:
>
> LSB______MSB
>
> | x0 x1 | x2 x3 |
> |
>
> Applying `xtn` on it we get:
>
> | x0 | x2 |
> |
>
> This is equivalent to bitcast it to `v4i32`, and then applying `uzp1` on it:
>
> | x0 | x1 | x2 | x3 |
> |
>
> === uzp1 ===>
>
> | x0 | x2 | <value from other register> |
> |
>
> We can transform `xtn` to `uzp1` by this observation, and vice versa.
>
> This observation only works on little endian target. Big endian target has a problem: the `uzp1` cannot be replaced by `xtn` since there is a discrepancy in the behavior of `uzp1` between the little endian and big endian. To illustrate, take the following for example:
>
> LSB________MSB
>
> | x0 | x1 | x2 | x3 |
> |
>
> On little endian, `uzp1` grabs `x0` and `x2`, which is right; on big endian, it grabs `x3` and `x1`, which doesn't match what I saw on the document. But, since I'm new to AArch64, take my word with a pinch of salt. This bevavior is observed on gdb, maybe there's issue in the order of the value printed by it ?
>
> Whatever the reason is, the execution result given by qemu just doesn't match. So I disable this on big endian target temporarily until we find the crux.

**Take this with a grain of salt **

My understanding is that, 'BITCAST' on little-endian works in this context since the element order and byte order is consistent that 'bitcast' won't change the relative order of bytes before and after the cast.

Use LLVM IR <2 x i64> as an example, we refer to element 0 as A0 and element 1 as A1, refer to the higher half (MSB) as A0H, and lower half as A0L

For little-endian,

1. A0 is in lane 0 of the register and A1 is in lane1 of the register, with memory representation as

  ``` 0x0 0x4  0x8  0xc A0L A0H A1L A1H ```

2. After `bitcast <2 x i64> to <4 x i32>` (which is a store followed by a load), the q0 register is still `A0L A0H A1L A1H` and LLVM IR <4 x i32> element 0 is `A0L`

For big-endian, the memory layout of <2 x i64> is

  0x0 0x4 0x8 0xc
  A0H A0L A1H A1L

So after a bitcast to `<4 x i32>`, q0 register becomes `A0H A0L A1H A1L` -> for LLVM IR <4 x i32>, element 0 is `A0H` -> this changes the shuffle result.

p.s. I use small functions like https://godbolt.org/z/63h9xja5e and https://gcc.godbolt.org/z/EsE3eWW71 to wrap my head around the mapping among {LLVM IR, register lanes, memory layout}.

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:17584-17585
+  // Only implemented on little-endian subtargets.
+  bool IsBigEndian = DAG.getDataLayout().isBigEndian();
+  if (!IsBigEndian && VT.isSimple() && VT.is64BitVector() &&
+      VT.getVectorNumElements() >= 2) {
----------------
nit: maybe use `DataLayout::isLittleEndian` [1] directly, even if they expands to the same code if `isLittleEndian` is inlined.

[1] https://github.com/llvm/llvm-project/blob/1f451a8bd6f32465b6ff26c30ba7fb6fc7e0e689/llvm/include/llvm/IR/DataLayout.h#L244

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:17589-17591
+    assert((SimpleVT == MVT::v4i16 || SimpleVT == MVT::v2i32 ||
+            SimpleVT == MVT::v8i8) &&
+           "Expect SimpleVT to be one of v4i16 or v2i32 or v8i8");
----------------
(I wrote `assert` in the previous work as well)

On a second thought, it's more future proof to bail out if type is not one of {v4i116, v2i32, v8i8} in this context, given that UZP1 SDNode definition doesn't require vector element type to be integer (i.e. v4f16 is ok for compilation)

Something like

```
Type val;
switch (SimpleVT) {
  case valid-case1:
    val = ...;
    break;
  case valid-case2;
    val = ...
    break;
  default:
    break;
}

if val is not set
  bail out
```

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:17594
+    auto getSourceOp = [](SDValue Operand) -> SDValue {
+      if (Operand.getOpcode() == ISD::TRUNCATE)
+        return Operand->getOperand(0);
----------------
nit pick: it's more idiomatic to call `getOpcode()` in the LLVM codebase (even if compiler should do CSE)

```
const unsigned Opcode = Operand.getOpcode();
if (Opcode == ISD::TRUNCATE) 
  ...
if (Opcode == ISD::BITCAST)
  ...
```

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:17630-17633
+    if (HalfElementSize(SourceOp0, UzpOp0) &&
+        HalfElementSize(SourceOp1, UzpOp1))
+      UzpResult = DAG.getNode(AArch64ISD::UZP1, DL, UzpOp0.getValueType(),
+                              UzpOp0, UzpOp1);
----------------
If we see through `BITCAST` for Op0 and Op1 respectively but UzpOp0 and UzpOp1 have different return type (that could be casted to the same type), the current approach will feed `AArch64ISD::UZP1` with two SDValue of different type(see test case in https://godbolt.org/z/TT4ErT5Mf)

On the other hand, `AArch64ISD::UZP1` expects two operands have the same value type ([SDTypeProfile](https://github.com/llvm/llvm-project/blob/4ba1f04465859fc56514bd4dca21066eea595294/llvm/lib/Target/AArch64/AArch64InstrInfo.td#L284))

For little-endian, BITCAST both operands to the type of return value here should work.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D133850/new/

https://reviews.llvm.org/D133850