[PATCH] D133850: [AArch64] Improve codegen for "trunc <4 x i64> to <4 x i8>" for all cases

Sheng via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Sep 27 20:17:01 PDT 2022


0x59616e added a comment.

In D133850#3815818 <https://reviews.llvm.org/D133850#3815818>, @mingmingl wrote:

> In D133850#3814014 <https://reviews.llvm.org/D133850#3814014>, @0x59616e wrote:
>
>> bitcast is handled in this diff.
>>
>> To handle the bitcast, we need this observation: `uzp1` is just an `xtn` that operates on two registers simultaneously.
>>
>> For example, given the following register of type `v2i64` (LSB on the left, MSB on the right), where each 64-bit element is split into its low and high 32-bit halves:
>>
>> | x0 x1 | x2 x3 |
>>
>> Applying `xtn` to it we get:
>>
>> | x0 | x2 |
>>
>> This is equivalent to bitcasting it to `v4i32` and then applying `uzp1` to it:
>>
>> | x0 | x1 | x2 | x3 |
>>
>> === uzp1 ===>
>>
>> | x0 | x2 | <values from the other register> |
>>
>> We can transform `xtn` into `uzp1` using this observation, and vice versa.
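>>
>> As a rough illustration (a hand-written sketch using ACLE NEON intrinsics, not code from this patch), on a little-endian target the two helpers below should return the same value:
>>
>>   #include <arm_neon.h>
>>
>>   // Narrow each 64-bit lane to its low 32 bits: compiles to a single xtn.
>>   uint32x2_t narrow_xtn(uint64x2_t v) {
>>     return vmovn_u64(v);
>>   }
>>
>>   // Same result on little endian: reinterpret (bitcast) the register as
>>   // v4i32, keep the even lanes with uzp1, then take the low half.
>>   uint32x2_t narrow_uzp1(uint64x2_t v) {
>>     uint32x4_t as32 = vreinterpretq_u32_u64(v);  // no instruction emitted
>>     uint32x4_t even = vuzp1q_u32(as32, as32);    // uzp1 v.4s, v.4s, v.4s
>>     return vget_low_u32(even);
>>   }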
>>
>> This observation only works on little-endian targets. Big-endian targets have a problem: `uzp1` cannot be replaced by `xtn`, since the behavior of `uzp1` differs between little endian and big endian. To illustrate, take the following register (LSB on the left, MSB on the right):
>>
>> | x0 | x1 | x2 | x3 |
>>
>> On little endian, `uzp1` grabs `x0` and `x2`, which is right; on big endian, it grabs `x3` and `x1`, which doesn't match what I saw in the documentation. But since I'm new to AArch64, take my word with a pinch of salt. I observed this behavior in gdb, so maybe the issue is in the order in which gdb prints the values?
>>
>> Whatever the reason is, the execution result given by qemu just doesn't match either. So I am disabling this on big-endian targets temporarily until we find the crux.
>
> **Take this with a grain of salt**
>
> My understanding is that 'BITCAST' works in this context on little endian because the element order and the byte order are consistent, so a 'bitcast' doesn't change the relative order of bytes before and after the cast.
>
> Using LLVM IR `<2 x i64>` as an example, refer to element 0 as A0 and element 1 as A1, and refer to the higher half (MSB) of A0 as A0H and its lower half as A0L.
>
> For little-endian,
>
> 1. A0 is in lane 0 of the register and A1 is in lane 1, with the memory representation
>
>   0x0 0x4  0x8  0xc
>   A0L A0H A1L A1H
>
>
>
> 2. After `bitcast <2 x i64> to <4 x i32>` (which is a store followed by a load), the q0 register is still `A0L A0H A1L A1H`, and element 0 of the LLVM IR `<4 x i32>` is `A0L`
>
> For big-endian, the memory layout of <2 x i64> is
>
>   0x0 0x4 0x8 0xc
>   A0H A0L A1H A1L
>
> So after a bitcast to `<4 x i32>`, the q0 register becomes `A0H A0L A1H A1L` -> for the LLVM IR `<4 x i32>`, element 0 is `A0H` -> this changes the shuffle result.
>
> p.s. I use small functions like https://godbolt.org/z/63h9xja5e and https://gcc.godbolt.org/z/EsE3eWW71 to wrap my head around the mapping among {LLVM IR, register lanes, memory layout}.
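>
> A tiny self-contained sketch in the same spirit (hypothetical, not taken from those links): a bitcast behaves like a store followed by a reload, so which half of A0 becomes element 0 of the `<4 x i32>` view depends on the host byte order:
>
>   #include <stdint.h>
>   #include <stdio.h>
>   #include <string.h>
>
>   int main(void) {
>     /* A0 = 0x0000000A000000A0 (A0H = 0xA, A0L = 0xA0), A1 likewise. */
>     uint64_t v2i64[2] = {0x0000000A000000A0ull, 0x0000000B000000B0ull};
>     uint32_t v4i32[4];
>     memcpy(v4i32, v2i64, sizeof v4i32);  /* the "bitcast": store then reload */
>     /* Little endian prints "a0 a b0 b" (A0L is element 0);
>        big endian would print "a a0 b b0" (A0H is element 0). */
>     printf("%x %x %x %x\n", v4i32[0], v4i32[1], v4i32[2], v4i32[3]);
>     return 0;
>   }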

Just out of curiosity: this optimization involves a lot of bitcasts. Does the benefit of fewer `xtn`s outweigh the copious bitcast instructions, i.e. `rev(16|32|64)` and `ext`?

If not, maybe we can just implement this on little endian?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D133850/new/

https://reviews.llvm.org/D133850


