[PATCH] D133850: [AArch64] Improve codegen for "trunc <4 x i64> to <4 x i8>" for all cases

Mingming Liu via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Sep 28 00:11:54 PDT 2022


mingmingl added a comment.

In D133850#3819802 <https://reviews.llvm.org/D133850#3819802>, @0x59616e wrote:

> In D133850#3815818 <https://reviews.llvm.org/D133850#3815818>, @mingmingl wrote:
>
>> In D133850#3814014 <https://reviews.llvm.org/D133850#3814014>, @0x59616e wrote:
>>
>>> bitcast is handled in this diff.
>>>
>>> To handle bitcast, we need this observation: `uzp1` is just an `xtn` that operates on two registers simultaneously.
>>>
>>> For example, given the following register with type `v2i64`:
>>>
>>> LSB ______________ MSB
>>> | x0 x1 | x2 x3 |
>>>
>>> Applying `xtn` to it, we get:
>>>
>>> | x0 | x2 |
>>>
>>> This is equivalent to bitcasting it to `v4i32` and then applying `uzp1`:
>>>
>>> | x0 | x1 | x2 | x3 |
>>>
>>> === uzp1 ===>
>>>
>>> | x0 | x2 | <value from other register> |
>>>
>>> With this observation we can transform `xtn` into `uzp1`, and vice versa.
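>>>
>>> A minimal sketch of the equivalence in C with ACLE NEON intrinsics (assumes little endian; the function names are mine, just for illustration):
>>>
>>>   #include <arm_neon.h>
>>>
>>>   // Narrow two v2i64 values the "xtn way": keep the low 32 bits of
>>>   // each 64-bit lane, one register at a time.
>>>   uint32x4_t narrow_with_xtn(uint64x2_t lo, uint64x2_t hi) {
>>>     return vcombine_u32(vmovn_u64(lo), vmovn_u64(hi));
>>>   }
>>>
>>>   // Same result the "uzp1 way": bitcast (a no-op on the register) to
>>>   // v4i32, then take the even-numbered lanes of both inputs at once.
>>>   uint32x4_t narrow_with_uzp1(uint64x2_t lo, uint64x2_t hi) {
>>>     return vuzp1q_u32(vreinterpretq_u32_u64(lo),
>>>                       vreinterpretq_u32_u64(hi));
>>>   }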
>>>
>>> This observation only holds on little endian targets. Big endian targets have a problem: the `uzp1` cannot be replaced by `xtn`, since `uzp1` behaves differently on little endian and big endian. To illustrate, take the following as an example:
>>>
>>> LSB ________________ MSB
>>> | x0 | x1 | x2 | x3 |
>>>
>>> On little endian, `uzp1` grabs `x0` and `x2`, which is correct; on big endian, it grabs `x3` and `x1`, which doesn't match what I saw in the documentation. But since I'm new to AArch64, take my word with a pinch of salt. I observed this behavior in gdb; maybe there's an issue with the order in which gdb prints the values?
>>>
>>> Whatever the reason is, the execution result given by qemu just doesn't match, so I've disabled this on big endian targets temporarily until we find the crux.
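>>>
>>> For what it's worth, the ISA defines `uzp1` purely over register lanes, so its lane semantics shouldn't depend on endianness; a scalar model of the 4 x 32-bit case:
>>>
>>>   #include <stdint.h>
>>>
>>>   // uzp1 v0.4s, v1.4s, v2.4s concatenates the even-numbered lanes:
>>>   //   result = { op1[0], op1[2], op2[0], op2[2] }
>>>   void uzp1_4s(const uint32_t op1[4], const uint32_t op2[4],
>>>                uint32_t result[4]) {
>>>     result[0] = op1[0];
>>>     result[1] = op1[2];
>>>     result[2] = op2[0];
>>>     result[3] = op2[2];
>>>   }
>>>
>>> If that's right, the discrepancy would come from how values land in the lanes (loads, stores, bitcasts), not from `uzp1` itself.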
>>
>> **Take this with a grain of salt**
>>
>> My understanding is that `BITCAST` works in this context on little-endian because the element order and the byte order are consistent, so a `bitcast` doesn't change the relative order of bytes before and after the cast.
>>
>> Using LLVM IR `<2 x i64>` as an example: refer to element 0 as A0 and element 1 as A1, and to the higher (MSB) half of A0 as A0H and the lower half as A0L.
>>
>> For little-endian,
>>
>> 1. A0 is in lane 0 of the register and A1 is in lane 1 of the register, with the memory representation:
>>
>>   0x0 0x4 0x8 0xc
>>   A0L A0H A1L A1H
>>
>> 2. After `bitcast <2 x i64> to <4 x i32>` (which is conceptually a store followed by a load), the q0 register still holds `A0L A0H A1L A1H`, and element 0 of the LLVM IR `<4 x i32>` is `A0L`.
>>
>> For big-endian, the memory layout of <2 x i64> is
>>
>>   0x0 0x4 0x8 0xc
>>   A0H A0L A1H A1L
>>
>> So after a bitcast to `<4 x i32>`, the q0 register becomes `A0H A0L A1H A1L`; for the LLVM IR `<4 x i32>`, element 0 is then `A0H`, which changes the shuffle result.
>>
>> p.s. I use small functions like https://godbolt.org/z/63h9xja5e and https://gcc.godbolt.org/z/EsE3eWW71 to wrap my head around the mapping among {LLVM IR, register lanes, memory layout}.
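>>
>> A small C sketch of the "store followed by a load" view (the helper name is made up; it just shows which half ends up in element 0):
>>
>>   #include <stdint.h>
>>   #include <string.h>
>>
>>   // Bitcast {a0, a1} from <2 x i64> to <4 x i32> and return element 0:
>>   // A0L on little endian, A0H on big endian.
>>   uint32_t elem0_after_bitcast(uint64_t a0, uint64_t a1) {
>>     uint64_t v2i64[2] = {a0, a1};
>>     uint32_t v4i32[4];
>>     memcpy(v4i32, v2i64, sizeof v4i32);  // the store + load pair
>>     return v4i32[0];
>>   }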
>
> Just out of curiosity: this optimization involves a lot of bitcasts. Does the benefit of fewer `xtn`s outweigh the copious bitcast instructions, i.e. `rev(16|32|64)` and `ext`?
>
> If not, maybe we should implement this only on little endian?

I myself haven't thought deeply about how to fix this particular issue on big-endian; I mainly wanted to understand the mapping among {LLVM IR, register lanes, memory layout}, hence the paragraph above. That's partly why I suggested earlier that we fix little-endian only, for simplicity :-)


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D133850/new/

https://reviews.llvm.org/D133850


