[PATCH] AARCH64_BE load/store rules fix for ARM ABI
Jiangning Liu
liujiangning1 at gmail.com
Wed Mar 19 01:00:16 PDT 2014
Hi Tim,
I talked to the GCC guys, and now I think I understand what the key points
are.
We need to settle both the in-memory and the in-register layout.
* In-memory consistency: the argument is that a short vector must always
behave like an array, no matter whether it comes from the auto-vectorizer or
from manually written short-vector code. So for "int16x8_t v" we have the
*assertion:* v[i] always accesses the data at "&v + i*sizeof(int16_t)".
Now I think this requirement is acceptable to me, and that means "res =
*(float *)vec" in your case should be valid, even though the type cast in C
is unsafe.
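
To make the assertion concrete, here is a minimal sketch I wrote for this
mail (only an illustration, assuming an AArch64 toolchain with arm_neon.h);
it simply restates the assertion as a runtime check and should pass on both
endiannesses:

  #include <arm_neon.h>
  #include <assert.h>
  #include <string.h>

  int main(void) {
    int16x8_t v = {0, 1, 2, 3, 4, 5, 6, 7};
    int16_t mem[8];
    /* Copy the in-memory image of the short vector into a plain array. */
    memcpy(mem, &v, sizeof(v));
    for (int i = 0; i < 8; ++i)
      /* The assertion above: element i lives at &v + i*sizeof(int16_t). */
      assert(mem[i] == (int16_t)i);
    return 0;
  }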
* In-register consistency: we have two solutions.
1) Keep the same layout for big-endian and little-endian. Then we would not
need to change the element index for big-endian.
With this solution, for big-endian we need to use either "ldr/str + rev" or
ld1/st1. Choosing between them depends on alignment, strict mode, addressing
mode, cycle counts, etc.
2) Keep different layouts for big-endian and little-endian. Then we have to
change the element index of vector element accesses for one of the
endiannesses (usually big-endian), because the low-index element ends up in
the high part of the big-endian register (a small index-remapping sketch
follows right after this list).
With this solution, for big-endian we need to use either "ld1/st1 + rev" or
ldr/str. Again, choosing between them depends on alignment, strict mode,
addressing mode, cycle counts, etc.
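
To illustrate what the index change in solution 2) means, here is a tiny
sketch (purely hypothetical helper code of my own, not anything taken from
the backend): for a v4i16 value loaded with ldr on big-endian, a source-level
access to element 0 corresponds to register lane 3, and so on.

  /* Hypothetical illustration only: with the ldr/str convention on
     big-endian, the element at the lowest address ends up in the highest
     register lane, so the compiler must remap the element index. */
  static inline unsigned remap_lane_for_be(unsigned lane, unsigned num_lanes) {
    return num_lanes - 1u - lane;  /* e.g. lane 0 of v4i16 -> register lane 3 */
  }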
In theory, both solutions 1) and 2) can work, but we have to choose solution
2), because:
A) For some miscellaneous reasons on AArch64, ldr/str is better than ld1/st1.
B) GCC intends to use solution 2), and we need to link against binaries
generated with GCC.
C) AAPCS64 requires vectors to be loaded/stored as if by ldr/str.
And with solution 2) we also prefer ldr/str whenever possible; we only want
to use "ld1/st1 + rev" for unaligned accesses in strict alignment mode.
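
Just to spell out that preference, something like the following decision is
what I have in mind (again only a hand-written sketch, not real backend
code); the only case that needs "ld1/st1 + rev" is an access without
total-size alignment while strict alignment mode is in effect:

  /* Hypothetical sketch of the load/store selection policy described above. */
  enum VecAccessKind { USE_LDR_STR, USE_LD1_ST1_PLUS_REV };

  static enum VecAccessKind pick_vector_access(int total_size_aligned,
                                               int strict_align_mode) {
    if (strict_align_mode && !total_size_aligned)
      return USE_LD1_ST1_PLUS_REV;  /* unaligned access in strict mode */
    return USE_LDR_STR;             /* preferred in every other case */
  }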
P.S.:
1) I think AAPCS64 is unclear about requiring ldr/str for short vectors,
because a short vector is defined with a "total-size alignment" constraint.
For short vectors with only element alignment, AAPCS64 doesn't clearly
describe the load/store behaviour.
2) It's also possible to propagate the rev instructions, so that we can
avoid changing the element index in scenario 2.
3) AAPCS64 lacks a description of the in-memory and in-register data layouts
for short vectors, and they should be part of the ABI.
Thanks,
-Jiangning
2014-03-18 18:21 GMT+08:00 Jiangning Liu <liujiangning1 at gmail.com>:
> Hi Tim,
>
>> > %induction8 = add <8 x i32> %broadcast.splat7, <i32 0, i32 1, i32 2,
>> >   i32 3, i32 4, i32 5, i32 6, i32 7>
>>
>> > If you want to generate "rev16 + st1", another programmer p2 would see
>> > FAIL.
>>
>> The constant vector in the "add" line is probably not what you were
>> expecting. Since it's basically saying "put 0 in element 0, 1 in
>> element 1, ..." its lanes are reversed like everything else; you might
>> write:
>> movz w0, #7
>> ins v0.h[0], w0
>> [...]
>>
>>
> Well, you are using a different in-register layout for big-endian and
> little-endian. I don't think this is correct. I think we have previously
> been arguing that we should have consistent in-register values for
> little-endian and big-endian.
>
> For big-endian, in the literal pool, we should store that constant vector
> in reversed order; then we wouldn't have any issue using ldr.
>
>
>> >> let Requires = [IsBE] in
>> >> def : Pat<(v4i16 (unaligned_load addr:$Rn)), (REV16 (LD1 addr:$Rn))>;
>> >
>> > Well, what is unaligned_load here? I'm confused again! Isn't it just a
>> > check for the lack of total-size alignment, as I described in my
>> > proposal?
>>
>> Pretty much, yes.
>>
>
> OK. I'm happy you agree with this.
>
>
>>
>> > Instead we want to simply use the pattern below,
>> >
>> > let Requires = [IsBE] in
>> > def : Pat<(v4i16 (unaligned_load addr:$Rn)), (LD1 addr:$Rn)>;
>>
>> Yep. In that world we would have to write
>>
>> let Requires = [IsBE] in
>> def : Pat<(v4i16 (aligned_load addr:$Rn)), (REV16 (LDR addr:$Rn))>;
>>
>> when we wanted the efficiency gains of LDR.
>>
>
> OK. If you say "ldr + rev16" has better performance than ld1, then I'm OK
> with it. But it's not a correctness issue, only a performance one.
>
>
>>
>> > Can you also give me a real case in C code, and show me the IR, where
>> > we can't simply use ld1/st1 without rev16?
>>
>> In what way? When mixed with ldr? It's basically that branching one I
>> gave earlier in IR form:
>>
>> float32_t foo(int test, float32_t *__restrict arr,
>>               float32x4_t *__restrict vec) {
>>   float res;
>>   if (test) {
>>     arr[0] += arr[0];
>>     arr[1] += arr[1];
>>     arr[2] += arr[2];
>>     arr[3] += arr[3];
>>     res = arr[0];
>>   } else {
>>     *vec += *vec;
>>     res = *(float *)vec;
>>
>
> This is invalid C code. vec is a pointer, and the data layout pointed to
> by vec is different for big-endian and little-endian. If you cast it to
> float*, you can only guarantee that little-endian works. The programmer
> must know about the data layout change, because these are entirely
> different data types. The programmer should write something like:
>
> res = vec[0];
>
> We already know that type casting in C is sometimes unsafe. I think this
> is quite natural common sense.
>
>>
>> > In the file transferred in scenario 1, the char ordering on disk should
>> > be 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ..., 127 (let's ignore the
>> > byte-ordering difference inside each number for big-endian and
>> > little-endian, because we are not discussing that). In the file
>> > transferred in scenario 2, the char order on disk should be 7, 6, 5, 4,
>> > 3, 2, 1, 0, 15, 14, ..., 127, ..., 121, 120.
>>
>> Not according to what Clang generates presently. Clang treats vec[j]
>> as starting from the lowest addressed element too (and the
>> initialisation you perform).
>>
>
> I don't understand why you say clang matters here. The front-end doesn't
> care about the address at all, I think. Clang should only treat vec[0] as
> a lane 0 access. That's it.
>
>
>>
>> PPC is working on the semantics of IR as it stands and as I'm
>> describing. That's one of the things you seem to be proposing we
>> change, and it will be opposed.
>>
>>
> If this is true for PPC, then the answer should be yes. But I don't see
> clang and the middle-end doing anything bad that we can't accept for
> aarch64. I have tried PPC before, and I didn't see any lane change for
> big-endian in the LLVM IR, so I doubt this is true.
>
> Thanks,
> -Jiangning
>
>
--
Thanks,
-Jiangning