[PATCH] AARCH64_BE load/store rules fix for ARM ABI

Tim Northover t.p.northover at gmail.com
Thu Mar 20 05:47:31 PDT 2014


Hi Jiangning,

> I talked to GCC guys, and now I think I can understand what the key points
> are.

Yep, I agree completely with everything below, so I think we do
finally understand each other. Sorry for the dodgy explanations
earlier.

Cheers.

Tim.

> We need rules for both the in-memory and the in-register layout.
>
> * In-memory consistency: The argument is that a short vector must always be
> laid out like an array, no matter whether it comes from the auto-vectorizer
> or from hand-written vector code. That means for "int16x8_t v",
>
> assertion: v[i] always accesses the data at "&v + i*sizeof(int16_t)".
>
> Now I think this requirement is acceptable to me, and it means "res =
> *(float *)vec" in your case should be valid, even though the typecast in C
> is unsafe.
> * In-register consistency: We have two solutions,
>
> 1) Keep the same layout for big-endian and little-endian. We would not need
> to change element indices for big-endian.
>
> With this solution, for big-endian, we need to use either "ldr/str + rev"
> or ld1/st1. Choosing between them depends on alignment, strict-alignment
> mode, addressing mode, cycle counts, etc.
>
> 2) If we keep different layouts for big-endian and little-endian, we would
> have to change the element index of vector element accesses for one of the
> endiannesses (usually big-endian), because the low-index element ends up at
> the most-significant end of the register on big-endian.
>
> With this solution, for big-endian, we need to use either "ld1/st1 + rev"
> or ldr/str. Choosing between them depends on alignment, strict-alignment
> mode, addressing mode, cycle counts, etc.
>
> In theory, both solution 1) and solution 2) can work, but we have to choose
> solution 2), because
>
> A) For various reasons, ldr/str is better than ld1/st1 on AArch64.
> B) GCC intends to use solution 2), and we need to link against binaries
> generated by GCC.
> C) AAPCS64 requires vectors to be loaded/stored as if by ldr/str.
>
> And with solution 2), we also prefer ldr/str whenever possible. We only
> want to use "ld1/st1 + rev" for unaligned accesses in strict-alignment
> mode.
>
> P.S:
>
> 1) I think AAPCS64 is unclear about requiring ldr/str for short vectors,
> because a short vector is defined with the constraint of total-size
> alignment. For a short vector with only element alignment, AAPCS64 doesn't
> clearly describe the load/store behaviour.
> 2) It's also possible to propagate the rev instruction, and then we could
> avoid changing element indices in scenario 2.
> 3) AAPCS64 lacks a description of the in-memory and in-register data
> layouts for short vectors, and they should be part of the ABI.
>
> Thanks,
> -Jiangning
>
>
>
> 2014-03-18 18:21 GMT+08:00 Jiangning Liu <liujiangning1 at gmail.com>:
>
>> Hi Tim,
>>>
>>> >   %induction8 = add <8 x i32> %broadcast.splat7, <i32 0, i32 1, i32 2,
>>> > i32 3, i32 4, i32 5, i32 6, i32 7>
>>>
>>> > If you want to generate "rev16+ st1", another programmer p2 would see
>>> > FAIL.
>>>
>>> The constant vector in the "add" line is probably not what you were
>>> expecting. Since it's basically saying "put 0 in element 0, 1 in
>>> element 1, ..." its lanes are reversed like everything else; you might
>>> write:
>>>     movz w0, #7
>>>     ins v0.h[0], w0
>>>     [...]
>>>
>>
>> Well, you are using a different in-register layout for big-endian and
>> little-endian. I don't think this is correct. I think we have previously
>> been arguing that we should have consistent in-register values for
>> little-endian and big-endian.
>>
>> For big-endian, we should store that constant vector in the literal pool
>> in reversed order; then we wouldn't have any issue using ldr.
>>
>>>
>>> >> let Requires = [IsBE] in
>>> >> def : Pat<(v4i16 (unaligned_load addr:$Rn)), (REV16 (LD1 addr:$Rn))>;
>>> >
>>> > Well, what is unaligned_load here? I'm confused again! Isn't it just a
>>> > check that total-size alignment does not hold, as I described in my
>>> > proposal?
>>>
>>> Pretty much, yes.
>>
>>
>> OK. I'm happy you agree with this.
>>
>>>
>>>
>>> > Instead we want to simply use the pattern below,
>>> >
>>> > let Requires = [IsBE] in
>>> > def : Pat<(v4i16 (unaligned_load addr:$Rn)), (LD1 addr:$Rn)>;
>>>
>>> Yep. In that world we would have to write
>>>
>>> let Requires = [IsBE] in
>>> def : Pat<(v4i16 (aligned_load addr:$Rn)), (REV16 (LDR addr:$Rn))>;
>>>
>>> when we wanted the efficiency gains of LDR.
>>
>>
>> OK. If you say "ldr + rev16" performs better than ld1, then I'm OK. But
>> that's a performance issue, not a correctness issue.
>>
>>>
>>>
>>> > Can you also give me a real case in C code, and show me the IR where
>>> > we can't simply use ld1/st1 without rev16?
>>>
>>> In what way? When mixed with ldr? It's basically the branching example I
>>> gave earlier, in IR form:
>>>
>>> float32_t foo(int test, float32_t *__restrict arr, float32x4_t *
>>> __restrict vec) {
>>>   float res;
>>>   if (test) {
>>>     arr[0] += arr[0];
>>>     arr[1] += arr[1];
>>>     arr[2] += arr[2];
>>>     arr[3] += arr[3];
>>>     res = arr[0];
>>>   } else {
>>>     *vec += *vec;
>>>     res = *(float *)vec;
>>
>>
>> This is invalid C code. vec is a pointer, and the data layout it points
>> to differs between big-endian and little-endian. If you cast it to
>> float *, you can only guarantee that little-endian works. The programmer
>> must be aware of the layout change, because these are entirely different
>> data types. The programmer should write
>>
>> res = vec[0];
>>
>> We have long known that typecasting in C is sometimes unsafe. I think
>> that is quite natural common sense.
>>>
>>>
>>> > In the file transferred in scenario 1, the char ordering on disk
>>> > should be 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ..., 127 (let's ignore the
>>> > byte ordering within each number on big-endian versus little-endian,
>>> > because we are not discussing that). In the file transferred in
>>> > scenario 2, the char order on disk should be 7, 6, 5, 4, 3, 2, 1, 0,
>>> > 15, 14, ..., 127, ..., 121, 120.
>>>
>>> Not according to what Clang generates at present. Clang also treats
>>> vec[j] as starting from the lowest-addressed element (as does the
>>> initialisation you perform).
>>
>>
>> I don't understand why you say Clang cares about this. The front-end
>> doesn't care about addresses at all, I think. Clang should only treat
>> vec[0] as meaning a lane-0 access. That's it.
>>
>>>
>>>
>>> PPC works on the semantics of the IR as it stands, and as I'm
>>> describing. That's one of the things you seem to be proposing we change,
>>> and it will be opposed.
>>>
>>
>> If this is true for PPC, then the answer should be yes. But I don't see
>> Clang or the middle-end doing anything we can't accept for AArch64. I
>> have tried PPC, and I don't see any lane change for big-endian in the
>> LLVM IR, so I doubt this is true.
>>
>> Thanks,
>> -Jiangning
>>
>
>
>
> --
> Thanks,
> -Jiangning
