[PATCH] AARCH64_BE load/store rules fix for ARM ABI

Sat Mar 15 00:50:01 PDT 2014

Hi Jaiangning,

>>     define i16 @foo() {
>>       %val = load <4 x i16>* bitcast([4 x i16]* @a to <4 x i16>*)
>>       %elt = extractelement <4 x i16> %val, i32 0
>>       ret i16 %elt
>>     }
>>
>> This could then be assembled (assuming the alignment requirements are met)
>> to:
>>     foo:
>>         adrp, x0, a
>>         ldr d0, [x0, :lo12:a]
>>         umov w0, v0.h[3]
>>
>> Note the "[3]", rather than "[0]".
>>
>
> Can you explain how to get address &a[0] for big-endian? It's a C
> expression, and I think it should be simply
>
> Suppose the data in a[4] is,
> 0x00aabb00: 01
> 0x00aabb02: 02
> 0x00aabb04: 03
> 0x00aabb06: 04
>
> I think &a[0] is 0x00aabb00. If you agree, then

> int16_t *p = &a[0]; p += 3; return *p;
>
> What the return value is for big-endian? I think the value should be 04
> rather than 01.

I agree with all this.

> Following you assembly code you gave, the result would be
> 01. I don't think it make sense, and isn't a disaster for programmer.

But you've changed the source, of course the assembly I gave isn't
valid any more. The IR isn't either. For that expression, the LLVM IR
would be (remember, extractelement in IR form always starts at the
lowest addressed one):

     define i16 @foo() {
       %val = load <4 x i16>* bitcast([4 x i16]* @a to <4 x i16>*)
       %elt = extractelement <4 x i16> %val, i32 3
       ret i16 %elt
     }

You can check this far easily enough yourself: put your block into
that "lhs[N] += rhs[N]" example I gave to encourage the SLP-vectorizer
and see what Clang gives you.

Anyway, I advocate assembling this to:

foo:
    adrp x0, a
    ldr d0, [x0, :lo12:a]
    umov w0, v0.h[0]

> 2) For the case you gave,
>>
>>
>> define i16 @foo(i1 %tst, <4 x i16>* %arr_ptr, <4 x i16>* %vec_ptr) {
>>   br i1 %tst, label %load_vec, label %load_arr
>>
>> load_vec:
>>   %vec = load <4 x i16>* %vec_ptr, align 8
>>   br label %end
>>
>> load_arr:
>>   %arr = load <4 x i16>* %arr_ptr, align 2
>>   br label %end
>>
>> end:
>>   %val = phi <4 x i16> [%vec, %load_vec], [%arr, %load_arr]
>>   %elt = extractelement <4 x i16> %val, i32 0
>>   ret i16 %elt
>> }
>>
>> Cheers.
>>
>> Tim.
>>
>> P.S. I argue (modulo bugs):
>>
>> foo:
>>      cbz w0, .Lload_arr
>> .Lload_vec:
>>      ldr d0, [x2]
>>      b .Lend
>> .Lload_arr:
>>     ld1 { v0.4h }, [x1]
>
>
> I'm confused here! I think my solution is to ld1 in big-endian for align 2,
> and you has been arguing not but use ldr.

I argue we should use ldr where possible, but if not an ld1/rev pair
to emulate the effect of ldr.

> And with my solution, you needn't rev16 at all. If programmer is writing
> code like this or compiler is generating code like this, programmer or
> compiler should have known arr_ptr and vec_ptr have different data layout
> for big-endian!

That's not what the LLVM IR says or how it's treated by the
optimizers! If you want to provide vector types with weird differing
semantics, you can still do that in the front-end. Once the LLVM IR is
created it will have fixed semantics, and be subject to all the normal
passes.

> It's quite strage to add rev16 here! How and when do you decide to add rev16
> in compiler? Isn't it a disaster for compiler implementation?

You add it after every ld1 and before every st1 (except those produced
by intrinsics like @llvm.arm.vst1). Some of these might be foldable by
the DAG combiner, but otherwise it's just the price you must pay to be
able to see an "extractelement" and know which number to give back.

> int16_t a[4];
> int16x4_t b;
>
> Global variables a and b are different. a requires a fixed element layout,
> but b has different element layout for big-endian and little-endian,
> although both a and b have different data layout within element.
>
> Programmers or tools can't change this, otherwise it would be software
> ecosystem disaster.
>
> With your solution, it would introduce a lot of 'strange' code, if use
> ldr/str for variable a in big-endian mode. It's unacceptable for me.

I really don't believe it would. Could you give an end-to-end example
of such a disaster in my scheme? C code, LLVM IR and assembly? We can
discuss just where we start to disagree about the correct
transformations.

>> whether this is the ldr/str representation or the ld1/st1
>> representation is less important to me,
>
>
> I think it does matter because ldr/str and ld1/st1 have different data
> layout semantics.

I'm arguing that we could write a correct compiler that did either.
Which instruction we use should be completely invisible.

We could actually go further; we could write a compiler where, on
every load and store we make sure the lanes are in the following
order: {3, 1, 4, 5, 2, 6, 7, 0} in the vector register. It would be
completely perverse, but we could make it work if we tried hard
enough.

> Yes, we should avoid mixing them. The issue is how to guarantee a stable
> interface crossing different modules.
> Only using ldr/str for big-endian would introduce a lot of strange code in
> big-endian binary. Given that we have ld1/st1, why do we need those strange
> code?

Efficiency. The advantages of using ldr are:
1. We get to use normal adrp/ldr addressing for globals instead of adrp/add/ld1
2. We don't have to permute vectors being passed or returned from
function calls.

Maybe it's not worth it, or maybe it's worth it in some circumstances
but not others. Perhaps uses within a function outweigh uses across
ABI boundaries and we should attach the rev16 to the ldr instead.

But I think it's very important that we understand REVs will be needed
on one of them to preserve backend consistency.

> 4) Can you explain when ld1/st1 will be used for big-endian mode with your
> solution? What is the algorithm of generating ld1/st1 for compiler for
> big-endian? Or you are proposing not using ld1/st1 forever?

I propose: use ld1/st1 when the alignment requirements make it
necessary. Use them with a pattern like:

let Requires = [IsBE] in
def : Pat<(v4i16 (unaligned_load addr:$Rn)), (REV16 (LD1 addr:$Rn))>;

(We might be able to do some optimisations to fold some of those REVs,
but that would presumably come later).

> If you
> say for this case, the solution would be to use ld1/st1, then is it just to
> check alignment to decide which instruction we should use?

Yes. Given that ld1/st1 would require an extra REV, we would almost
always prefer ldr/str. The exception being when alignment prohibited
it.

> Anyway, I don't think the solution of doing things like changing [0] to [3]
> and inserting rev instruction by using some 'smart algorithm' in compiler is
> a reasonable choice.

Fair enough on the [0] to [3], but the REV problem isn't going away:
if we mix ld1/st1 and ldr/str, one of them will have to have a REV
attached as the default situation.

As I said, I agreed with you about that a couple of weeks ago. Both
representations could be made to work. In this case, the warts are in
function argument passing and bitcasts:

define i16 @foo(<4 x i16> %in) {
  %elt = extractelement <4 x i16> %in, i32 0
  ret i16 %elt
}

In the ld1/st1 world, this would have to compile to:
foo:
    rev16 v0.4h, v0.4h
    umov w0, v0.h[0]
    ret

Many of those REVs will be foldable of course. In theory. Doing it
well across basic-block boundaries strikes me as a difficult problem
in LLVM.

>     define i16 @foo() {
>       %val = load <4 x i16>* bitcast([4 x i16]* @a to <4 x i16>*)
>       %elt = extractelement <4 x i16> %val, i32 0
>       ret i16 %elt
>     }
>
> For this case, I assume optimizer wants to generate it. Then I would say
> optimization is generating invalid code for big-endian.
>
> I understand auto-vectorizer want to utilize this kind of casting to
> generate code using vector type. To achieve this goal, I think for
> big-endian, we should not only introduce bitcast, but "rev" in LLVM IR.

That solution is isomorphic to reversing the labelling of lanes in
big-endian (i.e. saying that element 0 has the highest address). It's
not going to happen. People discussed what the LLVM IR should mean a
long time ago and decided element 0 has the lowest address, even in
big-endian. PPC SIMD support (& possibly Mips) has been implemented
around this decision.

Clang knows this, and the optimisers know this. It's our job to teach
it to our backend.

Cheers.

Tim.