[PATCH] AARCH64_BE load/store rules fix for ARM ABI

Tim Northover t.p.northover at gmail.com
Fri Mar 14 01:05:01 PDT 2014


Hi Jiangning,

>>     declare void @foo(<4 x i16>)
>>     define void @bar(<4 x i16>* %addr) {
>>       %vec = load <4 x i16>* %addr
>>       call void @foo(<4 x i16> %vec)
>>       ret void
>>     }
>>
>> The AAPCS requires us to pass %vec *as if* it had been loaded by
>> "ldr", so if we use "ld1 {v0.4h}" we need some kind of "rev"
>> instruction to reformat it before the call. Otherwise a round-trip via
>> GCC (for example) could produce incorrect results (say, if @foo simply
>> stored the vector back to %addr).
>>
>
> For "rev" instructions, ARMv8ARM says "An application or device driver might
> have to interface to memory-mapped peripheral registers or shared memory
> structures that are not the same endianness as the internal data
> structures.", so we would only need this instruction if we want to interact
> between little-endian and big-endian. For the scenario of supporting only a
> single endianness, either big or little, we needn't use this instruction at
> all, because with the hardware endianness support, ldr/ld1 could always
> behave correctly for either endianness.

I don't think we're understanding each other. The instructions may
have been created for I/O purposes, but if we load vectors via ld1/st1
then we'll also have to use them to ensure ABI compatibility.

I assume you have access to a big-endian simulator at least. Try
compiling this with GCC:

    #include <arm_neon.h>
    extern int32x2_t var;

    void foo(int32x2_t in) {
       var = in;
    }

and this with LLVM (or mentally, if Clang doesn't do what you want):

    #include <stdio.h>
    #include <arm_neon.h>

    int32x2_t var;
    extern void foo(int32x2_t);

    int main() {
      var = vset_lane_s32(1, var, 0);
      var = vset_lane_s32(2, var, 1);
      foo(var);
      printf("%d %d\n", vget_lane_s32(var, 0), vget_lane_s32(var, 1));
    }

I think we can both agree that this should print "1 2", but I think
you'll find extra REVs are needed for compatibility if you decide
Clang must use ld1/st1.
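
For reference, GCC's big-endian output for foo is plausibly something
like this (a sketch, not verified output):

    foo:
        adrp x0, var
        str d0, [x0, :lo12:var]   // stores d0 using the AAPCS ("ldr") layout
        ret

GCC is internally consistent with that layout, so if Clang's main
passed the ld1 layout in d0 without a compensating REV, the two 32-bit
lanes would come back swapped and the program would print "2 1".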

> For the bitcast, I don't think we should generate any instruction. If
> ldr/str and ld1/st1 are only ever used in pairs, we shouldn't have any
> issue at all.

Agreed, until we hit an ABI-visible boundary. Unfortunately, it's also
essentially impossible to come up with rules other than the degenerate
ones ("always use ldr/str" or "always use ld1/st1") that ensure they
*are* only used in pairs like that.

> I would say for this case, to support strict mode, we should have to use
> ld1, although the address might have been expanded to 8-byte aligned,
> because "align 2" implies the data is from an array of elements,

From LLVM's perspective the "align 2" is an optimisation hint, with no
effect on semantics. It *will* be changed by optimisers, if they spot
that a larger alignment can be guaranteed by whatever means they
choose.

We cannot change that (and I wouldn't want to even if I could: it would
introduce a shadow type system and be horrible), so we have to make
codegen work regardless.

>> I don't believe so. Clang will generate array accesses as scalar
>> operations. The vectorizer may transform them, but only into
>> well-specified LLVM IR. How we implement that is our choice, and we
>> can use either ld1 or ldr. Do you have a counter-example?
>>
> Yes, we can apply optimizations, but we mustn't change the semantic
> interface across functions. My example is in a .h file: if we define,
>
> extern int16_t a[4];
>
> In function f1 defined in file file1, and function f2 in file file2, we
> should guarantee to use ld1/st1 to load/store variable a.

We should make no such guarantees. We should be free to use whatever
instructions we like as long as the semantics of C and LLVM IR are
preserved.

>  In C, we say (a + 1 == &a[1]) is true, and
> this should be guaranteed for big-endian as well.

Definitely, but we can still use ldr/str and make this work. It
involves remapping all lane-referencing instructions during codegen.
To make the example more concrete, under the ldr/str scheme (assuming
alignment can be preserved), a function like:

    int16_t foo() { return a[0]; }

might produce IR like this (odd-looking, but valid: we guarantee
element 0 has the lowest address even in big-endian):

    define i16 @foo() {
      %val = load <4 x i16>* bitcast([4 x i16]* @a to <4 x i16>*)
      %elt = extractelement <4 x i16> %val, i32 0
      ret i16 %elt
    }

This could then be compiled (assuming the alignment requirements are
met) to:

    foo:
        adrp x0, a
        ldr d0, [x0, :lo12:a]
        umov w0, v0.h[3]
        ret

Note the "[3]", rather than "[0]".
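
For contrast, a sketch of the same function under an ld1/st1 scheme,
where the lane index survives unchanged:

    foo:
        adrp x0, a
        add x0, x0, :lo12:a    // ld1 has no :lo12: addressing form, so materialise the address
        ld1 { v0.4h }, [x0]    // lane 0 = a[0] in both endiannesses
        umov w0, v0.h[0]
        ret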

Cheers.

Tim.


