[PATCH] AARCH64_BE load/store rules fix for ARM ABI

Mon Mar 3 06:48:23 PST 2014

> +let Predicates = [IsLE] in {
>  // Load single 1-element structure to all lanes of 1 register
> ----------------
> James Molloy wrote:
>> Shouldn't a splatting LD1 still work in BE mode?
> At most the ones that only read one element and duplicate that.
>
> The multi-element reads will have unexpected order (that struct was STRed!), so can only be used via intrinsics to read from arrays.

I'm not sure I follow here. Structs aren't short vectors, so their
layout is dictated by the normal C rules, and I think they will have
the expected order on both little- and big-endian machines. The example
I'm thinking of might be written as:

    #include <arm_neon.h>
    typedef struct { uint8_t r, g, b; } RGB;
    uint8x8x3_t read(RGB *colours) {
      uint8x8x3_t result;
      result.val[0] = vdup_n_u8(colours->r);
      result.val[1] = vdup_n_u8(colours->g);
      result.val[2] = vdup_n_u8(colours->b);
      return result;
    }

I think this would be best implemented as an ld3r on both big and
little-endian systems, and is the intended use of that instruction.
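
To make the endianness argument concrete, here is a rough portable C
model (not real NEON code). The names u8x8x3_model, read_by_field and
ld3r_model are made up for illustration, and the model only assumes
that a load-and-replicate puts element i from memory into every lane of
register i. Because the elements are single bytes, nothing in it
depends on byte order:

```c
#include <assert.h>
#include <stddef.h>   /* offsetof */
#include <stdint.h>
#include <string.h>   /* memset */

typedef struct { uint8_t r, g, b; } RGB;

/* Normal C layout rules: the fields are consecutive bytes on any target. */
_Static_assert(offsetof(RGB, r) == 0 && offsetof(RGB, g) == 1 &&
               offsetof(RGB, b) == 2, "RGB fields are consecutive bytes");

/* Hypothetical stand-in for uint8x8x3_t: three registers of eight lanes. */
typedef struct { uint8_t val[3][8]; } u8x8x3_model;

/* Field-by-field version, mirroring the three vdup_n_u8 calls. */
static u8x8x3_model read_by_field(const RGB *c) {
    u8x8x3_model out;
    memset(out.val[0], c->r, 8);
    memset(out.val[1], c->g, 8);
    memset(out.val[2], c->b, 8);
    return out;
}

/* Assumed model of a 3-element load-and-replicate: element i at addr
   is copied to every lane of register i. */
static u8x8x3_model ld3r_model(const void *addr) {
    const uint8_t *p = (const uint8_t *)addr;
    u8x8x3_model out;
    assert(p != NULL);
    for (int reg = 0; reg < 3; reg++)
        memset(out.val[reg], p[reg], 8);
    return out;
}
```

Under those assumptions the two functions return identical results for
any RGB value, which is the behaviour I'd expect from the instruction
on either endianness.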

Could you give a snippet of either LLVM IR or C that you think we
might naively use ldNr for, but would be invalid on big-endian
systems? Just so I can get a better idea of what you're thinking of.

> -defm LD1LN : LDN_Lane_BHSD<0b0, 0b0, "VOne", "ld1">;
> +let Predicates = [IsLE] in {
> +  // Load single 1-element structure to one lane of 1 register.
> ----------------
> James Molloy wrote:
>> Will 1-element to 1-lane also work in BE mode?
> added the following comment to the pattern and removed the predicate.
>
> // This will not work as intended in BE mode, if the matcher generates it to
> // load a vector to a lane. (STR q0 stored the elements swapped)
> // Must always use an intrinsic, so the user knows it's loading from an array
> // layout.

I don't believe this is true either. Consider the alternatives for the IR:

    define <4 x i32> @foo(<4 x i32> %vec, i32* %addr) {
      %elt = load i32* %addr
      %newvec = insertelement <4 x i32> %vec, i32 %elt, i32 0
      ret <4 x i32> %newvec
    }

This is the obvious, canonical situation where we'd want a pattern for
"ld1 (lane)". And indeed we generate "ld1 {v0.s}[0], [x0]". But
what's the alternative if the ld1 is disabled? I strongly suspect
you'll find it's:

    ldr w0, [x0]
    ins v0.s[0], w0

which has exactly the same semantics.
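
As a sanity check on that claim, here is a small C model of the two
sequences. It is only a sketch: v4i32_model, ld1_lane_model and
ldr_ins_model are hypothetical names, lanes are modelled as plain array
indices, and it does not try to model how the register's bytes are laid
out. What it shows is that both paths perform one element-sized load in
the target's byte order and write the result to the same lane:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>   /* memcpy */

/* Hypothetical model of a 128-bit register viewed as four 32-bit lanes. */
typedef struct { int32_t lane[4]; } v4i32_model;

/* "ld1 (lane)": one element-sized load straight into the chosen lane. */
static v4i32_model ld1_lane_model(v4i32_model vec, const int32_t *addr,
                                  unsigned lane) {
    assert(lane < 4);
    memcpy(&vec.lane[lane], addr, sizeof vec.lane[lane]);
    return vec;
}

/* "ldr" then "ins": scalar load into a general register, then insert. */
static v4i32_model ldr_ins_model(v4i32_model vec, const int32_t *addr,
                                 unsigned lane) {
    int32_t w0;
    assert(lane < 4);
    w0 = *addr;            /* models: ldr w0, [x0] */
    vec.lane[lane] = w0;   /* models: ins into the chosen lane */
    return vec;
}
```

In this model the two functions are interchangeable for every lane, so
predicating one of them on IsLE while still emitting the other would
not change what ends up in the register.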

I think the problem will actually come with the intrinsics, where we
probably want to generate this sequence from "vld1q_lane_s32(addr, vec,
3)". But I'd strongly suggest approaching that from the front-end, since
it should be mapping to that LLVM IR anyway.
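
For illustration, this is roughly the shape I'd expect the front-end to
produce for a lane-load intrinsic, written with the GCC/Clang
vector_size extension instead of arm_neon.h so it builds anywhere. The
names v4i32 and load_lane3 are made up, and I haven't checked the exact
IR Clang emits for it; my assumption is that it becomes a scalar load
followed by an insertelement at index 3:

```c
#include <assert.h>
#include <stdint.h>

/* GCC/Clang vector extension; v4i32 is a hypothetical name. */
typedef int32_t v4i32 __attribute__((vector_size(16)));

/* Sketch of a lane load: read one element, insert it at lane index 3. */
static v4i32 load_lane3(const int32_t *addr, v4i32 vec) {
    assert(addr);      /* sketch only: require a valid pointer */
    vec[3] = *addr;    /* scalar load, then insert at lane 3 */
    return vec;
}
```

Written this way, the lane number is a property of the IR vector, and
any big-endian adjustment is left to the back-end's existing patterns.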

Cheers.

Tim.