[PATCH] AARCH64_BE load/store rules fix for ARM ABI

Mon Mar 3 12:51:23 PST 2014

Hi Albrecht,

(Not a review of the new patch, I'm afraid).

> However: we only have one .ll type for that <4 x i32>
> So in BE, we can't allow both stores for the following - they would result
> in different memory layout - and how would the one who reads know what to
> do:

That, I think we can both agree on for the purposes of this discussion.

> Having done some geometry, there're some applications for 4x4 lattices -
> e.g. to compute perspective (sp?) views, etc.

Ok. Matrix is the usual English term, but I'm following here.

> So, 25 years ago I'd have thought "gee - there's a <4 x single> register
> type, I'll define the lattice to be 4 of them".
>
> Then to transpose such a lattice I'd have used the lane load intrinsics -
> why not: seems to be the intended use.
>
> And now I'm bummed:
> Those nifty <XYZW> vectors have been stored to memory as <WZYX> by STR - so
> the lane load that should have loaded Y actually loads Z.

Well, fair enough, but that's down to the specification for arm_neon.h
and has absolutely nothing to do with how we interpret the LLVM IR
that we see.

You should (hypothetically) take that up with the writers of
arm_neon.h and in the meantime rewrite your code so that it refers to
the lane numbers as defined by the spec. In this case, I think the
big-endian transpose would be (untested):

    float32x4x4_t matrix;
    matrix = vld4q_lane_f32(addr, matrix, 3);
    matrix = vld4q_lane_f32(addr + 1, matrix, 2);
    matrix = vld4q_lane_f32(addr + 2, matrix, 1);
    matrix = vld4q_lane_f32(addr + 3, matrix, 0);
    vst4q_f32(addr, matrix);

It sucks, but that's got nothing to do with LLVM.

> For me it's a BE-only type clash that can only be solved by not mixing LDR
> type loads with array type loads.

OK, here I think you're looking at things from the wrong perspective.
Regardless of what C, ARM or anyone else demands, a given LLVM IR
fragment is going to have a required set of semantics.

If those semantics match a particular ld1 instruction (and it would be
fastest), we should use it. Regardless of whether the user got here
via undefined behaviour, assuming big-endian was actually little,
actual intent or sheer good fortune.

> So I can't really recommend allowing regular pattern matching to emit array
> loads, when most other loads use a different layout.

I doubt we'd get there for array loads anyway: arrays aren't backend
value types so they're gone by the time the AArch64 code gets
involved.

> You might say "But reading array type data is fine".
> You're right - but that <4 x single>* is so easily casted to single[4]* -
> after all C doesn't have the strongest type system, and it normally just
> works - especially on AARCH64_LE.

That's almost certainly implementation-defined behaviour at best. We
should not be accommodating it. Certainly not at the expense of
well-behaved code by people who *do* know what they're doing.

>  - and the arm_neon.h doesn't look terribly type-safe as well:
>
> #define vst1q_f32_x2(__a, b) __extension__ ({ \
>   float32x4x2_t __b = (b); \
>   __builtin_neon_vst1q_x2_v(__a, (int8x16_t)__b.val[0],
> (int8x16_t)__b.val[1], 41); })

You're probably right. I've avoided thinking too hard about that
particular wrinkle because it might distress me; as long as it works
for the cases that *are* valid, I'd rather not know.

> That said: I currently follow the rule "single element is fine" (no swapping
> surprises possible).
> If you can come up with a rule that allows more instructions/patterns but is
> still safe, I'm all ears.

I'm not sure I follow this question, but my primary heuristic would
be: assume that whatever front-end or person that came up with the
LLVM IR you're processing knew what they were on about, and generate
the best code you can for it.

My biggest message is that, to disable the lane or duplicating
loads/stores, we need an LLVM IR fragment (without @llvm.aarch64.neon
intrinsics preferably) that would be miscompiled on big-endian if they
were allowed.

Cheers.

Tim.