[PATCH] AARCH64_BE load/store rules fix for ARM ABI

Mon Mar 3 13:45:50 PST 2014

On 2014-03-03 21:51, Tim Northover wrote:
> Hi Albrecht,

that was quick !

> (Not a review of the new patch, I'm afraid).
Oh nooooooo! :-)

>> Having done some geometry, there're some applications for 4x4 lattices 
>> -
>> e.g. to compute perspective (sp?) views, etc.
> 
> Ok. Matrix is the usual English term, but I'm following here.
darn - it took years to get accustomed to lattice!
Matrix is the German term as well.

>> So, 25 years ago I'd have thought "gee - there's a <4 x single> 
>> register
>> type, I'll define the lattice to be 4 of them".
>> 
>> Then to transpose such a lattice I'd have used the lane load 
>> intrinsics -
>> why not: seems to be the intended use.
>> 
>> And now I'm bummed:
>> Those nifty <XYZW> vectors have been stored to memory as <WZYX> by STR 
>> - so
>> the lane load that should have loaded Y actually loads Z.
> 
> Well, fair enough, but that's down to the specification for arm_neon.h
> and has absolutely nothing to do with how we interpret the LLVM IR
> that we see.

not really - favoring STR introduced an incompatible memory layout (for 
a good reason - see the vararg vector union register example)

These two in coexistence are asking for trouble - especially as higher 
level matrix algorithms do a lot of index computations.

> You should (hypothetically) take that up with the writers of
> arm_neon.h and in the meantime rewrite your code so that it refers to
> the lane numbers as defined by the spec. In this case, I think the
> big-endian transpose would be (untested):
> 
>     float32x4x4_t matrix;
>     matrix = vld4q_lane_f32(addr, matrix, 3);
>     matrix = vld4q_lane_f32(addr + 1, matrix, 2);
>     matrix = vld4q_lane_f32(addr + 2, matrix, 1);
>     matrix = vld4q_lane_f32(addr + 3, matrix, 0);
>     vst4q_f32(addr, matrix);

non-portable code between two instances of the same architecture is a 
first for me (but I could see it coming)

> It sucks, but that's got nothing to do with LLVM.

That's not even beginning to express the wording of a programmer who's 
bitten by code ported from AARCH64_LE to AARCH64_BE.
I wouldn't want to make 1st - 4th level support for that.

>> For me it's a BE-only type clash that can only be solved by not mixing 
>> LDR
>> type loads with array type loads.
> 
> OK, here I think you're looking at things from the wrong perspective.
> Regardless of what C, ARM or anyone else demands, a given LLVM IR
> fragment is going to have a required set of semantics.

Right, but it ought to use different types for different memory layouts.
The HVAs and splitting up of structs make my back-hair curl when I think 
of LDR & BE.

I still hope that I just don't know enough to see it's all ok.

> If those semantics match a particular ld1 instruction (and it would be
> fastest), we should use it. Regardless of whether the user got here
> via undefined behaviour, assuming big-endian was actually little,
> actual intent or sheer good fortune.

Responding to our compiler customers, the large majority of supplied 
code was not ANSI conforming - there's the standard that only compiler 
writers ever read - and there's public expectations.

I for one would really expect not to have to reverse my array indices 
for BE.

>> So I can't really recommend allowing regular pattern matching to emit 
>> array
>> loads, when most other loads use a different layout.
> 
> I doubt we'd get there for array loads anyway: arrays aren't backend
> value types so they're gone by the time the AArch64 code gets
> involved.

yeah - so you get a pointer to single and don't know whether it was a 
struct or vector or an array that was torn to pieces - what kind of load 
do you use for the next 4 elements - LD1 or LDR ?

You don't know how long the array was and whether your pointer points to 
an multiple-of-4 index.

I really hope, vectors are never lowered in any way.
However definitions like

typedef __attribute__((neon_vector_type(2)))  float32_t float32x2_t;

typedef struct float32x2x2_t {
   float32x2_t val[2];
} float32x2x2_t;

make me already wonder whether float32x2x2 should be loaded to one or 
two neon regs.
Parameter passing is probably garbled by STR.

>> You might say "But reading array type data is fine".
>> You're right - but that <4 x single>* is so easily casted to 
>> single[4]* -
>> after all C doesn't have the strongest type system, and it normally 
>> just
>> works - especially on AARCH64_LE.
> 
> That's almost certainly implementation-defined behaviour at best.

Maybe - casting homogeneous-type aggregates to arrays of the same basic 
type is generally considered straight forward.
Still very strange, unreasonable and unexpected.

> We should not be accommodating it. Certainly not at the expense of
> well-behaved code by people who *do* know what they're doing.

Hey - that's actually been my argument most of the time - I'm getting 
old ;-)

>>  - and the arm_neon.h doesn't look terribly type-safe as well:
>> 
>> #define vst1q_f32_x2(__a, b) __extension__ ({ \
>>   float32x4x2_t __b = (b); \
>>   __builtin_neon_vst1q_x2_v(__a, (int8x16_t)__b.val[0],
>> (int8x16_t)__b.val[1], 41); })
> 
> You're probably right. I've avoided thinking too hard about that
> particular wrinkle because it might distress me; as long as it works
> for the cases that *are* valid, I'd rather not know.

intriguing strategy - may I copy that ?

>> That said: I currently follow the rule "single element is fine" (no 
>> swapping
>> surprises possible).
>> If you can come up with a rule that allows more instructions/patterns 
>> but is
>> still safe, I'm all ears.
> 
> I'm not sure I follow this question, but my primary heuristic would
> be: assume that whatever front-end or person that came up with the
> LLVM IR you're processing knew what they were on about, and generate
> the best code you can for it.
> 
> My biggest message is that, to disable the lane or duplicating
> loads/stores, we need an LLVM IR fragment (without @llvm.aarch64.neon
> intrinsics preferably) that would be miscompiled on big-endian if they
> were allowed.

Ok, if you LGTM that, I'll change the predicates to comments and let 
someone else find the failing patterns.
We can always hide behind "undefined behaviour" signs when the mob 
arrives.

Cheers,
Albrecht