[PATCH] AARCH64_BE load/store rules fix for ARM ABI

Albrecht Kadlec akadlec at a-bix.com
Mon Mar 3 11:46:59 PST 2014

On 2014-03-03 16:41, Tim Northover wrote:
>>   I think the problem will actually come with the intrinsics, where we
>>   probably want to generate this sequence from "vld1_lane_s32(addr, 
>> vec,
>>   3)"
> I no longer believe this. I'm not sure what I *do* believe though.
> Probably something about precisely what non-intrinsic C code would
> generate the original IR.
> Tim.

Hi Tim,

That was exactly the point at which I disabled _all_ suspicious 
patterns.
(Caveat: I'm not fully fluent in all the NEON array stores yet - nor in 
the intrinsics that use them.)

By now my view is as follows:

Since STR/LDR swap the elements relative to LD1/ST1 (the array 
loads/stores), either the in-register layout or the in-memory layout 
must be reversed. Since we want to keep computing in registers without 
tying knots in our brains, we keep the register layout the same - so 
the memory format differs between STR and ST1.
However, we only have one .ll type for that: <4 x i32>.
So in BE we can't allow both stores for the following - they would 
produce different memory layouts, and how would whoever reads that 
memory back know which one to expect:

define void @test_store_v4i32(<4 x i32>* %ptr, <4 x i32> %val) #0 {
; CHECK: test_store_v4i32
; CHECK: str	 {{q[0-9]+}}, [{{x[0-9]+}}]
   store <4 x i32> %val, <4 x i32>* %ptr, align 16
   ret void
}
Having done some geometry, there are applications for 4x4 matrices - 
e.g. computing perspective views.

So, 25 years ago I'd have thought "gee - there's a <4 x float> register 
type, I'll define the matrix as 4 of them".

Then, to transpose such a matrix, I'd have used the lane-load 
intrinsics - why not: that seems to be their intended use.

And now I'm bummed:
Those nifty <XYZW> vectors have been stored to memory as <WZYX> by STR - 
so the lane load that should have loaded Y actually loads Z.

My loaded "transposed" matrix is anything but the transpose.
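A portable sketch of that failure mode in plain C (the helper names are my own illustration): store the vector with big-endian STR semantics, then do what a lane load at index i does - read the element at byte offset 4*i:

```c
#include <assert.h>
#include <stdint.h>

/* Big-endian 32-bit store/load helpers. */
static void store_be32(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)(v >> 24); p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);  p[3] = (uint8_t)(v >> 0);
}
static uint32_t load_be32(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Store the vector with STR semantics on BE (lane 3 at the lowest
   address), then read lane i the way a lane load does: as the 32-bit
   element at byte offset 4*i. */
uint32_t lane_after_str_be(const uint32_t lanes[4], int i) {
    uint8_t mem[16];
    for (int k = 0; k < 4; k++)
        store_be32(mem + 4 * k, lanes[3 - k]);
    return load_be32(mem + 4 * i);
}
```

Asking for index 1 of an <X,Y,Z,W> vector returns Z, not Y - the mismatch described above.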

Sure we can document this BE-only behaviour (actually ARM should really 
do that VERY PROMINENTLY in the ABI).

We can also damn all those programmers who point out "On my LE machine 
it totally worked".

For me it's a BE-only type clash that can only be solved by not mixing 
LDR type loads with array type loads.

So I can't really recommend allowing regular pattern matching to emit 
array loads, when most other loads use a different layout.

The only exception is the use of intrinsics - in the hope that whoever 
read the intrinsics header file also read the prominent note there 
referring to the prominent note in the ABI - and knows what to expect 
for BE.

You might say "But reading array type data is fine".
You're right - but that <4 x float>* is so easily cast to float[4]* - 
after all, C doesn't have the strongest type system, and the cast 
normally just works - especially on AARCH64_LE.

  - and arm_neon.h doesn't look terribly type-safe either:

#define vst1q_f32_x2(__a, b) __extension__ ({ \
   float32x4x2_t __b = (b); \
   __builtin_neon_vst1q_x2_v(__a, (int8x16_t)__b.val[0], \
     (int8x16_t)__b.val[1], 41); })

I guess my opposition boils down to: who expects that <4 x float> is 
stored in reverse order and can't be fed into the complex loads 
(e.g. lane loads), when a single cast just makes it work on LE?

The biggest trouble is that nobody ever reads the documentation - even 
if it were written down there - and most chips will be LE, where it 
"just works".
That's a recipe for disaster.

If I had to use NEON on BE, I'd:
1) use intrinsics for all instructions and arrays for all data 
(-> consistent index order)
2) try not to pass parameters by value (use const refs) - except for 
leaf and near-leaf functions operating on registers (32 vector 
registers still feel the pressure from 4x4 matrices)
3) probably wrap the ugliness in some vector/matrix classes.
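Point 3 might look like the following minimal sketch (plain C rather than C++ classes; the names vec4i, vec4i_get and mat4_transpose are my own illustration): keep elements in index order in memory and only touch them through index-based accessors, so the same code means the same thing on LE and BE.

```c
#include <assert.h>
#include <stdint.h>

/* Elements always live in index order in memory, so index i means the
   same element on LE and BE; a real implementation would hide the
   NEON intrinsics (and any BE fixups) inside the accessors. */
typedef struct { int32_t e[4]; } vec4i;

static int32_t vec4i_get(const vec4i *v, int i) { return v->e[i]; }
static void vec4i_set(vec4i *v, int i, int32_t x) { v->e[i] = x; }

/* An index-based 4x4 transpose stays endian-agnostic. */
void mat4_transpose(vec4i m[4]) {
    for (int r = 0; r < 4; r++)
        for (int c = r + 1; c < 4; c++) {
            int32_t t = vec4i_get(&m[r], c);
            vec4i_set(&m[r], c, vec4i_get(&m[c], r));
            vec4i_set(&m[c], r, t);
        }
}
```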

As I said before: the correct solution would be separate memory data 
types and separate nonterminals - with conversion chain rules and a PBQP 
matcher to do those conversions optimally - you'd still have to live 
with quite a few swap instructions for BE.

That said: I currently follow the rule "single element is fine" (no 
swapping surprises possible).
If you can come up with a rule that allows more instructions/patterns 
but is still safe, I'm all ears.

