[PATCH] AARCH64_BE load/store rules fix for ARM ABI
Albrecht Kadlec
akadlec at a-bix.com
Mon Mar 3 11:46:59 PST 2014
On 2014-03-03 16:41, Tim Northover wrote:
>> I think the problem will actually come with the intrinsics, where we
>> probably want to generate this sequence from "vld1_lane_s32(addr,
>> vec,
>> 3)"
>
> I no longer believe this. I'm not sure what I *do* believe though.
> Probably something about precisely what non-intrinsic C code would
> generate the original IR.
>
> Tim.
Hi Tim,
That was exactly the point when I disabled _all_ suspicious
instructions/patterns.
(Caveat: I'm not fully fluent in all neon array stores yet - nor in the
intrinsics to use them)
By now my view is as follows:
Since STR/LDR swap the elements compared to LD1/ST1 (array
loads/stores), either the in-register layout or the in-memory layout
must be reversed.
Since we want to keep computing in registers without tying knots in our
brains, we keep the register layout the same - so the memory format
differs between STR and ST1.
However, we only have one .ll type for that: <4 x i32>.
So in BE we can't allow both stores for the following - they would
result in different memory layouts - and how would whoever reads the
data back know what to do:
define void @test_store_v4i32(<4 x i32>* %ptr, <4 x i32> %val) #0 {
; CHECK: test_store_v4i32
; CHECK: str {{q[0-9]+}}, [{{x[0-9]+}}]
entry:
store <4 x i32> %val, <4 x i32>* %ptr, align 16
ret void
}
Having done some geometry, there are some applications for 4x4 lattices
- e.g. to compute perspective views, etc.
So, 25 years ago I'd have thought "gee - there's a <4 x single> register
type, I'll define the lattice to be 4 of them".
Then to transpose such a lattice I'd have used the lane load intrinsics
- why not: seems to be the intended use.
And now I'm bummed:
Those nifty <XYZW> vectors have been stored to memory as <WZYX> by STR -
so the lane load that should have loaded Y actually loads Z.
My loaded "transposed" lattice is anything but the transpose.
Sure we can document this BE-only behaviour (actually ARM should really
do that VERY PROMINENTLY in the ABI).
We can also damn all those programmers who point out "On my LE machine
it totally worked".
For me it's a BE-only type clash that can only be solved by not mixing
LDR-type loads with array-type loads.
So I can't really recommend allowing regular pattern matching to emit
array loads, when most other loads use a different layout.
The only exception is the use of intrinsics - in the hope that whoever
read the intrinsics header file also read the prominent note there
referring to the prominent note in the ABI - and knows what to expect
for BE.
You might say "But reading array type data is fine".
You're right - but that <4 x single>* is so easily cast to single[4]* -
after all, C doesn't have the strongest type system, and it normally
just works - especially on AARCH64_LE.
- and arm_neon.h doesn't look terribly type-safe either:
#define vst1q_f32_x2(__a, b) __extension__ ({ \
  float32x4x2_t __b = (b); \
  __builtin_neon_vst1q_x2_v(__a, (int8x16_t)__b.val[0], \
                            (int8x16_t)__b.val[1], 41); })
I guess my opposition boils down to: who expects that <4 x single> is
stored in reverse order and can't be sourced into the complex loads
(e.g. lane loads), when a single cast just makes it work on LE?
Biggest trouble is that nobody ever reads the documentation - even if
you could read it there - and most chips will be LE, where it "just
worked".
That's a recipe for disaster.
If I had to use neon on BE, I'd:
1) use intrinsics for all instructions and arrays for all data (-> index
computations!).
2) try not to pass parameters by value (use const ref) - except for leaf
and near-leaf functions operating on registers (even 32 vector regs feel
register pressure from lattices).
3) probably wrap the ugliness in some vector/lattice classes.
As I said before: the correct solution would be separate memory data
types and separate nonterminals - with conversion chain rules and a PBQP
matcher to do those conversions optimally - you'd still have to live
with quite a few swap instructions for BE.
That said: I currently follow the rule "single element is fine" (no
swapping surprises possible).
If you can come up with a rule that allows more instructions/patterns
but is still safe, I'm all ears.
Cheers,
Albrecht