[PATCH] AARCH64_BE load/store rules fix for ARM ABI
Tim Northover
t.p.northover at gmail.com
Fri Mar 14 01:05:01 PDT 2014
Hi Jiangning,
>> declare void @foo(<4 x i16>)
>> define void @bar(<4 x i16>* %addr) {
>>   %vec = load <4 x i16>* %addr
>>   call void @foo(<4 x i16> %vec)
>>   ret void
>> }
>>
>> The AAPCS requires us to pass %vec *as if* it had been loaded by
>> "ldr", so if we use "ld1 {v0.4h}" we need some kind of "rev"
>> instruction to reformat it before the call. Otherwise a round-trip via
>> GCC (for example) could produce incorrect results (say, if @foo simply
>> stored the vector back to %addr).
>>
>
> For "rev" instructions, ARMv8ARM says "An application or device driver might
> have to interface to memory-mapped peripheral registers or shared memory
> structures that are not the same endianness as the internal data
> structures.", so we would only need this instruction if we want to interact
> between little-endian and big-endian. For the scenario of supporting an
> unique one only, either of big or little, we needn't to use this instruction
> at all, because with the hardware endianness support, ldr/ld1 could always
> behave correctly for different endianness.
I don't think we're understanding each other. The instructions may
have been created for I/O purposes, but if we load vectors via ld1/st1
then we'll also have to use them to ensure ABI compatibility.
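Concretely (a hand-written sketch of plausible big-endian code for @bar
above, not actual compiler output; register choices are illustrative), the
ld1-based scheme needs a fixup at the call boundary that the ldr-based
scheme does not:

bar:                            // ldr scheme: v0 is already in the AAPCS format
        ldr   d0, [x0]
        b     foo

bar:                            // ld1 scheme: lanes arrive in "array order"
        ld1   { v0.4h }, [x0]
        rev64 v0.4h, v0.4h      // reformat to the as-if-loaded-by-ldr layout
        b     foo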
I assume you have access to a big-endian simulator at least. Try
compiling this with GCC:
#include <arm_neon.h>
extern int32x2_t var;
void foo(int32x2_t in) {
  var = in;
}
and this with LLVM (or mentally, if Clang doesn't do what you want):
#include <stdio.h>
#include <arm_neon.h>
int32x2_t var;
extern void foo(int32x2_t);
int main() {
  var = vset_lane_s32(1, var, 0);
  var = vset_lane_s32(2, var, 1);
  foo(var);
  printf("%d %d\n", vget_lane_s32(var, 0), vget_lane_s32(var, 1));
}
I think we can both agree that this should print "1 2", but you'll find
extra REVs are needed for compatibility if you decide Clang must use
ld1/st1.
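Roughly (hand-written sketches, assuming var is reachable through a plain
adrp/:lo12: pair), GCC's foo will store the argument exactly as the AAPCS
delivered it:

foo:
        adrp  x0, var
        str   d0, [x0, :lo12:var]
        ret

If main were compiled with a hypothetical ld1/st1-only scheme and no REVs,
the two lanes of v0 would reach foo in the opposite order from the one
that big-endian str then writes out, so the elements land swapped in
memory and the program prints "2 1". A "rev64 v0.2s, v0.2s" before the
call is the fixup that restores "1 2".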
> For the bitcast, I don't think we should generate any instruction. If
> ldr/str and ld1/st1 are only ever used in matching pairs, we shouldn't
> have any issue at all.
Agreed, until we hit an ABI-visible boundary. Unfortunately, it's also
essentially impossible to come up with rules other than the degenerate
ones ("always use ldr/str" or "always use ld1/st1") that ensure they
*are* only used in pairs like that.
> I would say for this case, to support strict mode, we would have to use
> ld1, even though the address might have been promoted to 8-byte alignment,
> because "align 2" implies the data is from an array of elements,
From LLVM's perspective, "align 2" is an optimisation hint with no
effect on semantics. It *will* be changed by optimisers if they spot
that a larger alignment can be guaranteed by whatever means they
choose.
We cannot change that (and I wouldn't want to even if I could: it would
introduce a shadow type system and be horrible), so we have to make
codegen work regardless.
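The consequence at the instruction level (a sketch; a strict-alignment
big-endian target is assumed): the very same IR load is restricted to ld1
while only "align 2" is provable, but becomes eligible for ldr, with a
different lane layout, as soon as an optimiser raises the hint:

        ld1  { v0.4h }, [x0]    // all we may emit while only "align 2" is known
        ldr  d0, [x0]           // legal once "align 8" has been established

So any rule keyed off the alignment hint changes the chosen instruction
under optimisation, which is exactly why it can't guarantee the paired use
discussed above.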
>> I don't believe so. Clang will generate array accesses as scalar
>> operations. The vectorizer may transform them, but only into
>> well-specified LLVM IR. How we implement that is our choice, and we
>> can use either ld1 or ldr. Do you have a counter-example?
>>
> Yes, we can apply optimizations, but we shouldn't change the semantic
> interface across functions. My example is in a .h file: if we define
>
> extern int16_t a[4];
>
> and use it in function f1 defined in file1 and function f2 in file2, we
> should guarantee that ld1/st1 are used to load/store variable a.
We should make no such guarantees. We should be free to use whatever
instructions we like as long as the semantics of C and LLVM IR are
preserved.
> In C, we say (a + 1 == &a[1]) is true,
> and this should be guaranteed for big-endian as well.
Definitely, but we can still use ldr/str and make this work. It
involves remapping all lane-referencing instructions during codegen.
To make the example more concrete, under the ldr/str scheme (assuming
alignment can be preserved), a function like:
int16_t foo() { return a[0]; }
might produce (for some odd reason, but it's valid IR: we guarantee
element 0 has the lowest address even in big-endian):
define i16 @foo() {
  %val = load <4 x i16>* bitcast ([4 x i16]* @a to <4 x i16>*)
  %elt = extractelement <4 x i16> %val, i32 0
  ret i16 %elt
}
This could then be assembled (assuming the alignment requirements are met) to:
foo:
        adrp x0, a
        ldr  d0, [x0, :lo12:a]
        umov w0, v0.h[3]
        ret
Note the "[3]", rather than "[0]".
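For symmetry, the store side would remap lanes the same way (again a
hand-written sketch mirroring the example above; store_a0 is a
hypothetical function):

store_a0:                        // void store_a0(int16_t v) { a[0] = v; }
        adrp x8, a
        ldr  d0, [x8, :lo12:a]
        mov  v0.h[3], w0         // lane 0 of the IR vector is h[3] on big-endian
        str  d0, [x8, :lo12:a]
        ret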
Cheers.
Tim.