[PATCH] AARCH64_BE load/store rules fix for ARM ABI

Fri Mar 14 21:11:52 PDT 2014

Hi Tim and James,

After such a long discussion, I'm really hoping I can understand and agree
with you, but I still can't be convinced by your solution.

1) For this issue,

>  In C, we say (a++ == &a[1]) is true,
> > this should be guaranteed for big-endian as well.
>
> Definitely, but we can still use ldr/str and make this work. It
> involves remapping all lane-referencing instructions during codegen.
> To make the example more concrete, under the ldr/str scheme (assuming
> alignment can be preserved), a function like:
>
>     int16_t foo() { return a[0]; }
>
> Might produce (for some odd reasons, but it's valid IR: we guarantee
> element 0 has the lowest address even in big-endian):
>
> 
>     define i16 @foo() {
>       %val = load <4 x i16>* bitcast([4 x i16]* @a to <4 x i16>*)
>       %elt = extractelement <4 x i16> %val, i32 0
>       ret i16 %elt
>     }
>
> This could then be assembled (assuming the alignment requirements are met)
> to:
>     foo:
>         adrp, x0, a
>         ldr d0, [x0, :lo12:a]
>         umov w0, v0.h[3]
>
> Note the "[3]", rather than "[0]".
>
>
Can you explain how to get address &a[0] for big-endian? It's a C
expression, and I think it should be simply

Suppose the data in a[4] is,
0x00aabb00: 01
0x00aabb02: 02
0x00aabb04: 03
0x00aabb06: 04

I think &a[0] is 0x00aabb00. If you agree, then

int16_t *p = &a[0]; p += 3; return *p;

What the return value is for big-endian? I think the value should be 04
rather than 01. Following you assembly code you gave, the result would be
01. I don't think it make sense, and isn't a disaster for programmer.

2) For the case you gave,

> 
>
> define i16 @foo(i1 %tst, <4 x i16>* %arr_ptr, <4 x i16>* %vec_ptr) {
>   br i1 %tst, label %load_vec, label %load_arr
>
> load_vec:
>   %vec = load <4 x i16>* %vec_ptr, align 8
>   br label %end
>
> load_arr:
>   %arr = load <4 x i16>* %arr_ptr, align 2
>   br label %end
>
> end:
>   %val = phi <4 x i16> [%vec, %load_vec], [%arr, %load_arr]
>   %elt = extractelement <4 x i16> %val, i32 0
>   ret i16 %elt
> }
>
> Cheers.
>
> Tim.
>
> P.S. I argue (modulo bugs):
>
> foo:
>      cbz w0, .Lload_arr
> .Lload_vec:
>      ldr d0, [x2]
>      b .Lend
> .Lload_arr:
>     ld1 { v0.4h }, [x1]

I'm confused here! I think my solution is to ld1 in big-endian for align 2,
and you has been arguing not but use ldr.

    rev16 v0.8b, v0.8b
>

And with my solution, you needn't rev16 at all. If programmer is writing
code like this or compiler is generating code like this, programmer or
compiler should have known arr_ptr and vec_ptr have different data layout
for big-endian!

It's quite strage to add rev16 here! How and when do you decide to add
rev16 in compiler? Isn't it a disaster for compiler implementation?

> .Lend:
>     umov w0, v0.h[3]
>     ret
>
> With two possible viable alternatives:
> 1. Drop rev16. Only ever use ld1. umov can use lane [0].
> 2. Attach rev16 to ldr instead of ld1. umov can use lane [0].
>

3) For the following statement you gave,

> I think the fundamental issue is over in-register representations of
> vectors. I think the only way to make the backend work sanely (and
> possibly at all) is to demand just one representation everywhere:
>

I think this is the point I totally agree with you. However, I don't this
is the only one we should guarantee. Instead we should also guarantee the
semantic correctness in C level programming.

int16_t a[4];
int16x4_t b;

Global variables a and b are different. a requires a fixed element layout,
but b has different element layout for big-endian and little-endian,
although both a and b have different data layout within element.

Programmers or tools can't change this, otherwise it would be software
ecosystem disaster.

With your solution, it would introduce a lot of 'strange' code, if use
ldr/str for variable a in big-endian mode. It's unacceptable for me.

> whether this is the ldr/str representation or the ld1/st1
> representation is less important to me,

I think it does matter because ldr/str and ld1/st1 have different data
layout semantics.

> but we mustn't mix them for
> the sanity of everyone concerned.

Yes, we should avoid mixing them. The issue is how to guarantee a stable
interface crossing different modules.
Only using ldr/str for big-endian would introduce a lot of strange code in
big-endian binary. Given that we have ld1/st1, why do we need those strange
code?

4) Can you explain when ld1/st1 will be used for big-endian mode with your
solution? What is the algorithm of generating ld1/st1 for compiler for
big-endian? Or you are proposing not using ld1/st1 forever?

For strict mode, on LLVM IR if we see 'align 2', you are arguing to use
ldr/str for big-endian, then would it raise exception potentially? If you
say for this case, the solution would be to use ld1/st1, then is it just to
check alignment to decide which instruction we should use?

Anyway, I don't think the solution of doing things like changing [0] to [3]
and inserting rev instruction by using some 'smart algorithm' in compiler
is a reasonable choice.

5) James says,

> To weigh in my two-pennyworth, I agree with Tim. Alignment is a hint that
> could be changed or removed at any time, and it shouldn't change semantics
> in any other way than that increasing it could cause an alignment fault.
>

OK. If you don't think using 'alignment' to distinguish them is not a good
implementation in LLVM IR, then Albrecht's proposal of using a new data
type would be a choice, although I think it would introduce huge code
change, and it would not be realistic.

So far I don't really see a example which can break the solution of using
'alignment' to determine the instruction choice between ldr/str and
ld1/st1. Your solution is doing this,

a) introducing bitcasting because of optimizations
b) introducing element index change because of explicitly exposure to
compiler
c) introducing rev because of mixing the different layout

All those don't sound a simple and nice solution to me.

If bitcasting is introduced by programmer, I think programmers should
guarantee the semantic correctness by themselves. If bitcasting is
introduced by compiler, then optimizer should guarantee the semantic
correctness.

    define i16 @foo() {
      %val = load <4 x i16>* bitcast([4 x i16]* @a to <4 x i16>*)
      %elt = extractelement <4 x i16> %val, i32 0
      ret i16 %elt
    }

For this case, I assume optimizer wants to generate it. Then I would say
optimization is generating invalid code for big-endian. Can you tell the
difference between [4 x i16] and <4 x i16> in LLVM IR? LLVM IR doc says,

* The array type is a very simple derived type that arranges elements
sequentially in memory.
* A vector type is a simple derived type that represents a vector of
elements.

If you say, ([4 x i16] *) and (<4 x i16> *) are only pointers and we can
always type casting them, then I would say the bitcasting is trying to
change the semantics of interpreting the data stored by the pointer,
because following natural alignment rule, [4 x i16] and <4 x i16> has
different alignment. Therefore, I don't think this is a correct bitcasting
generated by optimizer for big-endian.

I understand auto-vectorizer want to utilize this kind of casting to
generate code using vector type. To achieve this goal, I think for
big-endian, we should not only introduce bitcast, but "rev" in LLVM IR.
Otherwise, the transformation doesn't really keep the original semantic in
LLVM IR. Finally, probably the result would be same as you described, some
"rev"s would be introduced in binary, but the solution is totally different.

Thanks,
-Jiangning
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20140315/810e9998/attachment.html>