[PATCH] AARCH64_BE load/store rules fix for ARM ABI

Sun Mar 16 20:37:13 PDT 2014

Hi Tim,

>
> 
> > int16_t a[4];
> > int16x4_t b;
> >
> > Global variables a and b are different. a requires a fixed element
> layout,
> > but b has different element layout for big-endian and little-endian,
> > although both a and b have different data layout within element.
> >
> > Programmers or tools can't change this, otherwise it would be software
> > ecosystem disaster.
> >
> > With your solution, it would introduce a lot of 'strange' code, if use
> > ldr/str for variable a in big-endian mode. It's unacceptable for me.
>
> I really don't believe it would. Could you give an end-to-end example
> of such a disaster in my scheme? C code, LLVM IR and assembly? We can
> discuss just where we start to disagree about the correct
> transformations.
>

{code}
// This is scenario 1, and I will give scenario 2 later on
$ cat common.h
// In a common C head file common.h for programmer p1 and p2 to use
#include <stdint.h>
extern int16_t a[128];
$ cat file1.c
// Programmer p1 writes code in file1
#include "common.h"
int main()
{
    int i;
    for (i=0; i<128; i++) {
        a[i] = i;
    }
    // And then write a to output file "data" char by char.
}
$ cat file2.c
// Programmer p2 writes code in file2
#include "common.h"
#define FAIL 1
int main()
{
    int i;
    // Read from file 'data' to a char by char
    for (i=0; i<128; i++) {
        if (a[i] != i) return FAIL;
    }
}
{code}

This is a very common scenario in big-endian programming word, and
programmers use the common head file to force the protocol among different
modules.

Note that, essentially p1 and p2 don't really care if their code is
auto-vectorized or not for each other at all. For some reason, p1 wants
high performance, but p2 wants low power, so the code in file1 is
vectorized, while the code in file2 isn't.

Suppose file1 can be vectorized, and we generate code like below,

{code}
; ModuleID = 'file1.c'
target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
target triple = "aarch64--linux-gnuabi"

@a = external global [128 x i16]

; Function Attrs: nounwind
define i32 @main() #0 {
entry:
  br label %vector.body

vector.body:                                      ; preds = %vector.body,
%entry
  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
  %0 = trunc i64 %index to i32
  %broadcast.splatinsert6 = insertelement <8 x i32> undef, i32 %0, i32 0
  %broadcast.splat7 = shufflevector <8 x i32> %broadcast.splatinsert6, <8 x
i32> undef, <8 x i32> zeroinitializer
  %induction8 = add <8 x i32> %broadcast.splat7, <i32 0, i32 1, i32 2, i32
3, i32 4, i32 5, i32 6, i32 7>
  %1 = trunc <8 x i32> %induction8 to <8 x i16>
  %2 = getelementptr inbounds [128 x i16]* @a, i64 0, i64 %index
  %3 = bitcast i16* %2 to <8 x i16>*
  store <8 x i16> %1, <8 x i16>* %3, align 2
  %index.next = add i64 %index, 8
  %4 = icmp eq i64 %index.next, 128
  br i1 %4, label %for.end, label %vector.body, !llvm.loop !1

for.end:                                          ; preds = %vector.body
  ret i32 0
}
{code}

So now the question is for "store <8 x i16> %1, <8 x i16>* %3, align 2",
what do you want to generate in assembly code?

With my solution I want to simply generate st1 for this LLVM IR, simply
because it is with 'align 2'.

Whit your solution, what do you want to generate?
If you want to generate str, I think the result is incorrect, because it
would fail in strict mode, and another programmer p2 would see FAIL either.
If you want to generate "rev16+ st1", another programmer p2 would see FAIL.

> >> whether this is the ldr/str representation or the ld1/st1
> >> representation is less important to me,
> >
> >
> > I think it does matter because ldr/str and ld1/st1 have different data
> > layout semantics.
>
> I'm arguing that we could write a correct compiler that did either.
> Which instruction we use should be completely invisible.
>
> We could actually go further; we could write a compiler where, on
> every load and store we make sure the lanes are in the following
> order: {3, 1, 4, 5, 2, 6, 7, 0} in the vector register. It would be
> completely perverse, but we could make it work if we tried hard
> enough.
>

Of course this is unrealistic and I think it's unnecessary.

>
> > Yes, we should avoid mixing them. The issue is how to guarantee a stable
> > interface crossing different modules.
> > Only using ldr/str for big-endian would introduce a lot of strange code
> in
> > big-endian binary. Given that we have ld1/st1, why do we need those
> strange
> > code?
>
> Efficiency.

If the reason of using ldr/str is only for performance, then I'm OK. I
have been thought this was a correctness issue rather than a performance
issue only. If this is only for performance benefit, we needn't to discuss
any more.

> The advantages of using ldr are:
> 1. We get to use normal adrp/ldr addressing for globals instead of
> adrp/add/ld1
> 2. We don't have to permute vectors being passed or returned from
> function calls.
>

Do you mean this permutation is introduced by AAPCS64, because it requires
to use ldr/str?

> Maybe it's not worth it, or maybe it's worth it in some circumstances
> but not others. Perhaps uses within a function outweigh uses across
> ABI boundaries and we should attach the rev16 to the ldr instead.
>
> But I think it's very important that we understand REVs will be needed
> on one of them to preserve backend consistency.
>
> > 4) Can you explain when ld1/st1 will be used for big-endian mode with
> your
> > solution? What is the algorithm of generating ld1/st1 for compiler for
> > big-endian? Or you are proposing not using ld1/st1 forever?
>
> I propose: use ld1/st1 when the alignment requirements make it
> necessary. Use them with a pattern like:
>
> 
> let Requires = [IsBE] in
> def : Pat<(v4i16 (unaligned_load addr:$Rn)), (REV16 (LD1 addr:$Rn))>;
>

Well, what unaligned_load is here? I'm confused again! Isn't just to check
not total size alignment as I described in my proposal?

Instead we want to simply use the pattern below,

let Requires = [IsBE] in
def : Pat<(v4i16 (unaligned_load addr:$Rn)), (LD1 addr:$Rn)>;

If you add rev16 here, you would have to do the action like changing [0] to
[3] for element access. This is wired.

>
> (We might be able to do some optimisations to fold some of those REVs,
> but that would presumably come later).
>
> > If you
> > say for this case, the solution would be to use ld1/st1, then is it just
> to
> > check alignment to decide which instruction we should use?
>
> Yes. Given that ld1/st1 would require an extra REV, we would almost
> always prefer ldr/str. The exception being when alignment prohibited
> it.
>
> > Anyway, I don't think the solution of doing things like changing [0] to
> [3]
> > and inserting rev instruction by using some 'smart algorithm' in
> compiler is
> > a reasonable choice.
>
> Fair enough on the [0] to [3], but the REV problem isn't going away:
> if we mix ld1/st1 and ldr/str, one of them will have to have a REV
> attached as the default situation.
>

Well. How to mix them? bitcast introduced by auto-vectorizer can't justify
it. If it is true in auto-vectorizer, I think it would be a bug, but I
doubt auto-vectorizer is generating this.

Can you also give me a real case in C code, and show me the IR that we
can't simply use ld1/st1 without rev16?

>
> As I said, I agreed with you about that a couple of weeks ago. Both
> representations could be made to work. In this case, the warts are in
> function argument passing and bitcasts:
>
> define i16 @foo(<4 x i16> %in) {
>   %elt = extractelement <4 x i16> %in, i32 0
>   ret i16 %elt
> }
>
> In the ld1/st1 world, this would have to compile to:
> foo:
>     rev16 v0.4h, v0.4h
>

I don't think rev16 is necessary here. As you mentioned we should keep an
unique data layout in register. If you don't use rev after ld1, this rev
would not be necessary at all.

>     umov w0, v0.h[0]
>     ret
>
> Many of those REVs will be foldable of course. In theory. Doing it
> well across basic-block boundaries strikes me as a difficult problem
> in LLVM.
>
> >     define i16 @foo() {
> >       %val = load <4 x i16>* bitcast([4 x i16]* @a to <4 x i16>*)
> >       %elt = extractelement <4 x i16> %val, i32 0
> >       ret i16 %elt
> >     }
> >
> > For this case, I assume optimizer wants to generate it. Then I would say
> > optimization is generating invalid code for big-endian.
> >
> > I understand auto-vectorizer want to utilize this kind of casting to
> > generate code using vector type. To achieve this goal, I think for
> > big-endian, we should not only introduce bitcast, but "rev" in LLVM IR.
>
> That solution is isomorphic to reversing the labelling of lanes in
> big-endian (i.e. saying that element 0 has the highest address). It's
> not going to happen. People discussed what the LLVM IR should mean a
> long time ago and decided element 0 has the lowest address, even in
> big-endian.

Well, this surprised me.

The int16x8_t defined in arm_neon.h is a machine vector type, I think, and
it intends to map hardware VPR. So for big-endian, all bytes should be
reversed if use str follwoing AAPCS64. The vector in LLVM IR only
represents symbol, the memory address should not be imposed until
load/store time, and big-endian/little-endian should only affect the memory
written rather than the content in register.

Consider the following vectorized version of C code we manually written,

{code}
// this is scenario 2.
$ cat common1.h
// In a common C head file common.h for programmer p1 and p2 to use
#include <arm_neon.h>
extern int16x8_t a[16];
$ cat fill3.h
cat: fill3.h: No such file or directory
jialiu01 at jialiu01-server:~/test/endian$ cat file3.c
// Programmer p1 writes code in file1
#include "common1.h"
int main()
{
    int i;
    int16_t j;
    for (i=0; i<16; i++) {
        j = 8*i;
        a[i] = (int16x8_t){j, j+1, j+2, j+3, j+4, j+5, j+6, j+7};
    }
    // And then write a to output file "data" char by char.
}
$ cat file4.c
// Programmer p2 writes code in file2
#include "common1.h"
#define FAIL 1
int main()
{
    int i;
    int j;
    // Read from file 'data' to a char by char
    for (i=0; i<16; i++) {
        for (j=0; j<8; j++) {
            if (a[i][j] != i*8+j) return FAIL;
        }
    }
}
{code}

I think it's a valid one. With my solution, it's very simple, str will be
used file3.c, while ldr will be used for file4.c, because they have the
following LLVM IR generated respectively.

store <8 x i16> %vecinit27, <8 x i16>* %arrayidx, align 16   // in LLVM IR
of file3.c

%0 = load <8 x i16>* %arrayidx, align 16  // in LLVM IR of file4.c

We needn't rev16 here. Note that the byte ordering in the data file being
transferred between programmer p1 and p2 will be different from the cases I
gave by common.h/file1.c/file2.c.

In the file transferred of scenario 1, the char ordering in disk should be
like 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ..., 127. (let's ignore the ordering
difference for inside each number for big-endian and little-endian, because
we are not discussing about that). In the file transferred of scenario 2,
the char order in disk should be like 7, 6, 5, 4, 3, 2, 1, 0, 15, 14, ... ,
127,..., 121, 120.

Of course, you can also exchange data without using file, and link the
binaries generated by p1 and p2, but the result would be the same.

Can you explain to me what instructions are going to generated with your
solution?

> PPC SIMD support (& possibly Mips) has been implemente
> 
> d
> around this decision.
>

I'm not sure if PPC is doing this way. Maybe you can prove me around this.
But even if other targets is implementing things this way, it can't simply
arguing it is a correct one. With my two-cents experience of software
engineering, sometimes people may use a ugly solution to solve a simple
problem.

Thanks,
-Jiangning
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20140317/343958b7/attachment.html>