[PATCH] AARCH64_BE load/store rules fix for ARM ABI

Tim Northover t.p.northover at gmail.com
Tue Mar 18 00:23:29 PDT 2014


Hi Jiangning,

Thanks for the detailed example. I think the key line is:

>   %induction8 = add <8 x i32> %broadcast.splat7, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>

> If you want to generate "rev16+ st1", another programmer p2 would see FAIL.

The constant vector in the "add" line is probably not what you were
expecting. Since it's basically saying "put 0 in element 0, 1 in
element 1, ..." its lanes are reversed like everything else; you might
write:
    movz w0, #7
    ins v0.h[0], w0
    [...]

More realistically, it would probably be a constpool load, where the
entry has been corrected for this. Either way, I believe the test
would pass.

> If the reason of using ldr/str is only for performance, then I'm OK. I have
> been thought this was a correctness issue rather than a performance issue
> only. If this is only for performance benefit, we needn't to discuss any
> more.

You'd only want to mix the two for performance reasons, but you have
to do it correctly.

>> The advantages of using ldr are:
>> 1. We get to use normal adrp/ldr addressing for globals instead of
>> adrp/add/ld1
>> 2. We don't have to permute vectors being passed or returned from
>> function calls.
>
>
> Do you mean this permutation is introduced by AAPCS64, because it requires
> to use ldr/str?

Yes. The permutations at function call boundaries are because of the AAPCS.

The ones at load/store (whether ldr or ld1) are because we need a
consistent in-register representation. And the ones at bitcast (when
ld1 is chosen as the primary representation) are because of the LLVM
LangRef and its definition of bitcast.

>> let Requires = [IsBE] in
>> def : Pat<(v4i16 (unaligned_load addr:$Rn)), (REV16 (LD1 addr:$Rn))>;
>
> Well, what unaligned_load is here? I'm confused again! Isn't just to check
> not total size alignment as I described in my proposal?

Pretty much, yes.

> Instead we want to simply use the pattern below,
>
> let Requires = [IsBE] in
> def : Pat<(v4i16 (unaligned_load addr:$Rn)), (LD1 addr:$Rn)>;

Yep. In that world we would have to write

let Requires = [IsBE] in
def : Pat<(v4i16 (aligned_load addr:$Rn)), (REV16 (LDR addr:$Rn))>;

when we wanted the efficiency gains of LDR.

> If you add rev16 here, you would have to do the action like changing [0] to
> [3] for element access. This is wired.

Agreed.

>> Fair enough on the [0] to [3], but the REV problem isn't going away:
>> if we mix ld1/st1 and ldr/str, one of them will have to have a REV
>> attached as the default situation.
>
> Well. How to mix them? bitcast introduced by auto-vectorizer can't justify
> it. If it is true in auto-vectorizer, I think it would be a bug, but I doubt
> auto-vectorizer is generating this.
>
> Can you also give me a real case in C code, and show me the IR that we can't
> simply use ld1/st1 without rev16?

In what way? When mixed with ldr? It's basically the branching example
I gave earlier, in IR form:

float32_t foo(int test, float32_t *__restrict arr, float32x4_t *
__restrict vec) {
  float res;
  if (test) {
    arr[0] += arr[0];
    arr[1] += arr[1];
    arr[2] += arr[2];
    arr[3] += arr[3];
    res = arr[0];
  } else {
    *vec += *vec;
    res = *(float *)vec;
  }

  return res;
}

This can get converted to:

define float @foo(i32 %test, float* noalias nocapture %arr, <4 x
float>* noalias nocapture %vec, float* nocapture readnone %rhs) #0 {
entry:
  %tobool = icmp eq i32 %test, 0
  br i1 %tobool, label %if.else, label %if.then
if.then:                                          ; preds = %entry
  %0 = bitcast float* %arr to <4 x float>*
  %1 = load <4 x float>* %0, align 4, !tbaa !1
  %2 = fadd <4 x float> %1, %1
  %3 = bitcast float* %arr to <4 x float>*
  store <4 x float> %2, <4 x float>* %3, align 4, !tbaa !1
  br label %if.end
if.else:                                          ; preds = %entry
  %4 = load <4 x float>* %vec, align 16, !tbaa !5
  %add12 = fadd <4 x float> %4, %4
  store <4 x float> %add12, <4 x float>* %vec, align 16, !tbaa !5
  br label %if.end
if.end:                                           ; preds = %if.else, %if.then
  %.sink = phi <4 x float> [ %2, %if.then ], [ %add12, %if.else ]
  %5 = extractelement <4 x float> %.sink, i32 0
  ret float %5
}

If one branch uses ldr and the other ld1, then one of those branches
will have to insert a rev32, or it's impossible to generate correct
code for the final block.

>> define i16 @foo(<4 x i16> %in) {
>>   %elt = extractelement <4 x i16> %in, i32 0
>>   ret i16 %elt
>> }
>>
>> In the ld1/st1 world, this would have to compile to:
>> foo:
>>     rev16 v0.4h, v0.4h
>
>
> I don't think rev16 is necessary here. As you mentioned we should keep an
> unique data layout in register. If you don't use rev after ld1, this rev
> would not be necessary at all.

But the AAPCS requires %in to have been passed as if it had been
loaded with ldr. The caller would have to perform a rev16 before the
call too, to be compliant, and the two revs cancel out.

>         a[i] = (int16x8_t){j, j+1, j+2, j+3, j+4, j+5, j+6, j+7};
>  [...]
>             if (a[i][j] != i*8+j) return FAIL;
>
> I think it's a valid one. With my solution, it's very simple, str will be
> used file3.c, while ldr will be used for file4.c, because they have the
> following LLVM IR generated respectively.
>
> store <8 x i16> %vecinit27, <8 x i16>* %arrayidx, align 16   // in LLVM IR
> of file3.c
>
> %0 = load <8 x i16>* %arrayidx, align 16  // in LLVM IR of file4.c
>
> We needn't rev16 here. Note that the byte ordering in the data file being
> transferred between programmer p1 and p2 will be different from the cases I
> gave by common.h/file1.c/file2.c.

The test would pass regardless of whether you applied a rev16 wherever
appropriate. But if you didn't, the in-file layout would be
inconsistent with the LLVM IR generated by Clang.

The main reason for the ld1/rev or ldr/rev is because we *can't* make
sure that every ld1 is paired with an st1 rather than an str, and we
have to generate correct code regardless.

> In the file transferred of scenario 1, the char ordering in disk should be
> like 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ..., 127. (let's ignore the ordering
> difference for inside each number for big-endian and little-endian, because
> we are not discussing about that). In the file transferred of scenario 2,
> the char order in disk should be like 7, 6, 5, 4, 3, 2, 1, 0, 15, 14, ... ,
> 127,..., 121, 120.

Not according to what Clang generates at present: it treats vec[j] as
starting from the lowest-addressed element too (as does the
initialisation you perform).

I believe the documentation is silent on that particular issue: it's a
C language extension and ACLE makes no mention of how it should
behave.

> Can you explain to me what instructions are going to generated with your
> solution?

I'd probably load <0, 1, 2, 3, 4, 5, 6, 7> into a vector outside the
loop (as discussed above):

movz w0, #7
ins v0.h[0], w0
movz w0, #6
ins v0.h[1], w0
[...]

Then the main loop body would be (with w0 some induction variable and
x1 the address):
   dup v1.8h, w0
   add v1.8h, v0.8h, v1.8h
   str q1, [x1], #16

On the other side, something like this might be generated if the inner
loop was unrolled (again, w0 is an induction variable, 8*i here):
    ldr q0, [x1], #16
    umov w2, v0.h[7] // From extractelement %vec, i32 0
    cmp w0, w2
    b.ne fail
    umov w2, v0.h[6]
    add w3, w0, #1
    cmp w2, w3
    b.ne fail

> I'm not sure if PPC is doing this way. Maybe you can prove me around this.

PPC works on the semantics of IR as it stands and as I'm describing
them. That's one of the things you seem to be proposing we change, and
that change will be opposed.

As for its individual instructions, it might not encounter these
issues if it doesn't have equivalents of the ld1/st1 instructions. I
don't know either way.

> But even if other targets is implementing things this way, it can't simply
> arguing it is a correct one. With my two-cents experience of software
> engineering, sometimes people may use a ugly solution to solve a simple
> problem.

Agreed. But personally I don't think this is one of those cases. Given
the inherent insanity of big-endian SIMD, I think the LLVM IR comes
out about as neatly as possible.

Cheers.

Tim.


