[PATCH] AARCH64_BE load/store rules fix for ARM ABI

Thu Mar 13 02:52:07 PDT 2014

Hi Jiangning,

> I don't think this is a good example to illustrate the bad consequence. For
> this case, we would always have the correct result for little-endian, right?

All of these instructions are equivalent (up to alignment faults) on
little-endian. If the data is retrieved then the semantics of ldr and
all ld1 instructions are identical so none of it matters.

> And for big-endian, the semantics should be guaranteed by the programmer, so
> I would treat it as invalid code, where the programmer intends to mix/cast
> the alignment. If the programmer wants a 'stable' result for both
> 'inline' and 'non-inline', she/he should not explicitly write alignment like
> this.

I don't think we can unilaterally declare that LLVM IR to be invalid.
It has well-defined semantics (the result must be the lowest-addressed
element of @var) that we can't guarantee if we mix ld1 (except the .8b
and .16b variants) with ldr instructions.

> If this is generated by the auto-vectorizer, I would expect the action of
> storing data into that var to use st1, and then the 'var with alignment 8'
> would be transparent. I mean, as long as we always use ld1/st1 in pairs, we
> would not have an issue.

That's certainly an approach (one I favoured until recently). It would
be more friendly to LLVM's expectations, but harder for the AAPCS. I
believe for that to work bitcasts would have to become non-trivial,
and be inserted at all function-call boundaries involving vectors. For
example (illustrating function calls):

    declare void @foo(<4 x i16>)
    define void @bar(<4 x i16>* %addr) {
      %vec = load <4 x i16>* %addr
      call void @foo(<4 x i16> %vec)
      ret void
    }

The AAPCS requires us to pass %vec *as if* it had been loaded by
"ldr", so if we use "ld1 {v0.4h}" we need some kind of "rev"
instruction to reformat it before the call. Otherwise a round-trip via
GCC (for example) could produce incorrect results (say, if @foo simply
stored the vector back to %addr).
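
To make the lane-order mismatch concrete, here is a rough Python model
(not real codegen) of where a big-endian "ldr d0" and "ld1 {v0.4h}"
would put the same eight bytes. It assumes lane i is bits
[16*i+15 : 16*i] of the register, and it assumes "rev64 v0.4h" is the
"rev" in question. The helper names are hypothetical, made up for
illustration only.

```python
# Toy model of big-endian AArch64 64-bit vector loads. Helper names are
# hypothetical; lane i is taken to be bits [16*i+15 : 16*i] of the register.

def ldr_d_be(mem):
    """Model "ldr d0": one 64-bit big-endian load, then split into 4 lanes."""
    reg = int.from_bytes(bytes(mem[:8]), "big")
    return [(reg >> (16 * i)) & 0xFFFF for i in range(4)]

def ld1_4h_be(mem):
    """Model "ld1 {v0.4h}": element i (itself big-endian) goes to lane i."""
    return [int.from_bytes(bytes(mem[2 * i:2 * i + 2]), "big") for i in range(4)]

def rev64_4h(lanes):
    """Model "rev64 v0.4h": reverse the 16-bit lanes within the 64-bit register."""
    return lanes[::-1]

# <4 x i16> <1, 2, 3, 4> as stored in big-endian memory.
mem = [0x00, 0x01, 0x00, 0x02, 0x00, 0x03, 0x00, 0x04]

print("ldr d0     :", ldr_d_be(mem))             # [4, 3, 2, 1]
print("ld1 v0.4h  :", ld1_4h_be(mem))            # [1, 2, 3, 4]
print("ld1 + rev64:", rev64_4h(ld1_4h_be(mem)))  # [4, 3, 2, 1]
```

In this model the two loads disagree about which element sits in lane
0, and reversing the lanes after the ld1 gives the layout that a callee
expecting the "ldr" format would see.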

Similarly, an isolated bitcast is non-trivial:

    define void @foo(<4 x i16>* %in, <8 x i8>* %out) {
      %shortvec = load <4 x i16>* %in
      %charvec = bitcast <4 x i16> %shortvec to <8 x i8>
      store <8 x i8> %charvec, <8 x i8>* %out
      ret void
    }

Bitcast is specified in the LangRef to be equivalent to storing one
type and loading the other, so this fragment is (by definition of
bitcast) equivalent to:

    define void @foo(<4 x i16>* %in, <8 x i8>* %out) {
      %tmp = alloca <4 x i16>
      %shortvec = load <4 x i16>* %in
      store <4 x i16> %shortvec, <4 x i16>* %tmp
      %tmp.char = bitcast <4 x i16>* %tmp to <8 x i8>*
      %charvec = load <8 x i8>* %tmp.char
      store <8 x i8> %charvec, <8 x i8>* %out
      ret void
    }

Under the ld1/st1 scheme this could clearly generate code like:

    [...]
    ld1 {v0.4h}, [x0]
    st1 {v0.4h}, [sp]
    ld1 {v0.8b}, [sp]
    st1 {v0.8b}, [x1]
    [...]

If you want to go back to the original bitcast for efficiency
(skipping that middle st1/ld1 pair), you'll see that the bitcast has
to be a non-trivial operation (some kind of "rev" again, I believe).
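
As a sanity check on that claim, the following Python sketch models
the register as eight bytes (index k being bits [8*k+7 : 8*k]) and
compares what "ld1 {v0.4h}" and "ld1 {v0.8b}" leave in it on
big-endian. It assumes the fix-up would be "rev16 v0.8b"; that choice
and the helper names are my own illustration, not something a backend
is known to emit.

```python
# Toy model: the register is a list of 8 bytes, index k = bits [8*k+7 : 8*k].
# Helper names are hypothetical.

def ld1_4h_be(mem):
    """Model big-endian "ld1 {v0.4h}": halfword i of memory goes to lane i."""
    reg = []
    for i in range(4):
        hi, lo = mem[2 * i], mem[2 * i + 1]  # big-endian element: high byte first
        reg += [lo, hi]                      # low byte lands in the lower register byte
    return reg

def ld1_8b(mem):
    """Model "ld1 {v0.8b}": byte k of memory goes to lane k."""
    return list(mem[:8])

def rev16_8b(reg):
    """Model "rev16 v0.8b": swap the two bytes inside each 16-bit container."""
    out = []
    for i in range(0, 8, 2):
        out += [reg[i + 1], reg[i]]
    return out

mem = [0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88]

shortvec = ld1_4h_be(mem)  # register contents after ld1 {v0.4h}
charvec = ld1_8b(mem)      # register contents the st1/ld1 round trip would give

print([hex(b) for b in shortvec])
print([hex(b) for b in charvec])
print(rev16_8b(shortvec) == charvec)  # True
```

So in this model the in-register bitcast cannot be a no-op: the bytes
of each halfword have to be swapped to match what the store/load pair
would have produced.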

It can be made to work in the backend (and would isolate the
big-endian changes more, which is why I preferred it earlier), but I
suspect it will produce worse code on the whole.

> As Albrecht mentioned, maybe the solution is "The frontend must give the
> backend a totally different type for short vectors". To some extent, I
> agree with his point.

I don't think there's any need for the backend to have dual types. All
programmer-visible semantics can be represented with just one,
provided the front-end is aware of the difference. But until such
types come along in ACLE (hopefully never), we probably needn't
discuss it.

>> If we decided to support strict alignment mode efficiently, we would
>> probably want to emit an "ld1 {v0.8b}" (i.e. always use the .8b or
>> .16b version), since that's got the same semantics as ldr.
>
> It is only true for little-endian, isn't it?

I don't believe so. I think "ld1 {v0.8b}" is equivalent to "ldr d0" on
both endians (and the equivalent 16b statement).

> If we always use ldr/str for big-endian without caring about alignment,
> 1) It would not work for non-strict mode, because the address might be
> unaligned.

Unaligned loads are already expanded to support strict mode. Try
compiling this:

    define <4 x i16> @foo(<4 x i16>* %addr) {
      %val = load <4 x i16>* %addr, align 2
      ret <4 x i16> %val
    }

As part of the optimisation to make that more efficient, the
programmer would have to be wary of endian issues. It could be done in
either scheme, of course.

> 2) The data may come from a real array of elements rather than an HVA, and
> the semantics of using ldr/str conflict with the end-user's definition,

I don't believe so. Clang will generate array accesses as scalar
operations. The vectorizer may transform them, but only into
well-specified LLVM IR. How we implement that is our choice, and we
can use either ld1 or ldr. Do you have a counter-example?

Basically, I think we can pick whatever in-register format we want,
and still comply with both IR semantics and the AAPCS.

But we *must* be consistent about it everywhere. Saying some vectors
will get loaded with "ldr" and others with "ld1" (except .8b/.16b) is
a recipe for disaster.

Cheers.

Tim.