[PATCH] AARCH64_BE load/store rules fix for ARM ABI

Thu Mar 13 23:30:38 PDT 2014

Tim,

Glad to know we are in agreement now!

To make sure we are really on the same page, I still want to give some more
comments.

I think AAPCS64 may not be clear enough about what an "element aligned
short vector" is; currently a "short vector" is always defined as a
"total size aligned short vector". We should probably propose adding more
detail on this to AAPCS64.

> That's certainly an approach (one I favoured until recently). It would
> be more friendly to LLVM's expectations, but harder for the AAPCS. I
> believe for that to work bitcasts would have to become non-trivial,
> and be inserted at all function-call boundaries involving vectors. For
> example (illustrating function calls):
>
>     declare void @foo(<4 x i16>)
>     define void @bar(<4 x i16>* %addr) {
>       %vec = load <4 x i16>* %addr
>       call void @foo(<4 x i16> %vec)
>       ret void
>     }
>
> The AAPCS requires us to pass %vec *as if* it had been loaded by
> "ldr", so if we use "ld1 {v0.4h}" we need some kind of "rev"
> instruction to reformat it before the call. Otherwise a round-trip via
> GCC (for example) could produce incorrect results (say, if @foo simply
> stored the vector back to %addr).
>
>
For "rev" instructions, the ARMv8 ARM says "An application or device driver
might have to interface to memory-mapped peripheral registers or shared
memory structures that are not the same endianness as the internal data
structures.", so we would only need this instruction when little-endian and
big-endian data have to interact. If we only support a single endianness,
either big or little, we needn't use this instruction at all, because with
the hardware endianness support ldr/ld1 always behave correctly for either
endianness.
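
To make the quoted ARM ARM scenario concrete, here is a minimal C sketch of
the one case where a byte reversal is needed: a big-endian core reading a
32-bit value that a peripheral laid out in little-endian order. The function
names are hypothetical, and the byte array only stands in for a real
memory-mapped register.

```c
#include <stdint.h>

/* Reverse the byte order of a 32-bit value, which is what "rev w0, w0"
   computes. */
uint32_t byte_reverse32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* What a big-endian 32-bit load would see: bytes[0] is most significant. */
uint32_t read_be32(const uint8_t *bytes) {
    return ((uint32_t)bytes[0] << 24) | ((uint32_t)bytes[1] << 16) |
           ((uint32_t)bytes[2] << 8) | (uint32_t)bytes[3];
}

/* The value the little-endian producer meant: bytes[0] is least
   significant. */
uint32_t read_le32(const uint8_t *bytes) {
    return ((uint32_t)bytes[3] << 24) | ((uint32_t)bytes[2] << 16) |
           ((uint32_t)bytes[1] << 8) | (uint32_t)bytes[0];
}
```

With the bytes 78 56 34 12 in memory, byte_reverse32(read_be32(p)) gives the
same value as read_le32(p), so the reversal only appears when the two sides
disagree on endianness.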

> Similarly, an isolated bitcast is non-trivial:
>
>     define void @foo(<4 x i16>* %in, <8 x i8>* %out) {
>       %shortvec = load <4 x i16>* %in
>       %charvec = bitcast <4 x i16> %shortvec to <8 x i8>
>       store <8 x i8> %charvec, <8 x i8>* %out
>       ret void
>     }
>
> Bitcast is specified in the LangRef to be equivalent to storing one
> type and loading the other, so this fragment is (by definition of
> bitcast) equivalent to:
>
>     define void @foo(<4 x i16>* %in, <8 x i8>* %out) {
>       %tmp = alloca <4 x i16>
>       %shortvec = load <4 x i16>* %in
>       store <4 x i16> %shortvec, <4 x i16>* %tmp
>       %tmp.char = bitcast <4 x i16>* %tmp to <8 x i8>*
>       %charvec = load <8  x i8>* %tmp.char
>       store <8 x i8> %charvec, <8 x i8>* %out
>       ret void
>     }
>
> Under the ld1/st1 scheme this could clearly generate code like:
>      [...]
>      ld1 {v0.4h}, [x0]
>      st1 {v0.4h}, [sp]
>      ld1 {v0.8b}, [sp]
>      st1 {v0.8b}, [x1]
>      [...]
>
> If you want to go back to the original bitcast for efficiency
> (skipping that middle st1/ld1 pair), you'll see that the bitcast has
> to be a non-trivial operation (some kind of "rev" again, I believe).
>
>
For all the cases given above, alignment isn't really specified, so LLVM
should just follow the default layout, that is, total size alignment.
Therefore we should always generate ldr/str for all of them. For the
bitcast, I don't think we should generate any instruction. As long as
ldr/str and ld1/st1 are only used in matching pairs, we shouldn't have any
issue at all. But "rev" should still be useful for the scenario described
in the ARMv8 ARM.
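
The pairing argument can be checked with a small C model. This is only a
sketch under my own assumptions: a D register is modelled as a uint64_t with
lane 0 in the least significant bits, the big-endian behaviour of ldr/str
and ld1/st1 is written out by hand as I understand it, and all function
names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified model of a 64-bit D register: lane 0 of any vector view
   sits in the least significant bits. */
typedef uint64_t dreg;

/* Big-endian "ldr d0, [x0]": one 64-bit access, mem[0] is the most
   significant byte. */
dreg ldr_d_be(const uint8_t *mem) {
    dreg r = 0;
    for (int i = 0; i < 8; i++)
        r = (r << 8) | mem[i];
    return r;
}

/* Big-endian "str d0, [x0]": exact inverse of ldr_d_be. */
void str_d_be(dreg r, uint8_t *mem) {
    for (int i = 7; i >= 0; i--) {
        mem[i] = (uint8_t)(r & 0xff);
        r >>= 8;
    }
}

/* Big-endian "ld1 {v0.4h}, [x0]": four 16-bit accesses; the element at
   the lowest address lands in lane 0. */
dreg ld1_4h_be(const uint8_t *mem) {
    dreg r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint64_t elt = ((uint64_t)mem[2 * lane] << 8) | mem[2 * lane + 1];
        r |= elt << (16 * lane);
    }
    return r;
}

/* Big-endian "st1 {v0.4h}, [x0]": exact inverse of ld1_4h_be. */
void st1_4h_be(dreg r, uint8_t *mem) {
    for (int lane = 0; lane < 4; lane++) {
        uint16_t elt = (uint16_t)(r >> (16 * lane));
        mem[2 * lane] = (uint8_t)(elt >> 8);
        mem[2 * lane + 1] = (uint8_t)(elt & 0xff);
    }
}

/* Load with `ld`, store with `st`, and report whether memory is unchanged. */
int round_trip_ok(dreg (*ld)(const uint8_t *), void (*st)(dreg, uint8_t *)) {
    const uint8_t in[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    uint8_t out[8];
    st(ld(in), out);
    return memcmp(in, out, sizeof in) == 0;
}

/* Matching pairs preserve memory; mixed pairs reorder the elements. */
void check_pairing(void) {
    assert(round_trip_ok(ldr_d_be, str_d_be));
    assert(round_trip_ok(ld1_4h_be, st1_4h_be));
    assert(!round_trip_ok(ld1_4h_be, str_d_be));
    assert(!round_trip_ok(ldr_d_be, st1_4h_be));
}
```

In this model a value loaded by ld1 {v0.4h} and stored by str d0 comes back
with its four halfwords in reverse order, which is the only situation where
something like "rev" would be needed.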

> It can be made to work in the backend (and would isolate the
> big-endian changes more, which is why I preferred it earlier), but I
> suspect it will produce worse code on the whole.
>
> > As Albrecht mentioned, maybe the solution is "The frontend must give the
> > backend a totally different type for short vectors". To some extent, I
> > agree with his point.
>
> I don't think there's any need for the backend to have dual types. All
> programmer-visible semantics can be represented with just one,
> provided the front-end is aware of the difference. But until such
> types come along in ACLE (hopefully never), we probably needn't
> discuss it.
>
> >> If we decided to support strict alignment mode efficiently, we would
> >> probably want to emit an "ld1 {v0.8b}" (i.e. always use the .8b or
> >> .16b version), since that's got the same semantics as ldr.
> >
> > It is only true for little-endian, isn't it?
>
> I don't believe so. I think "ld1 {v0.8b}" is equivalent to "ldr d0" on
> both endians (and the equivalent 16b statement).
>
> > If we always use ldr/str for big-endian without caring about alignment,
> > 1) It would not work for non-strict mode, because the address might be
> > unaligned.
>
> Unaligned loads are already expanded to support strict mode. Try compiling
> this:
>
>     define <4 x i16> @foo(<4 x i16>* %addr) {
>       %val = load < 4 x i16>* %addr, align 2
>       ret <4 x i16> %val
>     }
>
> As part of the optimisation to make that more efficient, the
> programmer would have to be wary of endian issues. It could be done in
> either scheme, of course.
>

I would say that for this case, to support strict mode, we would have to
use ld1, even though the address might have been expanded to be 8-byte
aligned, because "align 2" implies the data comes from an array of elements,
and the other module passing data to this function should know the data
will be loaded with "align 2", so it uses st1. This should be guaranteed by
the interface between those two modules, i.e. the .h file in C, which
should be included by both modules; then the semantics are guaranteed.
Whatever optimizations are applied, we should not change this interface, I
think.

>
> > 2) The data may come from a real array of elements rather than HVA, and the
> > semantic of using ldr/str is conflict with the end-user's definition,
>
> I don't believe so. Clang will generate array accesses as scalar
> operations. The vectorizer may transform them, but only into
> well-specified LLVM IR. How we implement that is our choice, and we
> can use either ld1 or ldr. Do you have a counter-example?
>
Yes, we can apply optimizations, but we should not change the semantic
interface across functions. My example is a .h file in which we define,

extern int16_t a[4];

In function f1 defined in file1 and function f2 defined in file2, we should
guarantee that ld1/st1 is used to load/store variable a. This would
guarantee that the system works for both little-endian and big-endian,
unless f1 and f2 use different endianness. In C, (a + 1 == &a[1]) is true,
and this should be guaranteed for big-endian as well.
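
A minimal C sketch of that guarantee follows. The initial values and the
helper names are hypothetical; the only thing taken from the header is the
array a itself. Element i always lives i * sizeof(int16_t) bytes from the
start of the array, whatever the endianness of the target.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in definition for the array declared in the shared .h file. */
int16_t a[4] = {10, 11, 12, 13};

/* Byte distance of element i from the start of the array. */
ptrdiff_t element_offset(size_t i) {
    return (const char *)&a[i] - (const char *)a;
}

/* Read element i through pointer arithmetic instead of indexing. */
int16_t element_via_pointer(size_t i) {
    const int16_t *p = a;
    return *(p + i);
}
```

Any module that includes the header can rely on element_offset(1) being
sizeof(int16_t) and on element_via_pointer(i) returning a[i], so whichever
load/store instructions the two modules pick must keep element 0 at the
lowest address.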

For a local variable within a single function, yes, you can do whatever
optimizations you want, and the alignment can also be changed.

Thanks,
-Jiangning