[PATCH] AARCH64_BE load/store rules fix for ARM ABI

Thu Mar 6 19:15:49 PST 2014

  Hi Tim,

  > I'm not sure I understand your comments. Do you mean ARM is intending to add C level types to ACLE & AAPCS that *will* behave as if loaded and stored with ld1/st1 soon?

  No, I didn't mean that. We should follow AAPCS64. AAPCS64 says,

  "Elements in a short vector are numbered such that the lowest numbered element (element 0) occupies the lowest numbered bit (bit zero) in the vector and successive elements take on progressively increasing bit positions in the vector. When a short vector transferred between registers and memory it is treated as an opaque object. That is a short vector is stored in memory as if it were stored with a single STR of the entire register; a short vector is loaded from memory using the corresponding LDR instruction. On a little-endian system this means that element 0 will always contain the lowest addressed element of a short vector; on a big-endian system element 0 will contain the highest-addressed element of a short vector."

  All these statements are talking about the short vector with total size alignment. However, for the LLVM IR, we have the case of element size alignment short vector, which should not simply fall into this category. It should be treated as an array of elements, and using ld1/st1 to completely match this semantic, and we don't have semantic difference for ld1/st1 between LE and BE except the data layout inside element.

  For total size aligned short vector, ld1/st1 have the same semantics as ldr/str on little-endian. We prefer to use ldr/str because they have better addressing modes than ld1/st1. On big-endian, we should only use ldr/str to meet semantic requirement.

  Thanks,
  -Jiangning


================
Comment at: lib/Target/AArch64/AArch64InstrNEON.td:3362
@@ +3361,3 @@
+//
+// Obviously the two layouts differ by reversing the elements so they can't be 
+// mixed without explicit element-swap operations in BE.
----------------
Tim Northover wrote:
> Jiangning Liu wrote:
> > How do we come across a case mixing the uses of LDR and LD1? If it's type casting, end-user should guarantee the correctness by program logic itself rather than by compiler.
> The compiler was mixing them at will previously (e.g. storeRegToStackSlot uses str, but this address could escape and be used in a normal load which we'd use ld1 for). I believe Albrecht's comment is designed to warn against this, and I support it.
We should avoid mixing the use of ld1 and ldr. storeRegToStackSlot should decide to use ld1 or ldr by checking the alignment. If it is not an element alignment, but a whole short vector alignment, we should use ldr, while for other cases, we should use ld1. This way, we should be able to always keep endianess correctness and we should not have mixing issue.

================
Comment at: lib/Target/AArch64/AArch64InstrNEON.td:3417
@@ +3416,3 @@
+// will be inconsistent.
+// The only allowed use of LD1 is in initializations using explicit intrinsics to do 
+// the element-swaps.
----------------
Tim Northover wrote:
> Jiangning Liu wrote:
> > This is not the only case. Auto-vectorizer could generate element alignment short vector ld/st. For example, middle-end could generate
> > 
> > store <4 x i16> %val, <4 x i16>* %ptr, align 2
> > 
> > We should generate instruction like st1 v0.4h, [x0].
> > 
> > Unfortunately, we can't generate this instruction yet with trunk. We will get it fixed as soon as possible.
> I don't believe we're forced to generate either and there are arguments in favour of both, but being consistent is *very* important. As Albrecht said, we can't mix the two kinds of load/store.
> 
> I agree that using ld1/st1 exclusively would make LLVM's semantics easier to get right, but it would make getting the AAPCS right harder (bitcasts would become non-trivial operations and be needed at all potentially ABI-visible boundaries).
> 
> I suspect (but don't know) that the ldr/str route is capable of producing better code on average.
I don't think we're forced to generate either as well, but we should keep semantic correctness by choosing either in terms of alignment.

Actually we don't really violate AAPCS at all. AAPCS says, "A short vector has a base type that is the fundamental integral or floating-point type from which it is composed, but its alignment is always the same as its total size.".

If the memory address is not total size aligned, it is not a "short vector" definition in AAPCS. It should be treated as an array, which is usually generated from auto vectorizer, so we prefer to generate ld1/st1 for it. 

================
Comment at: lib/Target/AArch64/AArch64InstrNEON.td:3483
@@ -3438,2 +3482,3 @@
 
-defm ST2 : STVList_BHSD<0b1000, "VPair", "st2">;
+// Multiple elements would be reversed in BE.
+let Predicates = [IsLE] in {
----------------
Tim Northover wrote:
> Jiangning Liu wrote:
> > ST1/ST2/ST3/ST4 essentially use aggregate short vector type like,
> > 
> > typedef struct int16x4x3_t {
> >   int16x4_t val[3];
> > } int16x4x3_t;
> > 
> > which is defined in arm_neon.h.
> > 
> > With this data type, LE/BE should only make difference for the layout inside element int16. The data layout among different elements should be always the same.
> I believe this is incorrect for the simple instructions. "ld1 {v0.4h, v1.4h}, [x0]" is equivalent to "ld1 {v0.4h}, [x0]; ld1 {v1.4h}, [x0, #8]" and different from "ldr d0, [x0]; ldr d1, [x0, #8]" on big-endian systems.
I don't think I meant ld1 and ldr have the same sementic between LE and BE systems. I agree with your statement. What I meant is ld1/st1 should always have the same semantic between LE and BE systems except the data layout inside the element. We should choose ldr or ld1 in terms of alignment on IR. If it is total size aligned access, we use ldr, and otherwise we use ld1.


http://llvm-reviews.chandlerc.com/D2884