[PATCH] AARCH64_BE load/store rules fix for ARM ABI

Mon Mar 3 02:41:14 PST 2014

  Hmm, uploading a partially updated patch mostly hides previous comments (unexpected for me).
  Seems to have killed Tim's and my discussion on the inline-comments for the 1st diff.

  So here's the open question again:

  Is it ok to guard instruction defs with IsLE to hint that they might not be safe to use in BE ?
  Or add a comment instead ?
  If somebody adds a valid pattern (e.g. intrinsic) that predicate can be removed,
  If somebody adds a pattern valid only for either BE or LE, the predicate must go there.

  Fixes included in the next revison (today)


================
Comment at: lib/Target/AArch64/AArch64InstrNEON.td:3888
@@ -3866,2 +3887,3 @@
 
+let Predicates = [IsLE] in {
 // Load single 1-element structure to all lanes of 1 register
----------------
James Molloy wrote:
> Shouldn't a splatting LD1 still work in BE mode?
At most the ones that only read one element and duplicate that.

The multi-element reads will have unexpected order (that struct was STRed!), so can only be used via intrinsics to read from arrays.

================
Comment at: lib/Target/AArch64/AArch64InstrNEON.td:3908
@@ -3883,1 +3907,3 @@
+let Predicates = [IsLE] in {
+// element swap on bytes == byte swap
 def : LD1R_pattern<v8i8, i32, extloadi8, LD1R_8B>;
----------------
James Molloy wrote:
> This comment is cryptic. Also, why should we care about byte swapping here, we're under the IsLE predicate.
That's the reason for the IsLE - because it doesn't work for BE.
Pulled the comments out.

Now I think it's still wrong, as the elements would be read in ascending address order "array-like", while they have been stored reversed by STR.


================
Comment at: lib/Target/AArch64/AArch64InstrNEON.td:3931
@@ -3904,3 +3930,3 @@
 
-def : LD1R_pattern_v1<v1i64, i64, load, LD1R_1D>;
-def : LD1R_pattern_v1<v1f64, f64, load, LD1R_1D>;
+let Predicates = [IsLE] in {
+  def : LD1R_pattern_v1<v1i64, i64, load, LD1R_1D>;
----------------
James Molloy wrote:
> ld1.64 should be fine too, right? Because ld1.64 acts the same as LDR (byte swapping on a 64-bit value).
yes

================
Comment at: lib/Target/AArch64/AArch64InstrInfo.td:4879
@@ +4878,3 @@
+  // float is always there
+  // shouldn't this be guarded by HasFPARMv8 ???
+  defm : ls_neutral_pats<LOAD, STORE, Base, Offset, address, f16>;
----------------
James Molloy wrote:
> Please add a FIXME and remove trailing "??"
I was waiting for a reply from Tim -> remove the comment and do the right thing whatever that is.

I since figured, that if the predicates exist I'd better use them - ignoring the non-consistent picture so far.

Better a fail to match in the presence of upstream errors, than blowing up in the face of a customer's customer.

================
Comment at: lib/Target/AArch64/AArch64InstrInfo.td:4878
@@ +4877,3 @@
+                         dag Base, dag Offset, dag address> {
+  // float is always there
+  // shouldn't this be guarded by HasFPARMv8 ???
----------------
James Molloy wrote:
> What does this mean? is it a paste from the comment below? is fp16 always available in a64? Also capitalization, as Tim said.
The fp* rules have already been there before I added the vector types. They were unguarded. I kept it that way until we clarify the necessity of these guards.
It's not unimaginable that other FP hardware might emerge besides neon vfp.
I don't know and even manufacturers don't see the future market demands, yet.

================
Comment at: lib/Target/AArch64/AArch64InstrInfo.td:4859
@@ +4858,3 @@
+// NEON-BE: allow all neon vectors as well, since ld1/st1 must be disabled
+// LD1 & ST1 are not ABI conforming in big endian: wrong arg memory layout
+// http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf
----------------
James Molloy wrote:
> As the AAPCS is also not too clear on this, could you please add in the comment the *reason* ld1/st1 can't be used? More specifically than "wrong arg memory layout", it is because the LD1 performs lane-by-lane byte swapping, and LDR swaps the entire D/Q register.
Hmmm, somebody should tell ARM, so they can clarify that section ;-)

I'd written that to the beginning of the AArch64InstrNEON.td - but left it out again.
I'd still prefer to put that next to the NEON load-store instructions, where it belongs.
If you can bear with a larger paragraph of comments ...
DONE

================
Comment at: lib/Target/AArch64/AArch64InstrNEON.td:3996
@@ -3966,3 +3995,3 @@
 
-// Load single 1-element structure to one lane of 1 register.
-defm LD1LN : LDN_Lane_BHSD<0b0, 0b0, "VOne", "ld1">;
+let Predicates = [IsLE] in {
+  // Load single 1-element structure to one lane of 1 register.
----------------
James Molloy wrote:
> Will 1-element to 1-lane also work in BE mode?
added the following comment to the pattern and removed the predicate.

// This will not work as intended in BE mode, if the matcher generates it to
// load a vector to a lane. (STR q0 stored the elements swapped)
// Must always use an intrinsic, so the user knows it's loading from an array 
// layout.


================
Comment at: lib/Target/AArch64/AArch64InstrNEON.td:3461
@@ -3445,1 +3460,3 @@
+
 // Store multiple 1-element structures from N consecutive registers (N = 2,3,4)
+  defm ST1x2 : STVList_BHSD<0b1010, "VPair", "st1">;
----------------
James Molloy wrote:
> Failed to reindent this line?
Done.

================
Comment at: lib/Target/AArch64/AArch64InstrNEON.td:3390
@@ +3389,3 @@
+
+// this should work in BE - single operand: no element swaps
+let Predicates = [IsLE] in {
----------------
James Molloy wrote:
> The comment says this should work in BE mode... then the predicate stops it working in BE mode. Why?
first guess at step 5 in our roadmap  - and being conservative.

Anyway - fixed, now that I had the time to think through it all.

================
Comment at: lib/Target/AArch64/AArch64InstrNEON.td:3383
@@ +3382,3 @@
+
+// LD1 disallowed in BE, when LDR and STR are used exclusively to save on 12bit offset adds.
+// reason: LDR/STR use different memory/register layout (no element swaps).
----------------
James Molloy wrote:
> This line of the comment doesn't make sense. LD1 is disallowed, but it has nothing to do with 12bit offset adds! You give the reasoning below.
> 
> Also in LLVM we have a tendency to write comments in a fairly prose-like form to make it easier to read. For example, instead of "LD1 disallowed in BE", "LD1 is disallowed in BE mode". Also instead of "reason: ", "This is because..."
> 
> 
right - was mixing the BE & LE issues, there
fixed by adding the whole ugly story, right at the start of the store section.
Wish we could separate the stores out from that 10.000 lines file. :-(

ad Comments: - that's probably why there were so many helpful comments in that file :-(
At many places I'd have preferred a short comment over none at all.

================
Comment at: lib/Target/AArch64/AArch64InstrInfo.td:4897
@@ +4896,3 @@
+    defm : ls_neutral_pats<LOAD, STORE, Base, Offset, address, v1i32>;
+//    defm : ls_neutral_pats<LOAD, STORE, Base, Offset, address, v1f32>; does not exist - v1f64 DOES --  WHY ?
+  }
----------------
Hao Liu wrote:
> James Molloy wrote:
> > v1f64 is there because some instructions that act on f64's have both NEON and VFP variants, and in order that NEON intrinsics written by the user select the instruction that the user requested, there must be a way to distinguish between the two at the type level.
> > 
> > That is my understanding anyway - Hao or Jiangning will know more about this.
> Yes. There are some NEON instructions like "ld1, {Vt.1d}, [xn]" with v1f64/v1i64 types, which are treated as one element vector.
> As NEON only has 64-bit and 128-bit length vector types, there is no v1f32. f32/f64 are used as scalar types.
> As for v1i8/v1i16/v1i32, they are only used to represent scalar types not vector types. They are used to make difference with the integer types stored in GPR.
hmm I don't see the necessity for a separate nonterminal, there.
either 
a) Intrinsics are matched directly - so no need there.
b) As long as the semantic isn't different, the user couldn't care less.
(-> pseudo instruction that can be lowered later as fits)

Nonterminals are data-types for registers.
Separate nonterminals are for different in-register representations, e.g. they might be handy for "reversed" vectors, together with the chain rules that do the element-swaps.
They're really handy for high'low positioning of values in register.
But that all requires a PBQP matcher that finds the optimal coverage for the function's DAG.


http://llvm-reviews.chandlerc.com/D2884