[PATCH] D133421: [AArch64] break non-temporal loads over 256 into 256-loads and a smaller load

Mon Sep 12 02:19:03 PDT 2022

zjaffal added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:17867-17869
+  SDValue ExtendedReminingLoad =
+      DAG.getNode(ISD::INSERT_SUBVECTOR, DL, NewVT,
+                  {UndefVector, RemainingLoad, InsertIdx});
----------------
t.p.northover wrote:
> Is this (and the implementation generally) big-endian correct? I don't know the answer here, I can never remember what's supposed to happen. But someone should definitely try it on an `aarch64_be` target and at least eyeball the assembly to check the offsets and so on.
This is the assembly generated for a big-endian target, using the following test case:
```
define <17 x float> @test_ldnp_v17f32(<17 x float>* %A) {
  %lv = load <17 x float>, <17 x float>* %A, align 8, !nontemporal !0
  ret <17 x float> %lv
}

!0 = !{i32 1}
```
```
test_ldnp_v17f32:                       // @test_ldnp_v17f32
	.cfi_startproc
// %bb.0:
	ldnp	q0, q1, [x0, #32]
	add	x9, x8, #48
	add	x10, x8, #32
	ldnp	q2, q3, [x0]
	add	x11, x8, #16
	ldr	s4, [x0, #64]
	st1	{ v2.4s }, [x8]
	st1	{ v1.4s }, [x9]
	st1	{ v0.4s }, [x10]
	st1	{ v3.4s }, [x11]
	str	s4, [x8, #64]
	ret
```
Comparing against the output on https://godbolt.org, I think more load instructions were generated before this patch broke the loads up.
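
As a sanity check on the offsets in the assembly above, here is a minimal standalone sketch of the split arithmetic. It assumes the lowering simply peels off 256-bit pieces from the start of the vector and leaves whatever remains as one smaller load. `LoadChunk` and `splitNonTemporalLoad` are hypothetical names for illustration only, not code from the patch.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper type (not from the patch): one piece of a split load.
struct LoadChunk {
  uint64_t ByteOffset; // offset from the base pointer, in bytes
  uint64_t SizeInBits; // width of this piece, in bits
};

// Sketch under the assumption stated above: emit as many 256-bit pieces as
// fit, then a single remainder piece covering the leftover bits.
std::vector<LoadChunk> splitNonTemporalLoad(uint64_t NumElts,
                                            uint64_t EltBits) {
  constexpr uint64_t ChunkBits = 256;
  const uint64_t TotalBits = NumElts * EltBits;

  std::vector<LoadChunk> Chunks;
  uint64_t OffsetBits = 0;
  // Full 256-bit pieces (each one would map to an ldnp of two q registers).
  for (; OffsetBits + ChunkBits <= TotalBits; OffsetBits += ChunkBits)
    Chunks.push_back({OffsetBits / 8, ChunkBits});
  // Remainder piece, if the total is not a multiple of 256 bits.
  if (OffsetBits < TotalBits)
    Chunks.push_back({OffsetBits / 8, TotalBits - OffsetBits});
  return Chunks;
}
```

For `<17 x float>` (544 bits), calling `splitNonTemporalLoad(17, 32)` gives 256-bit pieces at byte offsets 0 and 32 plus a 32-bit piece at byte offset 64, which lines up with the two `ldnp` instructions and the `ldr s4, [x0, #64]` above.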

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D133421/new/

https://reviews.llvm.org/D133421