[PATCH] D133421: [AArch64] break non-temporal loads over 256 into 256-loads and a smaller load

Tim Northover via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Sep 12 03:08:50 PDT 2022


t.p.northover added inline comments.


================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:17867-17869
+  SDValue ExtendedRemainingLoad =
+      DAG.getNode(ISD::INSERT_SUBVECTOR, DL, NewVT,
+                  {UndefVector, RemainingLoad, InsertIdx});
----------------
zjaffal wrote:
> t.p.northover wrote:
> > Is this (and the implementation generally) big-endian correct? I don't know the answer here, I can never remember what's supposed to happen. But someone should definitely try it on an `aarch64_be` target and at least eyeball the assembly to check the offsets and so on.
> This is the generated assembly for big-endian
> using the following test case
> ```
> define <17 x float> @test_ldnp_v17f32(<17 x float>* %A) {
> >   %lv = load <17 x float>, <17 x float>* %A, align 8, !nontemporal !0
>   ret <17 x float> %lv
> }
> 
> !0 = !{i32 1}
> ```
> ```
> test_ldnp_v17f32:                       // @test_ldnp_v17f32
> 	.cfi_startproc
> // %bb.0:
> 	ldnp	q0, q1, [x0, #32]
> 	add	x9, x8, #48
> 	add	x10, x8, #32
> 	ldnp	q2, q3, [x0]
> 	add	x11, x8, #16
> 	ldr	s4, [x0, #64]
> 	st1	{ v2.4s }, [x8]
> 	st1	{ v1.4s }, [x9]
> 	st1	{ v0.4s }, [x10]
> 	st1	{ v3.4s }, [x11]
> 	str	s4, [x8, #64]
> 	ret
> ```
> Looking at https://godbolt.org, I think there are more load instructions before they are broken up.
It's coming back to me now, and I think that output is incorrect.

Basically, the `ld1` and `st1` instructions load each element big-endian but preserve the element order (i.e. `v0[0]` is still loaded from `&ptr[0]`, not `&ptr[3]`). On the other hand, `ldr` and `ldnp` load the whole vector register in big-endian order, so `v0[0]` comes from `&ptr[3]`.

This output mixes the two so it'll rearrange elements.
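To make the mismatch concrete, here is a small Python model of the two byte orderings (a sketch of the semantics described above, not of the actual AArch64 implementation): a per-element big-endian load like `ld1` keeps the lane order, while a whole-register big-endian load like `ldr`/`ldnp` ends up with the lane order reversed.

```python
import struct

# 16 bytes of big-endian memory: four 32-bit elements 0, 1, 2, 3
mem = b"".join(struct.pack(">I", i) for i in range(4))

def ld1_4s(buf):
    # per-element big-endian load: lane order is preserved
    return [struct.unpack(">I", buf[i:i + 4])[0] for i in range(0, 16, 4)]

def ldr_q(buf):
    # whole-register big-endian load: the first memory byte becomes the
    # most significant byte of the register, so lane 0 (the least
    # significant lane) holds the *last* element in memory
    val = int.from_bytes(buf, "big")
    return [(val >> (32 * lane)) & 0xFFFFFFFF for lane in range(4)]

print(ld1_4s(mem))  # [0, 1, 2, 3]
print(ldr_q(mem))   # [3, 2, 1, 0]
```

Mixing the two conventions in one lowering, as in the assembly above, therefore permutes the elements.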

But I think the real bug is in https://reviews.llvm.org/rG7155ed4289 that you committed earlier, or at least that needs fixing before any bug here is obvious.

A proper fix would be to put an `AArch64ISD::REV<n>` (where `<n>` is the element size) after each `ldnp` to restore what an `ld1` would have done, though I'd be OK with disabling the optimization for big-endian instead. I care if we break correctness; I'm less bothered if big-endian isn't as fast as little-endian.
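In the same toy Python model (an illustration of the idea, not the actual DAG code), the lane-reversing effect of the suggested `REV` after the wide load is just a reversal of the lane list, which restores the `ld1` ordering:

```python
import struct

# four 32-bit elements 0, 1, 2, 3 stored big-endian in memory
mem = b"".join(struct.pack(">I", i) for i in range(4))

# what a whole-register big-endian ldnp/ldr produces: lanes reversed
wide = int.from_bytes(mem, "big")
lanes = [(wide >> (32 * i)) & 0xFFFFFFFF for i in range(4)]
print(lanes)         # [3, 2, 1, 0]

# a lane-reversing REV after the load undoes the reversal,
# matching what an ld1 would have produced
fixed = lanes[::-1]
print(fixed)         # [0, 1, 2, 3]
```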


================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:19934
+        performTBISimplification(N->getOperand(1), DCI, DAG))
       return SDValue(N, 0);
+    return performLOADCombine(N, DCI, DAG, Subtarget);
----------------
zjaffal wrote:
> t.p.northover wrote:
> > This looks like it takes precedence over `performLOADCombine` and disables `ldnp` formation if the TBI feature is enabled.
> Removing the checks and calling `performLOADCombine` caused many tests to fail. Maybe we can check whether the load is non-temporal here?
The check is needed because otherwise we assume TBI even when we shouldn't, but it probably shouldn't cause an early return. Both optimizations should get the opportunity to run if they might be applicable.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D133421/new/

https://reviews.llvm.org/D133421


