[PATCH] D132559: [AArch64] Add support for 128-bit non temporal loads.

Mon Aug 29 01:09:31 PDT 2022

dmgreen added a comment.

I don't think they ever get ignored by the hardware, but it is only a hint. I would be surprised if the extra instructions are better than a normal load, but it will depend on how much pressure there is on the cache at the time. I don't think there is a lot in it either way though, and if non-temporal loads are being used the CPU is more likely to have high memory usage with lower computation, meaning the extra instructions are less of an issue.

================
Comment at: llvm/test/CodeGen/AArch64/nontemporal-load.ll:213
 ; CHECK:       ; %bb.0:
-; CHECK-NEXT:    ldp q1, q2, [x0, #32]
-; CHECK-NEXT:    ldp q3, q4, [x0]
-; CHECK-NEXT:    ldr s0, [x0, #64]
-; CHECK-NEXT:    stp q3, q4, [x8]
-; CHECK-NEXT:    stp q1, q2, [x8, #32]
-; CHECK-NEXT:    str s0, [x8, #64]
+; CHECK-NEXT:    ldnp d1, d0, [x0, #16]
+; CHECK-NEXT:    ldnp d3, d2, [x0, #48]
----------------
fhahn wrote:
> I guess we would also use the `LDNQ` variant here. I assume the reason we don't is because `<17 x float>` will get broken down to `<4 x float>` pieces during legalization.
> 
> @dmgreen do you by any chance have any ideas on where to best improve this?
I'm not sure, I'm afraid. I believe that loads get split into legal parts (not in half like other operations).

If we don't expect 17x non-temporal loads very often, it may not be too important to fix. The loop vectorizer will always pick powers of 2, after all.
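
To illustrate the "split to legal parts" behaviour, here is a rough sketch of how a `<17 x float>` load could decompose into `<4 x float>` pieces plus a remainder. This is not the actual legalizer code: `Part` and `splitIntoLegalParts` are hypothetical names, and the sketch assumes the legal element count is a power of 2 and that the remainder is covered by smaller power-of-2 pieces.

```cpp
#include <cassert>
#include <vector>

// Hypothetical description of one piece of a split load.
struct Part {
  unsigned ByteOffset; // offset of this piece from the base pointer
  unsigned NumElts;    // number of elements loaded by this piece
};

// Sketch only: peel off full legal-width pieces first, then cover the
// remainder with successively smaller power-of-2 pieces.
// Assumes LegalElts is a power of 2 (an assumption of this example).
std::vector<Part> splitIntoLegalParts(unsigned NumElts, unsigned LegalElts,
                                      unsigned EltBytes) {
  assert(LegalElts > 0 && EltBytes > 0 && "sizes must be non-zero");
  std::vector<Part> Parts;
  unsigned Done = 0;
  // Full legal-width pieces, e.g. <4 x float> held in a q register.
  while (NumElts - Done >= LegalElts) {
    Parts.push_back({Done * EltBytes, LegalElts});
    Done += LegalElts;
  }
  // Remainder: try half-width, quarter-width, ... down to a single element.
  for (unsigned Piece = LegalElts / 2; Piece >= 1; Piece /= 2) {
    if (NumElts - Done >= Piece) {
      Parts.push_back({Done * EltBytes, Piece});
      Done += Piece;
    }
  }
  return Parts;
}
```

Calling `splitIntoLegalParts(17, 4, 4)` yields four 4-element pieces at byte offsets 0, 16, 32 and 48 and a single-element piece at offset 64, which lines up with the two `ldp q` pairs and the `ldr s0, [x0, #64]` in the old CHECK lines above.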

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D132559/new/

https://reviews.llvm.org/D132559