[PATCH] D101924: [X86] Improve costmodel for scalar byte swaps

Roman Lebedev via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed May 5 14:15:18 PDT 2021


lebedev.ri added inline comments.


================
Comment at: llvm/lib/Target/X86/X86TargetTransformInfo.cpp:2927
+      if (const Instruction *II = ICA.getInst()) {
+        if (II->hasOneUse() && isa<StoreInst>(II->user_back()))
+          return TTI::TCC_Free;
----------------
craig.topper wrote:
> At least on Intel Core CPUs, MOVBE isn't optimized: it's a load or store plus a bswap operation. Maybe it's optimized on Atom/Silvermont/Goldmont? It was added to that line of CPUs first, possibly because those CPUs have been used in networking equipment.
Looking at actual AMD Zen3 measurements, `movbe r<-m` is `1` uop, while `movbe m<-r` is `2` uops,
which is actually a regression from Zen2/Zen1, as per https://www.agner.org/optimize/instruction_tables.pdf.

As per that table, both are really slow on Haswell/Broadwell/Skylake*,
but fast on Silvermont/Goldmont*/KNL.

So I think we could mark `movbe r<-m` as free on AMD CPUs, at least.
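For context, the pattern the snippet above checks for can be illustrated with a small C sketch (function names are hypothetical; the `movbe` comments assume compiling for x86 with MOVBE available, e.g. `-mmovbe`, where the byte swap folds into the memory access):

```c
#include <stdint.h>

/* When the byte-swapped value's only use is a store (or its only input is
   a load), a MOVBE-capable target can emit a single movbe instead of a
   separate bswap + mov, which is why the cost model may treat the bswap
   itself as free in those cases. */
void store_be32(uint32_t *p, uint32_t v) {
    *p = __builtin_bswap32(v);    /* may lower to: movbe [mem], reg */
}

uint32_t load_be32(const uint32_t *p) {
    return __builtin_bswap32(*p); /* may lower to: movbe reg, [mem] */
}
```

Per the uop counts quoted above, on Zen3 the load form (`movbe r<-m`) is the cheaper of the two.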



Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D101924/new/

https://reviews.llvm.org/D101924


