[PATCH] D118534: [X86] Introduce more common modern tunings into `generic`

Thu Feb 3 19:47:40 PST 2022

pengfei added inline comments.

================
Comment at: llvm/lib/Target/X86/X86.td:1222
                  TuningMacroFusion,
+                 TuningFastScalarFSQRT,
+                 TuningFast15ByteNOP,
----------------
spatel wrote:
> Fast scalar SQRT controls whether we produce a single-precision sqrtss instruction or a reciprocal estimate sequence of about 9 instructions (when allowed with fast-math).
> 
> Based on Agner's timing docs and the flag description, this should be set for Zen1 (sqrtss has latency 9-10), but it's not as obviously good for Zen2/3 because those have sqrtss latency of 14. The flag is set for Intel CPUs since SandyBridge, so that's sqrtss latency between 10 and 14.
> 
> I think this is ok to set, but if the assumption is that we're tuning for any mainstream CPU of the last N years, then shouldn't we add this flag to the later AMD models too for less surprising output? There's a possible side benefit that we will produce more accurate results too.
[[ https://uops.info/table.html?search=sqrtss%20&cb_lat=on&cb_tp=on&cb_NHM=on&cb_SNB=on&cb_BNL=on&cb_ZENp=on&cb_ZEN2=on&cb_ZEN3=on&cb_measurements=on&cb_avx=on&cb_sse=on | uops ]] shows that Zen1/2/3 all have the same maximum sqrtss latency of 14 cycles.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118534/new/

https://reviews.llvm.org/D118534