[PATCH] D96609: [X86][AVX] Truncate vectors with PACKSS/PACKUS on AVX2 targets

Pengfei Wang via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Feb 24 07:14:53 PST 2021


pengfei added inline comments.


================
Comment at: llvm/test/CodeGen/X86/vector-reduce-and-bool.ll:559
+; AVX2-NEXT:    vpblendw {{.*#+}} ymm1 = ymm1[0],ymm2[1,2,3],ymm1[4],ymm2[5,6,7],ymm1[8],ymm2[9,10,11],ymm1[12],ymm2[13,14,15]
+; AVX2-NEXT:    vpblendw {{.*#+}} ymm0 = ymm0[0],ymm2[1,2,3],ymm0[4],ymm2[5,6,7],ymm0[8],ymm2[9,10,11],ymm0[12],ymm2[13,14,15]
+; AVX2-NEXT:    vpackusdw %ymm1, %ymm0, %ymm0
----------------
RKSimon wrote:
> craig.topper wrote:
> > pengfei wrote:
> > > RKSimon wrote:
> > > > xbolva00 wrote:
> > > > > Worse?
> > > > We remove the lane-crossing shuffles, a pshufb (so no constant pool mask load) and a domain-crossing shufps. Some AVX2 targets won't care, but others will (e.g. znver1 will love losing the lane shuffles).
> > > So it means some targets get worse and some get better?
> > Aren't most lane-crossing shuffles on Intel 3 cycles?
> By 'won't care' I meant the diff shouldn't be a regression on any target, but some targets would benefit more than others - in particular by getting rid of the vperm2f128, which has gotten slower since Haswell on Intel targets (and faster since Zen2 on AMD targets).
I compared the [[ https://uops.info/table.html?search=vperm2f128&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_SNB=on&cb_HSW=on&cb_SKL=on&cb_ZEN%2B=on&cb_ZEN2=on&cb_measurements=on&cb_iaca30=on&cb_doc=on&cb_base=on&cb_avx=on&cb_avx2=on | uops of vperm2f128 ]]: Haswell and later Intel targets, as well as AMD Zen2, have the same performance (Lat = 3, Uops = 1). Zen1 has a big gap since Lat = 4, Uops = 8.
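
[Editor's note] For readers following the discussion, below is a minimal AVX2 intrinsics sketch of the PACK-based truncation strategy being compared. It is purely illustrative and is not taken from the patch: the helper name is hypothetical, and it assumes the ymm2 register in the vpblendw lines of the test diff is a zero vector (so the blend acts as a mask keeping the low 16 bits of each dword).

#include <immintrin.h>
#include <stdio.h>

/* Hypothetical helper (name and structure are illustrative, not from the
 * patch): truncate eight 32-bit elements to their low 16 bits using the
 * AVX2 PACKUSDW approach discussed above.  The AND plays the role of the
 * vpblendw-with-zero in the test diff, the pack is the vpackusdw, and the
 * vpermq fixes the per-128-bit-lane interleaving that PACK instructions
 * leave behind. */
static inline __m128i trunc_v8i32_to_v8i16(__m256i v) {
    const __m256i lo16 = _mm256_set1_epi32(0xFFFF);        /* keep low 16 bits per dword */
    __m256i masked = _mm256_and_si256(v, lo16);            /* so the unsigned pack never saturates */
    __m256i packed = _mm256_packus_epi32(masked, masked);  /* vpackusdw, operates per 128-bit lane */
    packed = _mm256_permute4x64_epi64(packed, 0x08);       /* vpermq: select qwords 0 and 2 */
    return _mm256_castsi256_si128(packed);                 /* low 128 bits hold the 8 x i16 result */
}

int main(void) {
    __m256i in = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m128i out = trunc_v8i32_to_v8i16(in);
    short res[8];
    _mm_storeu_si128((__m128i *)res, out);
    for (int i = 0; i < 8; ++i)
        printf("%d ", (int)res[i]);   /* expect: 0 1 2 3 4 5 6 7 */
    printf("\n");
    return 0;
}

(Compile with -mavx2.) Note that in a reduction like vector-reduce-and-bool.ll the final element order does not matter, so the codegen in the diff can omit the lane-fixing permute entirely, which is part of why the PACK form is cheaper there.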


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D96609/new/

https://reviews.llvm.org/D96609



More information about the llvm-commits mailing list