[PATCH] D123163: [TLI] `TargetLowering::SimplifyDemandedVectorElts()`: narrowing bitcast: fill known zero elts from known src bits

Wed Apr 6 03:45:06 PDT 2022

lebedev.ri added inline comments.

================
Comment at: llvm/test/CodeGen/X86/slow-pmulld.ll:284
 ; AVX2-64-NEXT:    vpmovzxbd {{.*#+}} ymm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero
-; AVX2-64-NEXT:    vpbroadcastd {{.*#+}} ymm2 = [18778,18778,18778,18778,18778,18778,18778,18778]
+; AVX2-64-NEXT:    vmovdqa {{.*#+}} ymm2 = <18778,u,18778,u,18778,u,18778,u,18778,u,18778,u,18778,u,18778,u>
 ; AVX2-64-NEXT:    vpmaddwd %ymm2, %ymm0, %ymm0
----------------
RKSimon wrote:
> lebedev.ri wrote:
> > lebedev.ri wrote:
> > > With AVX1, we can only broadcast i32 load to XMM/YMM, and i64 to YMM,
> > > but with AVX2 we can broadcast i8/i16/i32 load to XMM/YMM.
> > > Is `lowerBuildVectorAsBroadcast()` intentionally not doing that,
> > > because such i8/i16 broadcasts are slow, or is that a bug?
> > ~~but with AVX2 we can broadcast i8/i16/i32 load to XMM/YMM.~~
> > but with AVX2 we can broadcast i8/i16/i32/i64 load to XMM/YMM.
> > 
> It's one of the many annoyances of lowering constant broadcasts that I mentioned on https://github.com/llvm/llvm-project/issues/54743 - I think this is because AVX512 doesn't have many ops that do i8/i16 broadcast-memory folds? Let's accept it for now.
I have a patch, but it shows a number of load folding failures instead :S
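
For reference, the broadcast-load rules stated in the quoted comments above can be sketched as a small predicate. This is only an illustration of those rules as written in this thread: `Isa` and `canBroadcastLoad()` are hypothetical names, not part of `lowerBuildVectorAsBroadcast()` or any LLVM API, and the sketch ignores AVX512 and any cost or load-folding considerations.

```cpp
// Hypothetical ISA levels for illustration; not LLVM's X86Subtarget interface.
enum class Isa { AVX1, AVX2 };

// Returns true if a scalar load of EltBits bits can be broadcast into a
// vector register of VecBits bits (128 = XMM, 256 = YMM), following the
// rules quoted in this review:
//   AVX1: i32 -> XMM/YMM, i64 -> YMM only
//   AVX2: i8/i16/i32/i64 -> XMM/YMM
constexpr bool canBroadcastLoad(Isa Level, unsigned EltBits, unsigned VecBits) {
  return (VecBits == 128 || VecBits == 256) &&
         (Level == Isa::AVX2
              ? (EltBits == 8 || EltBits == 16 || EltBits == 32 || EltBits == 64)
              : (EltBits == 32 || (EltBits == 64 && VecBits == 256)));
}

// The i16 constant in the test diff above would qualify for a YMM broadcast
// under AVX2 by these rules, but not under AVX1.
static_assert(canBroadcastLoad(Isa::AVX2, 16, 256), "AVX2 allows i16 -> YMM");
static_assert(!canBroadcastLoad(Isa::AVX1, 16, 256), "AVX1 has no i16 broadcast");
```

Whether the lowering should actually use such a broadcast is a separate question, as the comments above note.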

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D123163/new/

https://reviews.llvm.org/D123163