[all-commits] [llvm/llvm-project] 8b53fd: [X86] Custom legalize v16i64->v16i8 truncate with ...

Sun May 3 23:26:35 PDT 2020

  Branch: refs/heads/master
  Home:   https://github.com/llvm/llvm-project
  Commit: 8b53fdd3b659283c3e048668e266945b44470771
      https://github.com/llvm/llvm-project/commit/8b53fdd3b659283c3e048668e266945b44470771
  Author: Craig Topper <craig.topper at gmail.com>
  Date:   2020-05-03 (Sun, 03 May 2020)

  Changed paths:
    M llvm/lib/Target/X86/X86ISelLowering.cpp
    M llvm/test/CodeGen/X86/vector-trunc-math.ll
    M llvm/test/CodeGen/X86/vector-trunc-packus.ll
    M llvm/test/CodeGen/X86/vector-trunc-ssat.ll
    M llvm/test/CodeGen/X86/vector-trunc-usat.ll

  Log Message:
  -----------
  [X86] Custom legalize v16i64->v16i8 truncate with avx512.

Default legalization will create two v8i64 truncs to v8i32, concat
them to v16i32, and then truncate the rest of the way to v16i8.

Instead we can truncate directly from v8i64 to v8i8 in the lower
half of an xmm. Then concat the two halves to use vpunpcklqdq.
This is the same number of uops, but the dependency chain through
the uops is better since the halves are merged at the end.

I had to had SimplifyDemandedBits support for VTRUNC to prevent
a regression on vector-trunc-math.ll. combineTruncatedArithmetic
no longer gets a chance to shrink vXi64 mul so we were producing
the v8i64 multiply sequence using multiple PMULUDQs. With the
demanded bits fix we are able to prune out the extra ops leaving
just two PMULUDQs, one for each v8i64 half. This is twice the
width of the 2 v8i32 PMULLDs we had before, but PMULUDQ is 1
uop and PMULLD is 2. We also save some truncates. It's probably
worth using PMULUDQ even when PMULLQ is available since the latter
is 3 uops, but that will require a different change.

Differential Revision: https://reviews.llvm.org/D79231