[llvm] r218263 - Use broadcasts to optimize overall size when loading constant splat vectors (x86-64 with AVX or AVX2).

Filipe Cabecinhas filcab at gmail.com
Fri Apr 17 06:01:48 PDT 2015


PR23259 was opened and tracked down to this revision. It looks like we're
just not matching i64 (nor "converting" between i64 and double to match
the broadcast).

https://llvm.org/bugs/show_bug.cgi?id=23259
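
For reference, a minimal IR sketch of the shape of the problem (the
function name is made up, not taken from the PR): an i64 splat that could
in principle be selected as a broadcast, e.g. by bitcasting the i64 to
double and using vmovddup:

  define <2 x i64> @splat_i64(i64 %v) optsize {
    %ins = insertelement <2 x i64> undef, i64 %v, i32 0
    %splat = shufflevector <2 x i64> %ins, <2 x i64> undef, <2 x i32> zeroinitializer
    ret <2 x i64> %splat
  }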

  F

On Mon, Sep 22, 2014 at 7:54 PM, Sanjay Patel <spatel at rotateright.com> wrote:

> Author: spatel
> Date: Mon Sep 22 13:54:01 2014
> New Revision: 218263
>
> URL: http://llvm.org/viewvc/llvm-project?rev=218263&view=rev
> Log:
> Use broadcasts to optimize overall size when loading constant splat
> vectors (x86-64 with AVX or AVX2).
>
> We generate broadcast instructions on CPUs with AVX2 to load some
> constant splat vectors. This patch should preserve all existing
> behavior with regular optimization levels, but also use splats whenever
> possible when optimizing for *size* on any CPU with AVX or AVX2.
>
> The tradeoff is up to 5 extra instruction bytes for the broadcast
> instruction to save at least 8 bytes (up to 31 bytes) of constant pool
> data.
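>
> As a rough illustration of that math (a sketch; @size_example is a
> made-up name, and exact encodings vary with the addressing mode),
> consider a v4f64 splat like the one in the new test file:
>
>   define <4 x double> @size_example(<4 x double> %x) optsize {
>     %add = fadd <4 x double> %x, <double 1.0, double 1.0, double 1.0, double 1.0>
>     ret <4 x double> %add
>   }
>   ; AVX without this patch: vmovapd of a full 32-byte pool constant
>   ; with it: vbroadcastsd of an 8-byte pool entry, saving ~24 bytes of
>   ; pool data for a few extra instruction bytes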
>
> Differential Revision: http://reviews.llvm.org/D5347
>
>
> Added:
>     llvm/trunk/test/CodeGen/X86/splat-for-size.ll
> Modified:
>     llvm/trunk/lib/Target/X86/X86ISelLowering.cpp
>     llvm/trunk/lib/Target/X86/X86InstrSSE.td
>
> Modified: llvm/trunk/lib/Target/X86/X86ISelLowering.cpp
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86ISelLowering.cpp?rev=218263&r1=218262&r2=218263&view=diff
>
> ==============================================================================
> --- llvm/trunk/lib/Target/X86/X86ISelLowering.cpp (original)
> +++ llvm/trunk/lib/Target/X86/X86ISelLowering.cpp Mon Sep 22 13:54:01 2014
> @@ -5996,7 +5996,10 @@ static SDValue EltsFromConsecutiveLoads(
>  /// or SDValue() otherwise.
>  static SDValue LowerVectorBroadcast(SDValue Op, const X86Subtarget* Subtarget,
>                                      SelectionDAG &DAG) {
> -  if (!Subtarget->hasFp256())
> +  // VBROADCAST requires AVX.
> +  // TODO: Splats could be generated for non-AVX CPUs using SSE
> +  // instructions, but there's less potential gain for only 128-bit vectors.
> +  if (!Subtarget->hasAVX())
>      return SDValue();
>
>    MVT VT = Op.getSimpleValueType();
> @@ -6073,17 +6076,34 @@ static SDValue LowerVectorBroadcast(SDVa
>      }
>    }
>
> +  unsigned ScalarSize = Ld.getValueType().getSizeInBits();
>    bool IsGE256 = (VT.getSizeInBits() >= 256);
>
> -  // Handle the broadcasting a single constant scalar from the constant pool
> -  // into a vector. On Sandybridge it is still better to load a constant vector
> +  // When optimizing for size, generate up to 5 extra bytes for a broadcast
> +  // instruction to save 8 or more bytes of constant pool data.
> +  // TODO: If multiple splats are generated to load the same constant,
> +  // it may be detrimental to overall size. There needs to be a way to detect
> +  // that condition to know if this is truly a size win.
> +  const Function *F = DAG.getMachineFunction().getFunction();
> +  bool OptForSize = F->getAttributes().
> +    hasAttribute(AttributeSet::FunctionIndex, Attribute::OptimizeForSize);
> +
> +  // Handle broadcasting a single constant scalar from the constant pool
> +  // into a vector.
> +  // On Sandybridge (no AVX2), it is still better to load a constant vector
>    // from the constant pool and not to broadcast it from a scalar.
> -  if (ConstSplatVal && Subtarget->hasInt256()) {
> +  // But override that restriction when optimizing for size.
> +  // TODO: Check if splatting is recommended for other AVX-capable CPUs.
> +  if (ConstSplatVal && (Subtarget->hasAVX2() || OptForSize)) {
>      EVT CVT = Ld.getValueType();
>      assert(!CVT.isVector() && "Must not broadcast a vector type");
> -    unsigned ScalarSize = CVT.getSizeInBits();
>
> -    if (ScalarSize == 32 || (IsGE256 && ScalarSize == 64)) {
> +    // Splat f32, i32, v4f64, v4i64 in all cases with AVX2.
> +    // For size optimization, also splat v2f64 and v2i64, and for size opt
> +    // with AVX2, also splat i8 and i16.
> +    // With pattern matching, the VBROADCAST node may become a VMOVDDUP.
> +    if (ScalarSize == 32 || (IsGE256 && ScalarSize == 64) ||
> +        (OptForSize && (ScalarSize == 64 || Subtarget->hasAVX2()))) {
>        const Constant *C = nullptr;
>        if (ConstantSDNode *CI = dyn_cast<ConstantSDNode>(Ld))
>          C = CI->getConstantIntValue();
> @@ -6104,7 +6124,6 @@ static SDValue LowerVectorBroadcast(SDVa
>    }
>
>    bool IsLoad = ISD::isNormalLoad(Ld.getNode());
> -  unsigned ScalarSize = Ld.getValueType().getSizeInBits();
>
>    // Handle AVX2 in-register broadcasts.
>    if (!IsLoad && Subtarget->hasInt256() &&
>
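> A sketch of what the new OptForSize path enables for small scalars
> (@splat16 is a made-up name; the expectations mirror the test file
> below): with AVX2 plus optsize, an i16 splat constant is now broadcast:
>
>   define <8 x i16> @splat16(<8 x i16> %x) optsize {
>     %add = add <8 x i16> %x, <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
>     ret <8 x i16> %add
>   }
>   ; AVX2: vpbroadcastw from a 2-byte pool entry instead of a 16-byte
>   ; vector constant; plain AVX has no 16-bit broadcast and still loads
>   ; the full vector
>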
> Modified: llvm/trunk/lib/Target/X86/X86InstrSSE.td
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86InstrSSE.td?rev=218263&r1=218262&r2=218263&view=diff
>
> ==============================================================================
> --- llvm/trunk/lib/Target/X86/X86InstrSSE.td (original)
> +++ llvm/trunk/lib/Target/X86/X86InstrSSE.td Mon Sep 22 13:54:01 2014
> @@ -5290,6 +5290,13 @@ let Predicates = [HasAVX] in {
>              (VMOVDDUPYrr VR256:$src)>;
>  }
>
> +let Predicates = [UseAVX, OptForSize] in {
> +  def : Pat<(v2f64 (X86VBroadcast (loadf64 addr:$src))),
> +            (VMOVDDUPrm addr:$src)>;
> +  def : Pat<(v2i64 (X86VBroadcast (loadi64 addr:$src))),
> +            (VMOVDDUPrm addr:$src)>;
> +}
> +
>  let Predicates = [UseSSE3] in {
>    def : Pat<(X86Movddup (memopv2f64 addr:$src)),
>              (MOVDDUPrm addr:$src)>;
>
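> The new [UseAVX, OptForSize] patterns above are what map a 64-bit
> X86VBroadcast load to VMOVDDUP. A sketch of the effect (@splat64 is a
> made-up name; the expectations mirror splat_v2i64 below):
>
>   define <2 x i64> @splat64(<2 x i64> %x) optsize {
>     %add = add <2 x i64> %x, <i64 1, i64 1>
>     ret <2 x i64> %add
>   }
>   ; both AVX and AVX2 select vmovddup of an 8-byte pool entry here,
>   ; since it is one byte smaller than vpbroadcastq
>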
> Added: llvm/trunk/test/CodeGen/X86/splat-for-size.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/splat-for-size.ll?rev=218263&view=auto
>
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/splat-for-size.ll (added)
> +++ llvm/trunk/test/CodeGen/X86/splat-for-size.ll Mon Sep 22 13:54:01 2014
> @@ -0,0 +1,141 @@
> +; RUN: llc -mtriple=x86_64-unknown-unknown -mattr=avx < %s | FileCheck %s -check-prefix=CHECK --check-prefix=AVX
> +; RUN: llc -mtriple=x86_64-unknown-unknown -mattr=avx2 < %s | FileCheck %s -check-prefix=CHECK --check-prefix=AVX2
> +
> +; Check constant loads of every 128-bit and 256-bit vector type
> +; for size optimization using splat ops available with AVX and AVX2.
> +
> +; There is no AVX broadcast from double to 128-bit vector because movddup has been around since SSE3 (grrr).
> +define <2 x double> @splat_v2f64(<2 x double> %x) #0 {
> +  %add = fadd <2 x double> %x, <double 1.0, double 1.0>
> +  ret <2 x double> %add
> +; CHECK-LABEL: splat_v2f64
> +; CHECK: vmovddup
> +; CHECK: vaddpd
> +; CHECK-NEXT: retq
> +}
> +
> +define <4 x double> @splat_v4f64(<4 x double> %x) #0 {
> +  %add = fadd <4 x double> %x, <double 1.0, double 1.0, double 1.0, double 1.0>
> +  ret <4 x double> %add
> +; CHECK-LABEL: splat_v4f64
> +; CHECK: vbroadcastsd
> +; CHECK-NEXT: vaddpd
> +; CHECK-NEXT: retq
> +}
> +
> +define <4 x float> @splat_v4f32(<4 x float> %x) #0 {
> +  %add = fadd <4 x float> %x, <float 1.0, float 1.0, float 1.0, float 1.0>
> +  ret <4 x float> %add
> +; CHECK-LABEL: splat_v4f32
> +; CHECK: vbroadcastss
> +; CHECK-NEXT: vaddps
> +; CHECK-NEXT: retq
> +}
> +
> +define <8 x float> @splat_v8f32(<8 x float> %x) #0 {
> +  %add = fadd <8 x float> %x, <float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0>
> +  ret <8 x float> %add
> +; CHECK-LABEL: splat_v8f32
> +; CHECK: vbroadcastss
> +; CHECK-NEXT: vaddps
> +; CHECK-NEXT: retq
> +}
> +
> +; AVX can't do integer splats, so fake it: use vmovddup to splat 64-bit value.
> +; We also generate vmovddup for AVX2 because it's one byte smaller than vpbroadcastq.
> +define <2 x i64> @splat_v2i64(<2 x i64> %x) #0 {
> +  %add = add <2 x i64> %x, <i64 1, i64 1>
> +  ret <2 x i64> %add
> +; CHECK-LABEL: splat_v2i64
> +; CHECK: vmovddup
> +; CHECK: vpaddq
> +; CHECK-NEXT: retq
> +}
> +
> +; AVX can't do 256-bit integer ops, so we split this into two 128-bit vectors,
> +; and then we fake it: use vmovddup to splat 64-bit value.
> +define <4 x i64> @splat_v4i64(<4 x i64> %x) #0 {
> +  %add = add <4 x i64> %x, <i64 1, i64 1, i64 1, i64 1>
> +  ret <4 x i64> %add
> +; CHECK-LABEL: splat_v4i64
> +; AVX: vmovddup
> +; AVX: vpaddq
> +; AVX: vpaddq
> +; AVX2: vpbroadcastq
> +; AVX2: vpaddq
> +; CHECK: retq
> +}
> +
> +; AVX can't do integer splats, so fake it: use vbroadcastss to splat 32-bit value.
> +define <4 x i32> @splat_v4i32(<4 x i32> %x) #0 {
> +  %add = add <4 x i32> %x, <i32 1, i32 1, i32 1, i32 1>
> +  ret <4 x i32> %add
> +; CHECK-LABEL: splat_v4i32
> +; AVX: vbroadcastss
> +; AVX2: vpbroadcastd
> +; CHECK-NEXT: vpaddd
> +; CHECK-NEXT: retq
> +}
> +
> +; AVX can't do integer splats, so fake it: use vbroadcastss to splat 32-bit value.
> +define <8 x i32> @splat_v8i32(<8 x i32> %x) #0 {
> +  %add = add <8 x i32> %x, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
> +  ret <8 x i32> %add
> +; CHECK-LABEL: splat_v8i32
> +; AVX: vbroadcastss
> +; AVX: vpaddd
> +; AVX: vpaddd
> +; AVX2: vpbroadcastd
> +; AVX2: vpaddd
> +; CHECK: retq
> +}
> +
> +; AVX can't do integer splats, and there's no broadcast fakery for 16-bit. Could use pshuflw, etc?
> +define <8 x i16> @splat_v8i16(<8 x i16> %x) #0 {
> +  %add = add <8 x i16> %x, <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
> +  ret <8 x i16> %add
> +; CHECK-LABEL: splat_v8i16
> +; AVX-NOT: broadcast
> +; AVX2: vpbroadcastw
> +; CHECK: vpaddw
> +; CHECK-NEXT: retq
> +}
> +
> +; AVX can't do integer splats, and there's no broadcast fakery for 16-bit. Could use pshuflw, etc?
> +define <16 x i16> @splat_v16i16(<16 x i16> %x) #0 {
> +  %add = add <16 x i16> %x, <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
> +  ret <16 x i16> %add
> +; CHECK-LABEL: splat_v16i16
> +; AVX-NOT: broadcast
> +; AVX: vpaddw
> +; AVX: vpaddw
> +; AVX2: vpbroadcastw
> +; AVX2: vpaddw
> +; CHECK: retq
> +}
> +
> +; AVX can't do integer splats, and there's no broadcast fakery for 8-bit. Could use pshufb, etc?
> +define <16 x i8> @splat_v16i8(<16 x i8> %x) #0 {
> +  %add = add <16 x i8> %x, <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>
> +  ret <16 x i8> %add
> +; CHECK-LABEL: splat_v16i8
> +; AVX-NOT: broadcast
> +; AVX2: vpbroadcastb
> +; CHECK: vpaddb
> +; CHECK-NEXT: retq
> +}
> +
> +; AVX can't do integer splats, and there's no broadcast fakery for 8-bit. Could use pshufb, etc?
> +define <32 x i8> @splat_v32i8(<32 x i8> %x) #0 {
> +  %add = add <32 x i8> %x, <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>
> +  ret <32 x i8> %add
> +; CHECK-LABEL: splat_v32i8
> +; AVX-NOT: broadcast
> +; AVX: vpaddb
> +; AVX: vpaddb
> +; AVX2: vpbroadcastb
> +; AVX2: vpaddb
> +; CHECK: retq
> +}
> +
> +attributes #0 = { optsize }