[PATCH] D46283: [AArch64] Set vectorizer-maximize-bandwidth as default true

Mon Apr 30 13:13:05 PDT 2018

zatrazz created this revision.
zatrazz added reviewers: fhahn, rengolin, javed.absar, huntergr, SjoerdMeijer, t.p.northover, echristo, evandro.
Herald added a subscriber: kristof.beyls.

Although this shows virtually no gain in speccpu2006 on A72:

Benchmark       Diff
400.perlbench   +1.55
401.bzip2       -1.22
403.gcc         +0.73
429.mcf         +3.00
445.gobmk       -0.39
456.hmmer       -0.90
458.sjeng       -0.41
462.libquantum  -1.91
464.h264ref      0.00
471.omnetpp     -0.64
473.astar       -0.38
483.xalancbmk    0.90
geomean:         0.04

It shows some good improvements in generic loop code where each
element is truncated to a narrower type.  For instance, take the vector
body for the following code:

  void store_i32_to_i8 (const int *src, int width, unsigned char *dst)
  {
    for (int i = 0; i < width; i++) {
      *dst++ = *src++;
    }
  }

It is currently compiled to:

---

.LBB0_4:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
  ldp     w14, w15, [x11, #-4]
  add     x11, x11, #8            // =8
  subs    x13, x13, #2            // =2
  sturb   w14, [x12, #-1]
  strb    w15, [x12], #2
  b.ne    .LBB0_4

---

With this patch, it is now compiled to:

---

.LBB0_4:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
  ldp     q0, q1, [x11, #-64]
  ldp     q2, q3, [x11, #-32]
  ldp     q4, q5, [x11]
  ldp     q6, q7, [x11, #32]
  xtn     v0.4h, v0.4s
  xtn     v2.4h, v2.4s
  xtn2    v2.8h, v3.4s
  xtn2    v0.8h, v1.4s
  xtn     v6.4h, v6.4s
  xtn     v4.4h, v4.4s
  xtn     v0.8b, v0.8h
  xtn2    v0.16b, v2.8h
  xtn2    v6.8h, v7.4s
  xtn2    v4.8h, v5.4s
  xtn     v1.8b, v4.8h
  xtn2    v1.16b, v6.8h
  add     x11, x11, #128          // =128
  subs    x13, x13, #32           // =32
  stp     q0, q1, [x12, #-16]
  add     x12, x12, #32           // =32
  b.ne    .LBB0_4

---
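
For readers who do not read NEON assembly, the patched vector body can be modelled in portable C. This is only a sketch of what the instruction sequence does, not compiler output. It processes 32 elements per iteration (matching the `subs x13, x13, #32` above) and narrows in two steps, 32 to 16 bits and then 16 to 8 bits, as the `xtn`/`xtn2` pairs do. The name `store_i32_to_i8_two_step` and the temporary `half` buffer are made up for illustration.

```c
#include <stdint.h>

/* Scalar reference: the loop from the patch description. */
void store_i32_to_i8(const int *src, int width, unsigned char *dst)
{
  for (int i = 0; i < width; i++) {
    *dst++ = *src++;
  }
}

/* Hypothetical model of the patched vector body: 32 elements per
   iteration, narrowed 32 -> 16 -> 8 bits, plus a scalar remainder. */
void store_i32_to_i8_two_step(const int *src, int width, unsigned char *dst)
{
  int i = 0;
  for (; i + 32 <= width; i += 32) {
    uint16_t half[32];
    /* First narrowing step: keep the low 16 bits of each 32-bit lane
       (what xtn/xtn2 with .4s sources produce). */
    for (int j = 0; j < 32; j++)
      half[j] = (uint16_t)(uint32_t)src[i + j];
    /* Second narrowing step: keep the low 8 bits of each 16-bit lane
       (what xtn/xtn2 with .8h sources produce). */
    for (int j = 0; j < 32; j++)
      dst[i + j] = (uint8_t)half[j];
  }
  /* Elements that do not fill a whole block of 32. */
  for (; i < width; i++)
    dst[i] = (unsigned char)src[i];
}
```

Both functions write the same bytes, since truncating to 16 bits and then to 8 bits keeps the same low 8 bits as truncating directly.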

This is an increase of about 12% in throughput in a micro-benchmark with an
array of 16777216 elements.
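
The micro-benchmark itself is not part of this message. A minimal sketch of what such a harness could look like is given below. Everything in it is an assumption: the name `bench_store_i32_to_i8`, the repetition count, the input pattern and the use of `clock()` for timing are illustrative and may differ from what was actually measured.

```c
#include <stdlib.h>
#include <time.h>

void store_i32_to_i8(const int *src, int width, unsigned char *dst)
{
  for (int i = 0; i < width; i++) {
    *dst++ = *src++;
  }
}

/* Hypothetical harness: runs the kernel `reps` times over `n` elements.
   Returns the elapsed CPU time in seconds, or -1.0 if allocation fails
   or the output does not match the expected truncation. */
double bench_store_i32_to_i8(int n, int reps)
{
  int *src = malloc((size_t)n * sizeof *src);
  unsigned char *dst = malloc((size_t)n);
  if (!src || !dst) {
    free(src);
    free(dst);
    return -1.0;
  }

  for (int i = 0; i < n; i++)
    src[i] = i;

  clock_t start = clock();
  for (int r = 0; r < reps; r++)
    store_i32_to_i8(src, n, dst);
  clock_t stop = clock();

  /* Check the result so the stores are observable. */
  int ok = 1;
  for (int i = 0; i < n; i++)
    if (dst[i] != (unsigned char)i)
      ok = 0;

  free(src);
  free(dst);
  return ok ? (double)(stop - start) / CLOCKS_PER_SEC : -1.0;
}
```

Calling `bench_store_i32_to_i8(16777216, 10)` from a `main` built once with and once without the patch would give two times to compare. A real harness would also need to keep the compiler from inlining the kernel or eliding the repeated calls, for example by putting it in a separate translation unit.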

Repository:
  rL LLVM

https://reviews.llvm.org/D46283

Files:
  lib/Target/AArch64/AArch64TargetTransformInfo.h
  test/Transforms/LoopVectorize/AArch64/aarch64-trunc-vec.ll
  test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
  test/Transforms/LoopVectorize/AArch64/reduction-small-size.ll
