[all-commits] [llvm/llvm-project] cc8392: [CVP] Canonicalize signed minmax into unsigned (#8...

Thu Feb 22 11:37:10 PST 2024

  Branch: refs/heads/users/alexey-bataev/spr/slpimprove-minbitwidth-analysis
  Home:   https://github.com/llvm/llvm-project
  Commit: cc839275164a7768451531af868fa70eb9e71cbd
      https://github.com/llvm/llvm-project/commit/cc839275164a7768451531af868fa70eb9e71cbd
  Author: Yingwei Zheng <dtcxzyw2333 at gmail.com>
  Date:   2024-02-23 (Fri, 23 Feb 2024)

  Changed paths:
    M llvm/lib/Transforms/Scalar/CorrelatedValuePropagation.cpp
    M llvm/test/Transforms/CorrelatedValuePropagation/min-max.ll

  Log Message:
  -----------
  [CVP] Canonicalize signed minmax into unsigned (#82478)

This patch turns signed minmax to unsigned to match the behavior for
signed icmps.
Alive2: https://alive2.llvm.org/ce/z/UAAM42

  Commit: 33a6ce18373ffd1457ebd54e930b6f02fe4c39c1
      https://github.com/llvm/llvm-project/commit/33a6ce18373ffd1457ebd54e930b6f02fe4c39c1
  Author: Yaxun (Sam) Liu <yaxun.liu at amd.com>
  Date:   2024-02-22 (Thu, 22 Feb 2024)

  Changed paths:
    M clang/lib/CodeGen/CGCUDANV.cpp
    M clang/lib/CodeGen/CodeGenModule.cpp
    M clang/lib/Driver/OffloadBundler.cpp
    M clang/lib/Driver/ToolChains/HIPUtility.cpp
    M clang/test/CMakeLists.txt
    M clang/test/CodeGenCUDA/device-stub.cu
    M clang/test/CodeGenCUDA/host-used-device-var.cu
    A clang/test/Driver/Inputs/hip.h
    M clang/test/Driver/clang-offload-bundler.c
    A clang/test/Driver/hip-partial-link.hip
    M clang/test/Driver/hip-toolchain-rdc.hip

  Log Message:
  -----------
  [HIP] Allow partial linking for `-fgpu-rdc` (#81700)

`-fgpu-rdc` mode allows device functions call device functions in
different TU. However, currently all device objects have to be linked
together since only one fat binary is supported. This is time consuming
for AMDGPU backend since it only supports LTO.

There are use cases that objects can be divided into groups in which
device functions are self-contained but host functions are not. It is
desirable to link/optimize/codegen the device code and generate a fatbin
for each group, whereas partially link the host code with `ld -r` or
generate a static library by using the `--emit-static-lib` option of
clang. This avoids linking all device code together, therefore decreases
the linking time for `-fgpu-rdc`.

Previously, clang emits an external symbol `__hip_fatbin` for all
objects for `-fgpu-rdc`. With this patch, clang emits an unique external
symbol `__hip_fatbin_{cuid}` for the fat binary for each object. When a
group of objects are linked together to generate a fatbin, the symbols
are merged by alias and point to the same fat binary. Each group has its
own fat binary. One executable or shared library can have multiple fat
binaries. Device linking is done for undefined fab binary symbols only
to avoid repeated linking. `__hip_gpubin_handle` is also uniquefied and
merged to avoid repeated registering. Symbol `__hip_cuid_{cuid}` is
introduced to facilitate debugging and tooling.

Fixes: https://github.com/llvm/llvm-project/issues/77018

  Commit: 1069823ce7d154aa8ef87ae5a0fd34b527eca2a0
      https://github.com/llvm/llvm-project/commit/1069823ce7d154aa8ef87ae5a0fd34b527eca2a0
  Author: Alexander Shaposhnikov <6532716+alexander-shaposhnikov at users.noreply.github.com>
  Date:   2024-02-22 (Thu, 22 Feb 2024)

  Changed paths:
    M llvm/lib/Passes/PassBuilderPipelines.cpp
    M llvm/test/Other/new-pm-defaults.ll
    M llvm/test/Other/new-pm-thinlto-postlink-defaults.ll
    M llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll
    M llvm/test/Other/new-pm-thinlto-postlink-samplepgo-defaults.ll
    M llvm/test/Other/new-pm-thinlto-prelink-defaults.ll
    M llvm/test/Other/new-pm-thinlto-prelink-pgo-defaults.ll
    M llvm/test/Other/new-pm-thinlto-prelink-samplepgo-defaults.ll

  Log Message:
  -----------
  Enable JumpTableToSwitch pass by default (#82546)

Enable JumpTableToSwitch pass by default.

Test plan: ninja check-all

  Commit: 4f7ab789bf43b49914815bdf4e4c3703f92e781d
      https://github.com/llvm/llvm-project/commit/4f7ab789bf43b49914815bdf4e4c3703f92e781d
  Author: Boian Petkantchin <boian.petkantchin at amd.com>
  Date:   2024-02-22 (Thu, 22 Feb 2024)

  Changed paths:
    M mlir/lib/Dialect/Mesh/Transforms/Spmdization.cpp
    M mlir/test/Dialect/Mesh/spmdization.mlir

  Log Message:
  -----------
  [mlir][mesh] add support in spmdization for incomplete sharding annotations (#82442)

Don't require that `mesh.shard` operations come in pairs. If there is
only a single `mesh.shard` operation we assume that the producer result
and consumer operand have the same sharding.

  Commit: 744c0057e7dc0d1d046a4867cece2f31fee9bb23
      https://github.com/llvm/llvm-project/commit/744c0057e7dc0d1d046a4867cece2f31fee9bb23
  Author: Nashe Mncube <nashe.mncube at arm.com>
  Date:   2024-02-22 (Thu, 22 Feb 2024)

  Changed paths:
    M llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
    A llvm/test/CodeGen/AArch64/16bit-float-promotion-with-nofp.ll
    M llvm/test/CodeGen/AArch64/strictfp_f16_abi_promote.ll

  Log Message:
  -----------
  [AArch64][CodeGen] Fix crash when fptrunc returns fp16 with +nofp attr (#81724)

When performing lowering of the fptrunc opcode returning fp16 with the
+nofp flag enabled we could trigger a compiler crash. This is because we
had no custom lowering implemented. This patch 
the case in which we need to promote an fp16 return type
for fptrunc when the +nofp attr is enabled.

  Commit: 6ddb25ed9ca2cb0f4ad8f402d7411ac3328f598d
      https://github.com/llvm/llvm-project/commit/6ddb25ed9ca2cb0f4ad8f402d7411ac3328f598d
  Author: Florian Mayer <fmayer at google.com>
  Date:   2024-02-22 (Thu, 22 Feb 2024)

  Changed paths:
    M compiler-rt/lib/scudo/standalone/combined.h

  Log Message:
  -----------
  [scudo] increase frames per stack to 16 for stack depot (#82427)

8 was very low and it is likely that in real workloads we have more than
an average of 8 frames per stack given on Android we have 3 at the
bottom: __start_main, __libc_init, main, and three at the top: malloc,
scudo_malloc and Allocator::allocate. That leaves 2 frames for
application code, which is clearly unreasonable.

  Commit: 242f98c7ab7c100d76cac29b555db20205619b38
      https://github.com/llvm/llvm-project/commit/242f98c7ab7c100d76cac29b555db20205619b38
  Author: Benjamin Kramer <benny.kra at googlemail.com>
  Date:   2024-02-22 (Thu, 22 Feb 2024)

  Changed paths:
    M clang/test/CodeGen/aarch64-sme-inline-streaming-attrs.c

  Log Message:
  -----------
  [Clang][SME] Skip writing output files to the source directory

  Commit: 3168af56bcb827360c26957ef579b7871dad8e17
      https://github.com/llvm/llvm-project/commit/3168af56bcb827360c26957ef579b7871dad8e17
  Author: Benjamin Kramer <benny.kra at googlemail.com>
  Date:   2024-02-22 (Thu, 22 Feb 2024)

  Changed paths:
    M llvm/test/Transforms/LoopVectorize/X86/pr72969.ll

  Log Message:
  -----------
  LoopVectorize: Mark crash test as requiring assertions

  Commit: 32994cc0d63513f77223c64148faeeb50aebb702
      https://github.com/llvm/llvm-project/commit/32994cc0d63513f77223c64148faeeb50aebb702
  Author: Alexey Bataev <5361294+alexey-bataev at users.noreply.github.com>
  Date:   2024-02-22 (Thu, 22 Feb 2024)

  Changed paths:
    M llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
    M llvm/test/Transforms/SLPVectorizer/AArch64/extractelements-to-shuffle.ll
    M llvm/test/Transforms/SLPVectorizer/AArch64/reorder-fmuladd-crash.ll
    M llvm/test/Transforms/SLPVectorizer/AArch64/tsc-s116.ll
    M llvm/test/Transforms/SLPVectorizer/AArch64/vec3-reorder-reshuffle.ll
    M llvm/test/Transforms/SLPVectorizer/X86/pr35497.ll
    M llvm/test/Transforms/SLPVectorizer/X86/reduction-transpose.ll
    M llvm/test/Transforms/SLPVectorizer/X86/reorder-clustered-node.ll
    M llvm/test/Transforms/SLPVectorizer/X86/reorder-reused-masked-gather.ll
    M llvm/test/Transforms/SLPVectorizer/X86/reorder-vf-to-resize.ll
    M llvm/test/Transforms/SLPVectorizer/X86/scatter-vectorize-reorder.ll
    M llvm/test/Transforms/SLPVectorizer/X86/shrink_after_reorder2.ll
    M llvm/test/Transforms/SLPVectorizer/X86/vec3-reorder-reshuffle.ll

  Log Message:
  -----------
  [SLP]Improve findReusedOrderedScalars and graph rotation.

Patch syncs the code in findReusedOrderedScalars with cost
estimation/codegen. It tries to use similar logic to better determine
best order.
Before, it just tried to find previously vectorized node without
checking if it is possible to use the vectorized value in the shuffle.
Now it relies on the more generalized version. If it determines, that
a single vector must be reordered (using same mechanism, as codegen and
cost estimation), it generates better order.

The comparison between new/ref ordering:

Metric: SLP.NumVectorInstructions

Program                                                                                                                                                SLP.NumVectorInstructions
                                                                                                                                                       results                   results0 diff
                                                                                               test-suite :: MultiSource/Benchmarks/nbench/nbench.test   139.00                    140.00   0.7%
                                                                             test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test   344.00                    346.00   0.6%
                                                                                        test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test  1293.00                   1292.00  -0.1%
                                                                                test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  5176.00                   5170.00  -0.1%
                                                                                        test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test  5173.00                   5167.00  -0.1%
                                                                                test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 11692.00                  11660.00  -0.3%
                                                                                     test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test  1621.00                   1615.00  -0.4%
                                                                                             test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test   795.00                    792.00  -0.4%
                                                                              test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 26499.00                  26338.00  -0.6%
                                                                                               test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  7343.00                   7281.00  -0.8%
                                                                                          test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  1104.00                   1094.00  -0.9%
                                                                                          test-suite :: MultiSource/Applications/JM/lencod/lencod.test  2216.00                   2180.00  -1.6%
                                                                                            test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test   787.00                    637.00 -19.1%

Less 0% is better.
Most of the benchmarks see more vectorized code. The first ones just
have shuffles removed.

The ordering analysis still may require some improvements (e.g. for
alternate nodes), but this one should be produce better results.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/77529

  Commit: 7f00130aa6151a9cba3cad2f85516ccb5171d7b0
      https://github.com/llvm/llvm-project/commit/7f00130aa6151a9cba3cad2f85516ccb5171d7b0
  Author: Alexey Bataev <a.bataev at outlook.com>
  Date:   2024-02-22 (Thu, 22 Feb 2024)

  Changed paths:
    M clang/lib/CodeGen/CGCUDANV.cpp
    M clang/lib/CodeGen/CodeGenModule.cpp
    M clang/lib/Driver/OffloadBundler.cpp
    M clang/lib/Driver/ToolChains/HIPUtility.cpp
    M clang/test/CMakeLists.txt
    M clang/test/CodeGen/aarch64-sme-inline-streaming-attrs.c
    M clang/test/CodeGenCUDA/device-stub.cu
    M clang/test/CodeGenCUDA/host-used-device-var.cu
    A clang/test/Driver/Inputs/hip.h
    M clang/test/Driver/clang-offload-bundler.c
    A clang/test/Driver/hip-partial-link.hip
    M clang/test/Driver/hip-toolchain-rdc.hip
    M compiler-rt/lib/scudo/standalone/combined.h
    M llvm/lib/Passes/PassBuilderPipelines.cpp
    M llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
    M llvm/lib/Transforms/Scalar/CorrelatedValuePropagation.cpp
    M llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
    A llvm/test/CodeGen/AArch64/16bit-float-promotion-with-nofp.ll
    M llvm/test/CodeGen/AArch64/strictfp_f16_abi_promote.ll
    M llvm/test/Other/new-pm-defaults.ll
    M llvm/test/Other/new-pm-thinlto-postlink-defaults.ll
    M llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll
    M llvm/test/Other/new-pm-thinlto-postlink-samplepgo-defaults.ll
    M llvm/test/Other/new-pm-thinlto-prelink-defaults.ll
    M llvm/test/Other/new-pm-thinlto-prelink-pgo-defaults.ll
    M llvm/test/Other/new-pm-thinlto-prelink-samplepgo-defaults.ll
    M llvm/test/Transforms/CorrelatedValuePropagation/min-max.ll
    M llvm/test/Transforms/LoopVectorize/X86/pr72969.ll
    M llvm/test/Transforms/SLPVectorizer/AArch64/extractelements-to-shuffle.ll
    M llvm/test/Transforms/SLPVectorizer/AArch64/reorder-fmuladd-crash.ll
    M llvm/test/Transforms/SLPVectorizer/AArch64/tsc-s116.ll
    M llvm/test/Transforms/SLPVectorizer/AArch64/vec3-reorder-reshuffle.ll
    M llvm/test/Transforms/SLPVectorizer/X86/pr35497.ll
    M llvm/test/Transforms/SLPVectorizer/X86/reduction-transpose.ll
    M llvm/test/Transforms/SLPVectorizer/X86/reorder-clustered-node.ll
    M llvm/test/Transforms/SLPVectorizer/X86/reorder-reused-masked-gather.ll
    M llvm/test/Transforms/SLPVectorizer/X86/reorder-vf-to-resize.ll
    M llvm/test/Transforms/SLPVectorizer/X86/scatter-vectorize-reorder.ll
    M llvm/test/Transforms/SLPVectorizer/X86/shrink_after_reorder2.ll
    M llvm/test/Transforms/SLPVectorizer/X86/vec3-reorder-reshuffle.ll
    M mlir/lib/Dialect/Mesh/Transforms/Spmdization.cpp
    M mlir/test/Dialect/Mesh/spmdization.mlir

  Log Message:
  -----------
  Rebase

Created using spr 1.3.5

Compare: https://github.com/llvm/llvm-project/compare/89754078a3fd...7f00130aa615

To unsubscribe from these emails, change your notification settings at https://github.com/llvm/llvm-project/settings/notifications